summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-08-23f2fs: introduce f2fs_match_name() for cleanupChao Yu
This patch introduces f2fs_match_name() for cleanup. BTW, it avoids to fallback to normal comparison once it doesn't match casefolded name. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: Fix indefinite loop in f2fs_gc()Sahitya Tummala
Policy - Foreground GC, LFS and greedy GC mode. Under this policy, f2fs_gc() loops forever to GC as it doesn't have enough free segements to proceed and thus it keeps calling gc_more for the same victim segment. This can happen if the selected victim segment could not be GC'd due to failed blkaddr validity check i.e. is_alive() returns false for the blocks set in current validity map. Fix this by keeping track of such invalid segments and skip those segments for selection in get_victim_by_default() to avoid endless GC loop under such error scenarios. Currently, add this logic under CONFIG_F2FS_CHECK_FS to be able to root cause the issue in debug version. Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix wrong bitmap size] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: allocate memory in batch in build_sit_info()Chao Yu
build_sit_info() allocate all bitmaps for each segment one by one, it's quite low efficiency, this pach changes to allocate large continuous memory at a time, and divide it and assign for each bitmaps of segment. For large size image, it can expect improving its mount speed. Signed-off-by: Chen Gong <gongchen4@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: support FS_IOC_{GET,SET}FSLABELChao Yu
Support two generic fs ioctls FS_IOC_{GET,SET}FSLABEL, letting f2fs pass generic/492 testcase. Fixes were made by Eric where: - f2fs: fix buffer overruns in FS_IOC_{GET, SET}FSLABEL utf16s_to_utf8s() and utf8s_to_utf16s() take the number of characters, not the number of bytes. - f2fs: fix copying too many bytes in FS_IOC_SETFSLABEL Userspace provides a null-terminated string, so don't assume that the full FSLABEL_MAX bytes can always be copied. - f2fs: add missing authorization check in FS_IOC_SETFSLABEL FS_IOC_SETFSLABEL modifies the filesystem superblock, so it shouldn't be allowed to regular users. Require CAP_SYS_ADMIN, like xfs and btrfs do. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to avoid data corruption by forbidding SSR overwriteChao Yu
There is one case can cause data corruption. - write 4k to fileA - fsync fileA, 4k data is writebacked to lbaA - write 4k to fileA - kworker flushs 4k to lbaB; dnode contain lbaB didn't be persisted yet - write 4k to fileB - kworker flush 4k to lbaA due to SSR - SPOR -> dnode with lbaA will be recovered, however lbaA contains fileB's data One solution is tracking all fsynced file's block history, and disallow SSR overwrite on newly invalidated block on that file. However, during recovery, no matter the dnode is flushed or fsynced, all previous dnodes until last fsynced one in node chain can be recovered, that means we need to record all block change in flushed dnode, which will cause heavy cost, so let's just use simple fix by forbidding SSR overwrite directly. Fixes: 5b6c6be2d878 ("f2fs: use SSR for warm node as well") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: Fix build error while CONFIG_NLS=mYueHaibing
If CONFIG_F2FS_FS=y but CONFIG_NLS=m, building fails: fs/f2fs/file.o: In function `f2fs_ioctl': file.c:(.text+0xb86f): undefined reference to `utf16s_to_utf8s' file.c:(.text+0xe651): undefined reference to `utf8s_to_utf16s' Select CONFIG_NLS to fix this. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 61a3da4d5ef8 ("f2fs: support FS_IOC_{GET,SET}FSLABEL") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23Revert "f2fs: avoid out-of-range memory access"Chao Yu
As Pavel Machek reported: "We normally use -EUCLEAN to signal filesystem corruption. Plus, it is good idea to report it to the syslog and mark filesystem as "needing fsck" if filesystem can do that." Still we need improve the original patch with: - use unlikely keyword - add message print - return EUCLEAN However, after rethink this patch, I don't think we should add such condition check here as below reasons: - We have already checked the field in f2fs_sanity_check_ckpt(), - If there is fs corrupt or security vulnerability, there is nothing to guarantee the field is integrated after the check, unless we do the check before each of its use, however no filesystem does that. - We only have similar check for bitmap, which was added due to there is bitmap corruption happened on f2fs' runtime in product. - There are so many key fields in SB/CP/NAT did have such check after f2fs_sanity_check_{sb,cp,..}. So I propose to revert this unneeded check. This reverts commit 56f3ce675103e3fb9e631cfb4131fc768bc23e9a. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: cleanup the code in build_sit_entries.Lihong Kou
We do not need to set the SBI_NEED_FSCK flag in the error paths, if we return error here, we will not update the checkpoint flag, so the code is useless, just remove it. Signed-off-by: Lihong Kou <koulihong@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix wrong available node count calculationChao Yu
In mkfs, we have counted quota file's node number in cp.valid_node_count, so we have to avoid wrong substraction of quota node number in .available_nid/.avail_node_count calculation. f2fs_write_check_point_pack() { .. set_cp(valid_node_count, 1 + c.quota_inum + c.lpf_inum); Fixes: 292c196a3695 ("f2fs: reserve nid resource for quota sysfile") Fixes: 7b63f72f73af ("f2fs: fix to do sanity check on valid node/block count") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: remove duplicate code in f2fs_file_write_iterLihong Kou
We will do the same check in generic_write_checks. if (iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT) return -EINVAL; just remove the same check in f2fs_file_write_iter. Signed-off-by: Lihong Kou <koulihong@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to migrate blocks correctly during defragmentChao Yu
During defragment, we missed to trigger fragmented blocks migration for below condition: In defragment region: - total number of valid blocks is smaller than 512; - the tail part of the region are all holes; In addtion, return zero to user via range->len if there is no fragmented blocks. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: use wrapped f2fs_cp_error()Chao Yu
Just cleanup, no logic change. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to use more generic EOPNOTSUPPChao Yu
EOPNOTSUPP is widely used as error number indicating operation is not supported in syscall, and ENOTSUPP was defined and only used for NFSv3 protocol, so use EOPNOTSUPP instead. Fixes: 0a2aa8fbb969 ("f2fs: refactor __exchange_data_block for speed up") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: use wrapped IS_SWAPFILE()Chao Yu
Just cleanup, no logic change. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: Support case-insensitive file name lookupsDaniel Rosenberg
Modeled after commit b886ee3e778e ("ext4: Support case-insensitive file name lookups") """ This patch implements the actual support for case-insensitive file name lookups in f2fs, based on the feature bit and the encoding stored in the superblock. A filesystem that has the casefold feature set is able to configure directories with the +F (F2FS_CASEFOLD_FL) attribute, enabling lookups to succeed in that directory in a case-insensitive fashion, i.e: match a directory entry even if the name used by userspace is not a byte per byte match with the disk name, but is an equivalent case-insensitive version of the Unicode string. This operation is called a case-insensitive file name lookup. The feature is configured as an inode attribute applied to directories and inherited by its children. This attribute can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. * dcache handling: For a +F directory, F2Fs only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite which equivalent string was used in a previous lookup, without having to resort to ->lookup(). d_hash() of casefolded directories is implemented as the hash of the casefolded string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare() uses the utf8_strncasecmp() infrastructure, which handles the comparison of equivalent, same case, names as well. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetched from the disk is a byte-per-byte match with what the user requested, making this implementation 'name-preserving'. i.e. no actual information is lost when writing to storage. DX is supported by modifying the hashes used in +F directories to make them case/encoding-aware. The new disk hashes are calculated as the hash of the full casefolded string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide an exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that case-insensitive file name lookup will not work only for that file. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that optional bit is set, any attempt to create a file name using an invalid UTF-8 sequence will fail and return an error to userspace. * Normalization algorithm: The UTF-8 algorithms used to compare strings in f2fs is implemented in fs/unicode, and is based on a previous version developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 12.1, or higher, combined with the elimination of ignorable code points (NFDi) and full case-folding (CF) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method for F2FS because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like compatibility decompositions. Although: - This implementation is not completely linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. """ Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: include charset encoding information in the superblockDaniel Rosenberg
Add charset encoding to f2fs to support casefolding. It is modeled after the same feature introduced in commit c83ad55eaa91 ("ext4: include charset encoding information in the superblock") Currently this is not compatible with encryption, similar to the current ext4 imlpementation. This will change in the future. >From the ext4 patch: """ The s_encoding field stores a magic number indicating the encoding format and version used globally by file and directory names in the filesystem. The s_encoding_flags defines policies for using the charset encoding, like how to handle invalid sequences. The magic number is mapped to the exact charset table, but the mapping is specific to ext4. Since we don't have any commitment to support old encodings, the only encoding I am supporting right now is utf8-12.1.0. The current implementation prevents the user from enabling encoding and per-directory encryption on the same filesystem at the same time. The incompatibility between these features lies in how we do efficient directory searches when we cannot be sure the encryption of the user provided fname will match the actual hash stored in the disk without decrypting every directory entry, because of normalization cases. My quickest solution is to simply block the concurrent use of these features for now, and enable it later, once we have a better solution. """ Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to avoid call kvfree under spinlockChao Yu
vfree() don't wish to be called from interrupt context, move it out of spin_lock_irqsave() coverage. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23fs: f2fs: Remove unnecessary checks of SM_I(sbi) in update_general_status()Jia-Ju Bai
In fill_super() and put_super(), f2fs_destroy_stats() is called in prior to f2fs_destroy_segment_manager(), so if current sbi can still be visited in global stat list, SM_I(sbi) should be released yet. For this reason, SM_I(sbi) does not need to be checked in update_general_status(). Thank Chao Yu for advice. Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: disallow direct IO in atomic writeChao Yu
Atomic write needs page cache to cache data of transaction, direct IO should never be allowed in atomic write, detect and deny it when open atomic write file. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to handle quota_{on,off} correctlyChao Yu
With quota_ino feature on, generic/232 reports an inconsistence issue on the image. The root cause is that the testcase tries to: - use quotactl to shutdown journalled quota based on sysfile; - and then use quotactl to enable/turn on quota based on specific file (aquota.user or aquota.group). Eventually, quota sysfile will be out-of-update due to following specific file creation. Change as below to fix this issue: - deny enabling quota based on specific file if quota sysfile exists. - set SBI_QUOTA_NEED_REPAIR once sysfile based quota shutdowns via ioctl. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to detect cp error in f2fs_setxattr()Chao Yu
It needs to return -EIO if filesystem has been shutdown, fix the miss case in f2fs_setxattr(). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to spread f2fs_is_checkpoint_ready()Chao Yu
We missed to call f2fs_is_checkpoint_ready() in several places, it may allow space allocation even when free space was exhausted during checkpoint is disabled, fix to add them. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: support fiemap() for directory inodeChao Yu
Adjust f2fs_fiemap() to support fiemap() on directory inode. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to avoid discard command leakChao Yu
============================================================================= BUG discard_cmd (Tainted: G B OE ): Objects remaining in discard_cmd on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0xffffe1ac481d22c0 objects=36 used=2 fp=0xffff936b4748bf50 flags=0x2ffff0000000100 Call Trace: dump_stack+0x63/0x87 slab_err+0xa1/0xb0 __kmem_cache_shutdown+0x183/0x390 shutdown_cache+0x14/0x110 kmem_cache_destroy+0x195/0x1c0 f2fs_destroy_segment_manager_caches+0x21/0x40 [f2fs] exit_f2fs_fs+0x35/0x641 [f2fs] SyS_delete_module+0x155/0x230 ? vtime_user_exit+0x29/0x70 do_syscall_64+0x6e/0x160 entry_SYSCALL64_slow_path+0x25/0x25 INFO: Object 0xffff936b4748b000 @offset=0 INFO: Object 0xffff936b4748b070 @offset=112 kmem_cache_destroy discard_cmd: Slab cache still has objects Call Trace: dump_stack+0x63/0x87 kmem_cache_destroy+0x1b4/0x1c0 f2fs_destroy_segment_manager_caches+0x21/0x40 [f2fs] exit_f2fs_fs+0x35/0x641 [f2fs] SyS_delete_module+0x155/0x230 do_syscall_64+0x6e/0x160 entry_SYSCALL64_slow_path+0x25/0x25 Recovery can cache discard commands, so in error path of fill_super(), we need give a chance to handle them, otherwise it will lead to leak of discard_cmd slab cache. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to avoid tagging SBI_QUOTA_NEED_REPAIR incorrectlyChao Yu
On a quota disabled image, with fault injection, SBI_QUOTA_NEED_REPAIR will be set incorrectly in error path of f2fs_evict_inode(), fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix to drop meta/node pages during umountChao Yu
As reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204193 A null pointer dereference bug is triggered in f2fs under kernel-5.1.3. kasan_report.cold+0x5/0x32 f2fs_write_end_io+0x215/0x650 bio_endio+0x26e/0x320 blk_update_request+0x209/0x5d0 blk_mq_end_request+0x2e/0x230 lo_complete_rq+0x12c/0x190 blk_done_softirq+0x14a/0x1a0 __do_softirq+0x119/0x3e5 irq_exit+0x94/0xe0 call_function_single_interrupt+0xf/0x20 During umount, we will access NULL sbi->node_inode pointer in f2fs_write_end_io(): f2fs_bug_on(sbi, page->mapping == NODE_MAPPING(sbi) && page->index != nid_of_node(page)); The reason is if disable_checkpoint mount option is on, meta dirty pages can remain during umount, and then be flushed by iput() of meta_inode, however node_inode has been iput()ed before meta_inode's iput(). Since checkpoint is disabled, all meta/node datas are useless and should be dropped in next mount, so in umount, let's adjust drop_inode() to give a hint to iput_final() to drop all those dirty datas correctly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: disallow switching io_bits option during remountChao Yu
If IO alignment feature is turned on after remount, we didn't initialize mempool of it, it turns out we will encounter panic during IO submission due to access NULL mempool pointer. This feature should be set only at mount time, so simply deny configuring during remount. This fixes bug reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204135 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: fix panic of IO alignment featureChao Yu
Since 07173c3ec276 ("block: enable multipage bvecs"), one bio vector can store multi pages, so that we can not calculate max IO size of bio as PAGE_SIZE * bio->bi_max_vecs. However IO alignment feature of f2fs always has that assumption, so finally, it may cause panic during IO submission as below stack. kernel BUG at fs/f2fs/data.c:317! RIP: 0010:__submit_merged_bio+0x8b0/0x8c0 Call Trace: f2fs_submit_page_write+0x3cd/0xdd0 do_write_page+0x15d/0x360 f2fs_outplace_write_data+0xd7/0x210 f2fs_do_write_data_page+0x43b/0xf30 __write_data_page+0xcf6/0x1140 f2fs_write_cache_pages+0x3ba/0xb40 f2fs_write_data_pages+0x3dd/0x8b0 do_writepages+0xbb/0x1e0 __writeback_single_inode+0xb6/0x800 writeback_sb_inodes+0x441/0x910 wb_writeback+0x261/0x650 wb_workfn+0x1f9/0x7a0 process_one_work+0x503/0x970 worker_thread+0x7d/0x820 kthread+0x1ad/0x210 ret_from_fork+0x35/0x40 This patch adds one extra condition to check left space in bio while trying merging page to bio, to avoid panic. This bug was reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204043 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-23f2fs: introduce {page,io}_is_mergeable() for readabilityChao Yu
Wrap merge condition into function for readability, no logic change. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-08-22xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOTDarrick J. Wong
Benjamin Moody reported to Debian that XFS partially wedges when a chgrp fails on account of being out of disk quota. I ran his reproducer script: # adduser dummy # adduser dummy plugdev # dd if=/dev/zero bs=1M count=100 of=test.img # mkfs.xfs test.img # mount -t xfs -o gquota test.img /mnt # mkdir -p /mnt/dummy # chown -c dummy /mnt/dummy # xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt (and then as user dummy) $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo $ chgrp plugdev /mnt/dummy/foo and saw: ================================================ WARNING: lock held when returning to user space! 5.3.0-rc5 #rc5 Tainted: G W ------------------------------------------------ chgrp/47006 is leaving the kernel with locks still held! 1 lock held by chgrp/47006: #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs] ...which is clearly caused by xfs_setattr_nonsize failing to unlock the ILOCK after the xfs_qm_vop_chown_reserve call fails. Add the missing unlock. Reported-by: benjamin.moody@gmail.com Fixes: 253f4911f297 ("xfs: better xfs_trans_alloc interface") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Salvatore Bonaccorso <carnil@debian.org>
2019-08-22ext4: rework reserved cluster accounting when invalidating pagesEric Whitney
The goal of this patch is to remove two references to the buffer delay bit in ext4_da_page_release_reservation() as part of a larger effort to remove all such references from ext4. These two references are principally used to reduce the reserved block/cluster count when pages are invalidated as a result of truncating, punching holes, or collapsing a block range in a file. The entire function is removed and replaced with code in ext4_es_remove_extent() that reduces the reserved count as a side effect of removing a block range from delayed and not unwritten extents in the extent status tree as is done when truncating, punching holes, or collapsing ranges. The code is written to minimize the number of searches descending from rb tree roots for scalability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-22ext4: treat buffers with write errors as containing valid dataZhangXiaoxu
I got some errors when I repair an ext4 volume which stacked by an iscsi target: Entry 'test60' in / (2) has deleted/unused inode 73750. Clear? It can be reproduced when the network not good enough. When I debug this I found ext4 will read entry buffer from disk and the buffer is marked with write_io_error. If the buffer is marked with write_io_error, it means it already wroten to journal, and not checked out to disk. IOW, the journal is newer than the data in disk. If this journal record 'delete test60', it means the 'test60' still on the disk metadata. In this case, if we read the buffer from disk successfully and create file continue, the new journal record will overwrite the journal which record 'delete test60', then the entry corruptioned. So, use the buffer rather than read from disk if the buffer is marked with write_io_error. Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-08-22ext4: fix warning inside ext4_convert_unwritten_extents_endioRakesh Pandit
Really enable warning when CONFIG_EXT4_DEBUG is set and fix missing first argument. This was introduced in commit ff95ec22cd7f ("ext4: add warning to ext4_convert_unwritten_extents_endio") and splitting extents inside endio would trigger it. Fixes: ff95ec22cd7f ("ext4: add warning to ext4_convert_unwritten_extents_endio") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2019-08-22io_uring: add need_resched() check in inner poll loopJens Axboe
The outer poll loop checks for whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work. Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y. Reported-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-22Merge tag 'afs-fixes-20190822' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: - Fix a cell record leak due to the default error not being cleared. - Fix an oops in tracepoint due to a pointer that may contain an error. - Fix the ACL storage op for YFS where the wrong op definition is being used. By luck, this only actually affects the information appearing in traces. * tag 'afs-fixes-20190822' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: use correct afs_call_type in yfs_fs_store_opaque_acl2 afs: Fix possible oops in afs_lookup trace event afs: Fix leak in afs_lookup_cell_rcu()
2019-08-22ubifs: Limit the number of pages in shrink_liabilityLiu Song
If the number of dirty pages to be written back is large, then writeback_inodes_sb will block waiting for a long time, causing hung task detection alarm. Therefore, we should limit the maximum number of pages written back this time, which let the budget be completed faster. The remaining dirty pages tend to rely on the writeback mechanism to complete the synchronization. Fixes: b6e51316daed ("writeback: separate starting of sync vs opportunistic writeback") Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-08-22ubifs: Correctly initialize c->min_log_bytesRichard Weinberger
Currently on a freshly mounted UBIFS, c->min_log_bytes is 0. This can lead to a log overrun and make commits fail. Recent kernels will report the following assert: UBIFS assert failed: c->lhead_lnum != c->ltail_lnum, in fs/ubifs/log.c:412 c->min_log_bytes can have two states, 0 and c->leb_size. It controls how much bytes of the log area are reserved for non-bud nodes such as commit nodes. After a commit it has to be set to c->leb_size such that we have always enough space for a commit. While a commit runs it can be 0 to make the remaining bytes of the log available to writers. Having it set to 0 right after mount is wrong since no space for commits is reserved. Fixes: 1e51764a3c2ac ("UBIFS: add new flash file system") Reported-and-tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-08-22ubifs: Fix double unlock around orphan_delete()Richard Weinberger
We unlock after orphan_delete(), so no need to unlock in the function too. Reported-by: Han Xu <han.xu@nxp.com> Fixes: 8009ce956c3d ("ubifs: Don't leak orphans on memory during commit") Signed-off-by: Richard Weinberger <richard@nod.at>
2019-08-22NFS: Have nfs4_proc_get_lease_time() call nfs4_call_sync_custom()Anna Schumaker
This removes some code duplication, since both functions were doing the same thing. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22NFS: Have nfs41_proc_secinfo_no_name() call nfs4_call_sync_custom()Anna Schumaker
We need to use the custom rpc_task_setup here to set the RPC_TASK_NO_ROUND_ROBIN flag on the RPC call. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22NFS: Have nfs41_proc_reclaim_complete() call nfs4_call_sync_custom()Anna Schumaker
An async call followed by an rpc_wait_for_completion() is basically the same as a synchronous call, so we can use nfs4_call_sync_custom() to keep our custom callback ops and the RPC_TASK_NO_ROUND_ROBIN flag. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22NFS: Have _nfs4_proc_secinfo() call nfs4_call_sync_custom()Anna Schumaker
We do this to set the RPC_TASK_NO_ROUND_ROBIN flag in the task_setup structure Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22NFS: Have nfs4_proc_setclientid() call nfs4_call_sync_custom()Anna Schumaker
Rather than running the task manually Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22NFS: Add an nfs4_call_sync_custom() functionAnna Schumaker
There are a few cases where we need to manually configure the rpc_task_setup structure to get the behavior we want. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22afs: use correct afs_call_type in yfs_fs_store_opaque_acl2YueHaibing
It seems that 'yfs_RXYFSStoreOpaqueACL2' should be use in yfs_fs_store_opaque_acl2(). Fixes: f5e4546347bc ("afs: Implement YFS ACL setting") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-22afs: Fix possible oops in afs_lookup trace eventMarc Dionne
The afs_lookup trace event can cause the following: [ 216.576777] BUG: kernel NULL pointer dereference, address: 000000000000023b [ 216.576803] #PF: supervisor read access in kernel mode [ 216.576813] #PF: error_code(0x0000) - not-present page ... [ 216.576913] RIP: 0010:trace_event_raw_event_afs_lookup+0x9e/0x1c0 [kafs] If the inode from afs_do_lookup() is an error other than ENOENT, or if it is ENOENT and afs_try_auto_mntpt() returns an error, the trace event will try to dereference the error pointer as a valid pointer. Use IS_ERR_OR_NULL to only pass a valid pointer for the trace, or NULL. Ideally the trace would include the error value, but for now just avoid the oops. Fixes: 80548b03991f ("afs: Add more tracepoints") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-22afs: Fix leak in afs_lookup_cell_rcu()David Howells
Fix a leak on the cell refcount in afs_lookup_cell_rcu() due to non-clearance of the default error in the case a NULL cell name is passed and the workstation default cell is used. Also put a bit at the end to make sure we don't leak a cell ref if we're going to be returning an error. This leak results in an assertion like the following when the kafs module is unloaded: AFS: Assertion failed 2 == 1 is false 0x2 == 0x1 is false ------------[ cut here ]------------ kernel BUG at fs/afs/cell.c:770! ... RIP: 0010:afs_manage_cells+0x220/0x42f [kafs] ... process_one_work+0x4c2/0x82c ? pool_mayday_timeout+0x1e1/0x1e1 ? do_raw_spin_lock+0x134/0x175 worker_thread+0x336/0x4a6 ? rescuer_thread+0x4af/0x4af kthread+0x1de/0x1ee ? kthread_park+0xd4/0xd4 ret_from_fork+0x24/0x30 Fixes: 989782dcdc91 ("afs: Overhaul cell database management") Signed-off-by: David Howells <dhowells@redhat.com>
2019-08-22ceph: don't try fill file_lock on unsuccessful GETFILELOCK replyJeff Layton
When ceph_mdsc_do_request returns an error, we can't assume that the filelock_reply pointer will be set. Only try to fetch fields out of the r_reply_info when it returns success. Cc: stable@vger.kernel.org Reported-by: Hector Martin <hector@marcansoft.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-08-22ceph: clear page dirty before invalidate pageErqi Chen
clear_page_dirty_for_io(page) before mapping->a_ops->invalidatepage(). invalidatepage() clears page's private flag, if dirty flag is not cleared, the page may cause BUG_ON failure in ceph_set_page_dirty(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/40862 Signed-off-by: Erqi Chen <chenerqi@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-08-22ceph: fix buffer free while holding i_ceph_lock in fill_inode()Luis Henriques
Calling ceph_buffer_put() in fill_inode() may result in freeing the i_xattrs.blob buffer while holding the i_ceph_lock. This can be fixed by postponing the call until later, when the lock is released. The following backtrace was triggered by fstests generic/070. BUG: sleeping function called from invalid context at mm/vmalloc.c:2283 in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4 6 locks held by kworker/0:4/3852: #0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0 #1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0 #2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476 #3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476 #4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476 #5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70 CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 Workqueue: ceph-msgr ceph_con_workfn Call Trace: dump_stack+0x67/0x90 ___might_sleep.cold+0x9f/0xb1 vfree+0x4b/0x60 ceph_buffer_release+0x1b/0x60 fill_inode.isra.0+0xa9b/0xf70 ceph_fill_trace+0x13b/0xc70 ? dispatch+0x2eb/0x1476 dispatch+0x320/0x1476 ? __mutex_unlock_slowpath+0x4d/0x2a0 ceph_con_workfn+0xc97/0x2ec0 ? process_one_work+0x1b8/0x5f0 process_one_work+0x244/0x5f0 worker_thread+0x4d/0x3e0 kthread+0x105/0x140 ? process_one_work+0x5f0/0x5f0 ? kthread_park+0x90/0x90 ret_from_fork+0x3a/0x50 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>