summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-03-30f2fs: don't trigger data flush in foreground operationChao Yu
Data flush can generate heavy IO and cause long latency during flush, so it's not appropriate to trigger it in foreground operation. And also, we may face below potential deadlock during data flush: - f2fs_write_multi_pages - f2fs_write_raw_pages - f2fs_write_single_data_page - f2fs_balance_fs - f2fs_balance_fs_bg - f2fs_sync_dirty_inodes - filemap_fdatawrite -- stuck on flush same cluster Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix NULL pointer dereference in f2fs_write_begin()Chao Yu
BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:f2fs_write_begin+0x823/0xb90 [f2fs] Call Trace: f2fs_quota_write+0x139/0x1d0 [f2fs] write_blk+0x36/0x80 [quota_tree] get_free_dqblk+0x42/0xa0 [quota_tree] do_insert_tree+0x235/0x4a0 [quota_tree] do_insert_tree+0x26e/0x4a0 [quota_tree] do_insert_tree+0x26e/0x4a0 [quota_tree] do_insert_tree+0x26e/0x4a0 [quota_tree] qtree_write_dquot+0x70/0x190 [quota_tree] v2_write_dquot+0x43/0x90 [quota_v2] dquot_acquire+0x77/0x100 f2fs_dquot_acquire+0x2f/0x60 [f2fs] dqget+0x310/0x450 dquot_transfer+0x7e/0x120 f2fs_setattr+0x11a/0x4a0 [f2fs] notify_change+0x349/0x480 chown_common+0x168/0x1c0 do_fchownat+0xbc/0xf0 __x64_sys_fchownat+0x20/0x30 do_syscall_64+0x5f/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Passing fsdata parameter to .write_{begin,end} in f2fs_quota_write(), so that if quota file is compressed one, we can avoid above NULL pointer dereference when updating quota content. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: clean up f2fs_may_encrypt()Chao Yu
Merge below two conditions into f2fs_may_encrypt() for cleanup - IS_ENCRYPTED() - DUMMY_ENCRYPTION_ENABLED() Check IS_ENCRYPTED(inode) condition in f2fs_init_inode_metadata() is enough since we have already set encrypt flag in f2fs_new_inode(). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix to avoid potential deadlockChao Yu
We should always check F2FS_I(inode)->cp_task condition in prior to other conditions in __should_serialize_io() to avoid deadloop described in commit 040d2bb318d1 ("f2fs: fix to avoid deadloop if data_flush is on"), however we break this rule when we support compression, fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: don't change inode status under page lockChao Yu
In order to shrink page lock coverage. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: fix potential deadlock on compressed quota fileChao Yu
generic/232 reports below deadlock: fsstress D 0 96980 96969 0x00084000 Call Trace: schedule+0x4a/0xb0 io_schedule+0x12/0x40 __lock_page+0x127/0x1d0 pagecache_get_page+0x1d8/0x250 prepare_compress_overwrite+0xe0/0x490 [f2fs] f2fs_prepare_compress_overwrite+0x5d/0x80 [f2fs] f2fs_write_begin+0x833/0xb90 [f2fs] f2fs_quota_write+0x145/0x1e0 [f2fs] write_blk+0x36/0x80 [quota_tree] do_insert_tree+0x2ac/0x4a0 [quota_tree] do_insert_tree+0x26e/0x4a0 [quota_tree] qtree_write_dquot+0x70/0x190 [quota_tree] v2_write_dquot+0x43/0x90 [quota_v2] dquot_acquire+0x77/0x100 f2fs_dquot_acquire+0x2f/0x60 [f2fs] dqget+0x310/0x450 dquot_transfer+0xb2/0x120 f2fs_setattr+0x11a/0x4a0 [f2fs] notify_change+0x349/0x480 chown_common+0x168/0x1c0 do_fchownat+0xbc/0xf0 __x64_sys_lchown+0x21/0x30 do_syscall_64+0x5f/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 task PC stack pid father kworker/u256:0 D 0 103444 2 0x80084000 Workqueue: writeback wb_workfn (flush-251:1) Call Trace: schedule+0x4a/0xb0 schedule_timeout+0x15e/0x2f0 io_schedule_timeout+0x19/0x40 congestion_wait+0x7e/0x120 f2fs_write_multi_pages+0x12a/0x840 [f2fs] f2fs_write_cache_pages+0x48f/0x790 [f2fs] f2fs_write_data_pages+0x2db/0x330 [f2fs] do_writepages+0x1a/0x60 __writeback_single_inode+0x3d/0x340 writeback_sb_inodes+0x225/0x4a0 wb_writeback+0xf7/0x320 wb_workfn+0xba/0x470 process_one_work+0x16c/0x3f0 worker_thread+0x4c/0x440 kthread+0xf8/0x130 ret_from_fork+0x35/0x40 fsstress D 0 5277 5266 0x00084000 Call Trace: schedule+0x4a/0xb0 rwsem_down_write_slowpath+0x29d/0x540 block_operations+0x105/0x360 [f2fs] f2fs_write_checkpoint+0x101/0x1010 [f2fs] f2fs_sync_fs+0xa8/0x130 [f2fs] f2fs_do_sync_file+0x1ad/0x890 [f2fs] do_fsync+0x38/0x60 __x64_sys_fdatasync+0x13/0x20 do_syscall_64+0x5f/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The root cause is there is potential deadlock between quota data update and writeback. Kworker Thread B Thread C - f2fs_write_cache_pages - lock whole cluster --- A - f2fs_write_multi_pages - f2fs_write_raw_pages - f2fs_write_single_data_page - f2fs_do_write_data_page - f2fs_setattr - f2fs_lock_op --- B - f2fs_write_checkpoint - block_operations - f2fs_lock_all --- B - dquot_transfer - f2fs_quota_write - f2fs_prepare_compress_overwrite - pagecache_get_page --- A - f2fs_trylock_op failed --- B - congestion_wait - goto rewrite To fix this issue, during quota file writeback, just redirty all pages left in cluster rather holding pages' lock in cluster and looping retrying lock cp_rwsem. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: delete DIO read lockDongDongJu
This lock can be a contention with multi 4k random read IO with single inode. example) fio --output=test --name=test --numjobs=60 --filename=/media/samsung960pro/file_test --rw=randread --bs=4k --direct=1 --time_based --runtime=7 --ioengine=libaio --iodepth=256 --group_reporting --size=10G With this commit, it remove that possible lock contention. Signed-off-by: Dongjoo Seo <commisori28@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30f2fs: don't mark compressed inode dirty during f2fs_iget()Chao Yu
- f2fs_iget - do_read_inode - set_inode_flag(, FI_COMPRESSED_FILE) - __mark_inode_dirty_flag(, true) It's unnecessary, so let's just mark compressed inode dirty while compressed inode conversion. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-03-30NFS: Ensure security label is set for root inodeScott Mayhew
When using NFSv4.2, the security label for the root inode should be set via a call to nfs_setsecurity() during the mount process, otherwise the inode will appear as unlabeled for up to acdirmin seconds. Currently the label for the root inode is allocated, retrieved, and freed entirely witin nfs4_proc_get_root(). Add a field for the label to the nfs_fattr struct, and allocate & free the label in nfs_get_root(), where we also add a call to nfs_setsecurity(). Note that for the call to nfs_setsecurity() to succeed, it's necessary to also move the logic calling security_sb_{set,clone}_security() from nfs_get_tree_common() down into nfs_get_root()... otherwise the SBLABEL_MNT flag will not be set in the super_block's security flags and nfs_setsecurity() will silently fail. Reported-by: Richard Haines <richard_c_haines@btinternet.com> Signed-off-by: Scott Mayhew <smayhew@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Tested-by: Stephen Smalley <sds@tycho.nsa.gov> [PM: fixed 80-char line width problems] Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-03-30Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - Continued user-access cleanups in the futex code. - percpu-rwsem rewrite that uses its own waitqueue and atomic_t instead of an embedded rwsem. This addresses a couple of weaknesses, but the primary motivation was complications on the -rt kernel. - Introduce raw lock nesting detection on lockdep (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal lock differences. This too originates from -rt. - Reuse lockdep zapped chain_hlocks entries, to conserve RAM footprint on distro-ish kernels running into the "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep chain-entries pool. - Misc cleanups, smaller fixes and enhancements - see the changelog for details" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits) fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t Documentation/locking/locktypes: Minor copy editor fixes Documentation/locking/locktypes: Further clarifications and wordsmithing m68knommu: Remove mm.h include from uaccess_no.h x86: get rid of user_atomic_cmpxchg_inatomic() generic arch_futex_atomic_op_inuser() doesn't need access_ok() x86: don't reload after cmpxchg in unsafe_atomic_op2() loop x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end() objtool: whitelist __sanitizer_cov_trace_switch() [parisc, s390, sparc64] no need for access_ok() in futex handling sh: no need of access_ok() in arch_futex_atomic_op_inuser() futex: arch_futex_atomic_op_inuser() calling conventions change completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() lockdep: Add posixtimer context tracing bits lockdep: Annotate irq_work lockdep: Add hrtimer context tracing bits lockdep: Introduce wait-type checks completion: Use simple wait queues sched/swait: Prepare usage in completions ...
2020-03-30Merge branch 'efi-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The EFI changes in this cycle are much larger than usual, for two (positive) reasons: - The GRUB project is showing signs of life again, resulting in the introduction of the generic Linux/UEFI boot protocol, instead of x86 specific hacks which are increasingly difficult to maintain. There's hope that all future extensions will now go through that boot protocol. - Preparatory work for RISC-V EFI support. The main changes are: - Boot time GDT handling changes - Simplify handling of EFI properties table on arm64 - Generic EFI stub cleanups, to improve command line handling, file I/O, memory allocation, etc. - Introduce a generic initrd loading method based on calling back into the firmware, instead of relying on the x86 EFI handover protocol or device tree. - Introduce a mixed mode boot method that does not rely on the x86 EFI handover protocol either, and could potentially be adopted by other architectures (if another one ever surfaces where one execution mode is a superset of another) - Clean up the contents of 'struct efi', and move out everything that doesn't need to be stored there. - Incorporate support for UEFI spec v2.8A changes that permit firmware implementations to return EFI_UNSUPPORTED from UEFI runtime services at OS runtime, and expose a mask of which ones are supported or unsupported via a configuration table. - Partial fix for the lack of by-VA cache maintenance in the decompressor on 32-bit ARM. - Changes to load device firmware from EFI boot service memory regions - Various documentation updates and minor code cleanups and fixes" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) efi/libstub/arm: Fix spurious message that an initrd was loaded efi/libstub/arm64: Avoid image_base value from efi_loaded_image partitions/efi: Fix partition name parsing in GUID partition entry efi/x86: Fix cast of image argument efi/libstub/x86: Use ULONG_MAX as upper bound for all allocations efi: Fix a mistype in comments mentioning efivar_entry_iter_begin() efi/libstub: Avoid linking libstub/lib-ksyms.o into vmlinux efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map() efi/x86: Ignore the memory attributes table on i386 efi/x86: Don't relocate the kernel unless necessary efi/x86: Remove extra headroom for setup block efi/x86: Add kernel preferred address to PE header efi/x86: Decompress at start of PE image load address x86/boot/compressed/32: Save the output address instead of recalculating it efi/libstub/x86: Deal with exit() boot service returning x86/boot: Use unsigned comparison for addresses efi/x86: Avoid using code32_start efi/x86: Make efi32_pe_entry() more readable efi/x86: Respect 32-bit ABI in efi32_pe_entry() efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA ...
2020-03-30Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes in this cycle were: - Make kfree_rcu() use kfree_bulk() for added performance - RCU updates - Callback-overload handling updates - Tasks-RCU KCSAN and sparse updates - Locking torture test and RCU torture test updates - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Make rcu_barrier() account for offline no-CBs CPUs rcu: Mark rcu_state.gp_seq to detect concurrent writes Documentation/memory-barriers: Fix typos doc: Add rcutorture scripting to torture.txt doc/RCU/rcu: Use https instead of http if possible doc/RCU/rcu: Use absolute paths for non-rst files doc/RCU/rcu: Use ':ref:' for links to other docs doc/RCU/listRCU: Update example function name doc/RCU/listRCU: Fix typos in a example code snippets doc/RCU/Design: Remove remaining HTML tags in ReST files doc: Add some more RCU list patterns in the kernel rcutorture: Set KCSAN Kconfig options to detect more data races rcutorture: Manually clean up after rcu_barrier() failure rcutorture: Make rcu_torture_barrier_cbs() post from corresponding CPU rcuperf: Measure memory footprint during kfree_rcu() test rcutorture: Annotation lockless accesses to rcu_torture_current rcutorture: Add READ_ONCE() to rcu_torture_count and rcu_torture_batch rcutorture: Fix stray access to rcu_fwd_cb_nodelay rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race rcutorture: Make kvm-find-errors.sh abort on bad directory ...
2020-03-30ubifs: Fix out-of-bounds memory access caused by abnormal value of node_lenLiu Song
In “ubifs_check_node”, when the value of "node_len" is abnormal, the code will goto label of "out_len" for execution. Then, in the following "ubifs_dump_node", if inode type is "UBIFS_DATA_NODE", in "print_hex_dump", an out-of-bounds access may occur due to the wrong "ch->len". Therefore, when the value of "node_len" is abnormal, data length should to be adjusted to a reasonable safe range. At this time, structured data is not credible, so dump the corrupted data directly for analysis. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-30ubifs: ubifs_add_orphan: Fix a memory leak bugZhihao Cheng
Memory leak occurs when files with extended attributes are added to orphan list. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Fixes: 988bec41318f3fa897e2f8 ("ubifs: orphan: Handle xattrs like files") Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-30ubifs: ubifs_jnl_write_inode: Fix a memory leak bugZhihao Cheng
When inodes with extended attributes are evicted, xent is not freed in one exit branch. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Fixes: 9ca2d732644484488db3112 ("ubifs: Limit number of xattrs per inode") Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-30Merge tag 'driver-core-5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "big" set of driver core changes for 5.7-rc1. Nothing huge in here, just lots of little firmware core changes and use of new apis, a libfs fix, a debugfs api change, and some driver core deferred probe rework. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (44 commits) Revert "driver core: Set fw_devlink to "permissive" behavior by default" driver core: Set fw_devlink to "permissive" behavior by default driver core: Replace open-coded list_last_entry() driver core: Read atomic counter once in driver_probe_done() libfs: fix infoleak in simple_attr_read() driver core: Add device links from fwnode only for the primary device platform/x86: touchscreen_dmi: Add info for the Chuwi Vi8 Plus tablet platform/x86: touchscreen_dmi: Add EFI embedded firmware info support Input: icn8505 - Switch to firmware_request_platform for retreiving the fw Input: silead - Switch to firmware_request_platform for retreiving the fw selftests: firmware: Add firmware_request_platform tests test_firmware: add support for firmware_request_platform firmware: Add new platform fallback mechanism and firmware_request_platform() Revert "drivers: base: power: wakeup.c: Use built-in RCU list checking" drivers: base: power: wakeup.c: Use built-in RCU list checking component: allow missing unbind callback debugfs: remove return value of debugfs_create_file_size() debugfs: Check module state before warning in {full/open}_proxy_open() firmware: fix a double abort case with fw_load_sysfs_fallback arch_topology: Fix putting invalid cpu clk ...
2020-03-30ubifs: Fix ubifs_tnc_lookup() usage in do_kill_orphans()Richard Weinberger
Orphans are allowed to point to deleted inodes. So -ENOENT is not a fatal error. Reported-by: Кочетков Максим <fido_max@inbox.ru> Reported-and-tested-by: "Christian Berger" <Christian.Berger@de.bosch.com> Tested-by: Karl Olsen <karl@micro-technic.com> Tested-by: Jef Driesen <jef.driesen@niko.eu> Fixes: ee1438ce5dc4 ("ubifs: Check link count of inodes when killing orphans.") Signed-off-by: Richard Weinberger <richard@nod.at>
2020-03-30Merge tag 'pstore-v5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "These mostly some minor cleanups and a bug fix for an ftrace corner case: - Improve failure paths (chenqiwu) - Fix ftrace position index (Vasily Averin) - Use proper flexible-array member (Gustavo A. R. Silva)" * tag 'pstore-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Replace zero-length array with flexible-array member pstore: pstore_ftrace_seq_next should increase position index pstore/ram: remove unnecessary ramoops_unregister_dummy() pstore/platform: fix potential mem leak if pstore_init_fs failed
2020-03-30Merge tag 'erofs-for-5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "Updates with a XArray adaptation, several fixes for shrinker and corrupted images are ready for this cycle. All commits have been stress tested with no noticeable smoke out and have been in linux-next as well. Summary: - Convert radix tree usage to XArray - Fix shrink scan count on multiple filesystem instances - Better handling for specific corrupted images - Update my email address in MAINTAINERS" * tag 'erofs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: MAINTAINERS: erofs: update my email address erofs: handle corrupted images whose decompressed size less than it'd be erofs: use LZ4_decompress_safe() for full decoding erofs: correct the remaining shrink objects erofs: convert workstn to XArray
2020-03-30Merge tag 'docs-5.7' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "This has been a busy cycle for documentation work. Highlights include: - Lots of RST conversion work by Mauro, Daniel ALmeida, and others. Maybe someday we'll get to the end of this stuff...maybe... - Some organizational work to bring some order to the core-api manual. - Various new docs and additions to the existing documentation. - Typo fixes, warning fixes, ..." * tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits) Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT MAINTAINERS: adjust to filesystem doc ReST conversion docs: deprecated.rst: Add BUG()-family doc: zh_CN: add translation for virtiofs doc: zh_CN: index files in filesystems subdirectory docs: locking: Drop :c:func: throughout docs: locking: Add 'need' to hardirq section docs: conf.py: avoid thousands of duplicate label warning on Sphinx docs: prevent warnings due to autosectionlabel docs: fix reference to core-api/namespaces.rst docs: fix pointers to io-mapping.rst and io_ordering.rst files Documentation: Better document the softlockup_panic sysctl docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref docs: perf: imx-ddr.rst: get rid of a warning docs: filesystems: fuse.rst: supress a Sphinx warning docs: translations: it: avoid duplicate refs at programming-language.rst docs: driver.rst: supress two ReSt warnings docs: trace: events.rst: convert some new stuff to ReST format Documentation: Add io_ordering.rst to driver-api manual Documentation: Add io-mapping.rst to driver-api manual ...
2020-03-30Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Here are the io_uring changes for this merge window. Light on new features this time around (just splice + buffer selection), lots of cleanups, fixes, and improvements to existing support. In particular, this contains: - Cleanup fixed file update handling for stack fallback (Hillf) - Re-work of how pollable async IO is handled, we no longer require thread offload to handle that. Instead we rely using poll to drive this, with task_work execution. - In conjunction with the above, allow expendable buffer selection, so that poll+recv (for example) no longer has to be a split operation. - Make sure we honor RLIMIT_FSIZE for buffered writes - Add support for splice (Pavel) - Linked work inheritance fixes and optimizations (Pavel) - Async work fixes and cleanups (Pavel) - Improve io-wq locking (Pavel) - Hashed link write improvements (Pavel) - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)" * tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits) io_uring: cleanup io_alloc_async_ctx() io_uring: fix missing 'return' in comment io-wq: handle hashed writes in chains io-uring: drop 'free_pfile' in struct io_file_put io-uring: drop completion when removing file io_uring: Fix ->data corruption on re-enqueue io-wq: close cancel gap for hashed linked work io_uring: make spdxcheck.py happy io_uring: honor original task RLIMIT_FSIZE io-wq: hash dependent work io-wq: split hashing and enqueueing io-wq: don't resched if there is no work io-wq: remove duplicated cancel code io_uring: fix truncated async read/readv and write/writev retry io_uring: dual license io_uring.h uapi header io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled io_uring: Fix unused function warnings io_uring: add end-of-bits marker and build time verify it io_uring: provide means of removing buffers io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG ...
2020-03-30Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Online capacity resizing (Balbir) - Number of hardware queue change fixes (Bart) - null_blk fault injection addition (Bart) - Cleanup of queue allocation, unifying the node/no-node API (Christoph) - Cleanup of genhd, moving code to where it makes sense (Christoph) - Cleanup of the partition handling code (Christoph) - disk stat fixes/improvements (Konstantin) - BFQ improvements (Paolo) - Various fixes and improvements * tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits) block: return NULL in blk_alloc_queue() on error block: move bio_map_* to blk-map.c Revert "blkdev: check for valid request queue before issuing flush" block: simplify queue allocation bcache: pass the make_request methods to blk_queue_make_request null_blk: use blk_mq_init_queue_data block: add a blk_mq_init_queue_data helper block: move the ->devnode callback to struct block_device_operations block: move the part_stat* helpers from genhd.h to a new header block: move block layer internals out of include/linux/genhd.h block: move guard_bio_eod to bio.c block: unexport get_gendisk block: unexport disk_map_sector_rcu block: unexport disk_get_part block: mark part_in_flight and part_in_flight_rw static block: mark block_depr static block: factor out requeue handling from dispatch code block/diskstats: replace time_in_queue with sum of request times block/diskstats: accumulate all per-cpu counters in one pass block/diskstats: more accurate approximation of io_ticks for slow disks ...
2020-03-30gfs2: Fix oversight in gfs2_ail1_flushBob Peterson
Ordinarily, function gfs2_ail1_start_one issues a write request for one item on the ail1 list, then returns -EBUSY. This makes the caller, gfs2_ail1_flush, loop around and start another. However, it was not clearing the -EBUSY return code each time through the loop. So on rare occasions, like when the wbc runs out of nr_to_write, it remained set to -EBUSY, which triggered an error and withdraw. This patch sets the return code to 0 each time through the restart loop so this won't happen anymore. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2020-03-30ceph: fix snapshot directory timestampsLuis Henriques
The .snap directory timestamps are kept at 0 (1970-01-01 00:00), which isn't consistent with what the fuse client does. This patch makes the behaviour consistent, by setting these timestamps (atime, btime, ctime, mtime) to those of the parent directory. Cc: Marc Roos <M.Roos@f1-outsourcing.eu> Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: wait for async creating inode before requesting new max sizeYan, Zheng
ceph_check_caps() can't request new max size for async creating inode. This may make ceph_get_caps() loop busily until getting reply of the async create. Also, wait for async creating reply before calling ceph_renew_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: don't skip updating wanted caps when cap is staleYan, Zheng
1. try_get_cap_refs() fails to get caps and finds that mds_wanted does not include what it wants. It returns -ESTALE. 2. ceph_get_caps() calls ceph_renew_caps(). ceph_renew_caps() finds that inode has cap, so it calls ceph_check_caps(). 3. ceph_check_caps() finds that issued caps (without checking if it's stale) already includes caps wanted by open file, so it skips updating wanted caps. Above events can cause an infinite loop inside ceph_get_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: request new max size only when there is auth capYan, Zheng
When there is no auth cap, check_max_size() can't do anything and may cause an infinite loop inside ceph_get_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: cleanup return error of try_get_cap_refs()Yan, Zheng
Returns 0 if caps were not able to be acquired (yet), 1 if cap acquisition succeeded, or a negative error code. There are 3 special error codes: -EAGAIN: need to sleep but non-blocking is specified -EFBIG: ask caller to call check_max_size() and try again. -ESTALE: ask caller to call ceph_renew_caps() and try again. [ jlayton: add WARN_ON_ONCE check for -EAGAIN ] Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: return ceph_mdsc_do_request() errors from __get_parent()Qiujun Huang
Return the error returned by ceph_mdsc_do_request(). Otherwise, r_target_inode ends up being NULL this ends up returning ENOENT regardless of the error. Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: check all mds' caps after page writebackYan, Zheng
If an inode has caps from multiple mds's, the following can happen: - non-auth mds revokes Fsc. Fcb is used, so page writeback is queued. - when writeback finishes, ceph_check_caps() is called with auth only flag. ceph_check_caps() invalidates pagecache, but skips checking any non-auth caps. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: update i_requested_max_size only when sending cap msg to auth mdsYan, Zheng
Non-auth mds can't do anything to 'update max' cap message. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: simplify calling of ceph_get_fmode()Yan, Zheng
Originally, calling ceph_get_fmode() for open files is by thread that handles request reply. There is a small window between updating caps and and waking the request initiator. We need to prevent ceph_check_caps() from releasing wanted caps in the window. Previous patches made fill_inode() call __ceph_touch_fmode() for open file requests. This prevented ceph_check_caps() from releasing wanted caps for 'caps_wanted_delay_min' seconds, enough for request initiator to get woken up and call ceph_get_fmode(). This allows us to now call ceph_get_fmode() in ceph_open() instead. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: remove delay check logic from ceph_check_caps()Yan, Zheng
__ceph_caps_file_wanted() already checks 'caps_wanted_delay_min' and 'caps_wanted_delay_max'. There is no need to duplicate the logic in ceph_check_caps() and __send_cap() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: consider inode's last read/write when calculating wanted capsYan, Zheng
Add i_last_rd and i_last_wr to ceph_inode_info. These fields are used to track the last time the client acquired read/write caps for the inode. If there is no read/write on an inode for 'caps_wanted_delay_max' seconds, __ceph_caps_file_wanted() does not request caps for read/write even there are open files. Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted() calculates dir's wanted caps according to last dir read/modification. If there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If there is recent dir modification, also wants CEPH_CAP_FILE_EXCL. Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after readdir, as with that, modifications do not need to release CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: always renew caps if mds_wanted is insufficientYan, Zheng
Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag, which indicates that mds has closed the session and caps were dropped. Remove this flag in preparation for not requesting caps for idle open files. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: update dentry lease for async createYan, Zheng
Otherwise ceph_d_delete() may return 1 for the dentry, which makes dput() prune the dentry and clear parent dir's complete flag. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: attempt to do async create when possibleJeff Layton
With the Octopus release, the MDS will hand out directory create caps. If we have Fxc caps on the directory, and complete directory information or a known negative dentry, then we can return without waiting on the reply, allowing the open() call to return very quickly to userland. We use the normal ceph_fill_inode() routine to fill in the inode, so we have to gin up some reply inode information with what we'd expect the newly-created inode to have. The client assumes that it has a full set of caps on the new inode, and that the MDS will revoke them when there is conflicting access. This functionality is gated on the wsync/nowsync mount options. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: cache layout in parent dir on first sync createJeff Layton
If a create is done, then typically we'll end up writing to the file soon afterward. We don't want to wait for the reply before doing that when doing an async create, so that means we need the layout for the new file before we've gotten the response from the MDS. All files created in a directory will initially inherit the same layout, so copy off the requisite info from the first synchronous create in the directory, and save it in a new i_cached_layout field. Zero out the layout when we lose Dc caps in the dir. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: add new MDS req field to hold delegated inode numberJeff Layton
Add new request field to hold the delegated inode number. Encode that into the message when it's set. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: decode interval_sets for delegated inosJeff Layton
Starting in Octopus, the MDS will hand out caps that allow the client to do asynchronous file creates under certain conditions. As part of that, the MDS will delegate ranges of inode numbers to the client. Add the infrastructure to decode these ranges, and stuff them into an xarray for later consumption by the async creation code. Because the xarray code currently only handles unsigned long indexes, and those are 32-bits on 32-bit arches, we only enable the decoding when running on a 64-bit arch. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: make ceph_fill_inode non-staticJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: perform asynchronous unlink if we have sufficient capsJeff Layton
The MDS is getting a new lock-caching facility that will allow it to cache the necessary locks to allow asynchronous directory operations. Since the CEPH_CAP_FILE_* caps are currently unused on directories, we can repurpose those bits for this purpose. When performing an unlink, if we have Fx on the parent directory, and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being removed is the primary link, then then we can fire off an unlink request immediately and don't need to wait on reply before returning. In that situation, just fix up the dcache and link count and return immediately after issuing the call to the MDS. This does mean that we need to hold an extra reference to the inode being unlinked, and extra references to the caps to avoid races. Those references are put and error handling is done in the r_callback routine. If the operation ends up failing, then set a writeback error on the directory inode, and the inode itself that can be fetched later by an fsync on the dir. The behavior of dir caps is slightly different from caps on normal files. Because these are just considered an optimization, if the session is reconnected, we will not automatically reclaim them. They are instead considered lost until we do another synchronous op in the parent directory. Async dirops are enabled via the "nowsync" mount option, which is patterned after the xfs "wsync" mount option. For now, the default is "wsync", but eventually we may flip that. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: don't take refs to want mask unless we have all bitsYan, Zheng
If we don't have all of the cap bits for the want mask in try_get_cap_refs, then just take refs on the need bits. Signed-off-by: "Yan, Zheng" <ukernel@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: cap tracking for async directory operationsJeff Layton
Track and correctly handle directory caps for asynchronous operations. Add aliases for Frc caps that we now designate at Dcu caps (when dealing with directories). Unlike file caps, we don't reclaim these when the session goes away, and instead preemptively release them. In-flight async dirops are instead handled during reconnect phase. The client needs to re-do a synchronous operation in order to re-get directory caps. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: make __take_cap_refs non-staticJeff Layton
Rename it to ceph_take_cap_refs and make it available to other files. Also replace a comment with a lockdep assertion. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: add infrastructure for waiting for async create to completeJeff Layton
When we issue an async create, we must ensure that any later on-the-wire requests involving it wait for the create reply. Expand i_ceph_flags to be an unsigned long, and add a new bit that MDS requests can wait on. If the bit is set in the inode when sending caps, then don't send it and just return that it has been delayed. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: track primary dentry linkJeff Layton
Newer versions of the MDS will flag a dentry as "primary". In later patches, we'll need to consult this info, so track it in di->flags. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: add flag to designate that a request is asynchronousJeff Layton
...and ensure that such requests are never queued. The MDS has need to know that a request is asynchronous so add flags and proper infrastructure for that. Also, delegated inode numbers and directory caps are associated with the session, so ensure that async requests are always transmitted on the first attempt and are never queued to wait for session reestablishment. If it does end up looking like we'll need to queue the request, then have it return -EJUKEBOX so the caller can reattempt with a synchronous request. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: more caps.c lockdep assertionsJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: clean up kick_flushing_inode_caps()Jeff Layton
The last thing that this function does is release i_ceph_lock, so have the caller do that instead. Add a lockdep assertion to ensure that the function is always called with i_ceph_lock held. Change the prototype to take a ceph_inode_info pointer and drop the separate mdsc argument as we can get that from the session. While at it, make it non-static. We'll need this to kick any flushing caps once the create reply comes in. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>