summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-03-04fs: Make splice() and tee() take into account O_NONBLOCK flag on pipesSlavomir Kaslev
The current implementation of splice() and tee() ignores O_NONBLOCK set on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for blocking on pipe arguments. This is inconsistent since splice()-ing from/to non-pipe file descriptors does take O_NONBLOCK into consideration. Fix this by promoting O_NONBLOCK, when set on a pipe, to SPLICE_F_NONBLOCK. Some context for how the current implementation of splice() leads to inconsistent behavior. In the ongoing work[1] to add VM tracing capability to trace-cmd we stream tracing data over named FIFOs or vsockets from guests back to the host. When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on the input file descriptor and set SPLICE_F_NONBLOCK for the next call to splice(). If splice() was blocked waiting on data from the input FIFO, after SIGINT splice() restarts with the same arguments (no SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no data is available. This differs from the splice() behavior when reading from a vsocket or when we're doing a traditional read()/write() loop (trace-cmd's --nosplice argument). With this patch applied we get the same behavior in all situations after setting O_NONBLOCK which also matches the behavior of doing a read()/write() loop instead of splice(). This change does have potential of breaking users who don't expect EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2]. [1] https://github.com/skaslev/trace-cmd/tree/vsock [2] https://github.com/torvalds/linux/blob/d47e3da1759230e394096fd742aad423c291ba48/fs/read_write.c#L1425 Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2019-03-04Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "Assorted fixes that sat in -next for a while, all over the place" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: Fix locking in aio_poll() exec: Fix mem leak in kernel_read_file copy_mount_string: Limit string length to PATH_MAX cgroup: saner refcounting for cgroup_root fix cgroup_do_mount() handling of failure exits
2019-03-04get rid of legacy 'get_ds()' functionLinus Torvalds
Every in-kernel use of this function defined it to KERNEL_DS (either as an actual define, or as an inline function). It's an entirely historical artifact, and long long long ago used to actually read the segment selector valueof '%ds' on x86. Which in the kernel is always KERNEL_DS. Inspired by a patch from Jann Horn that just did this for a very small subset of users (the ones in fs/), along with Al who suggested a script. I then just took it to the logical extreme and removed all the remaining gunk. Roughly scripted with git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/' git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d' plus manual fixups to remove a few unusual usage patterns, the couple of inline function cases and to fix up a comment that had become stale. The 'get_ds()' function remains in an x86 kvm selftest, since in user space it actually does something relevant. Inspired-by: Jann Horn <jannh@google.com> Inspired-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-04aio: simplify - and fix - fget/fput for io_submit()Linus Torvalds
Al Viro root-caused a race where the IOCB_CMD_POLL handling of fget/fput() could cause us to access the file pointer after it had already been freed: "In more details - normally IOCB_CMD_POLL handling looks so: 1) io_submit(2) allocates aio_kiocb instance and passes it to aio_poll() 2) aio_poll() resolves the descriptor to struct file by req->file = fget(iocb->aio_fildes) 3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that aio_kiocb to 2 (bumps by 1, that is). 4) aio_poll() calls vfs_poll(). After sanity checks (basically, "poll_wait() had been called and only once") it locks the queue. That's what the extra reference to iocb had been for - we know we can safely access it. 5) With queue locked, we check if ->woken has already been set to true (by aio_poll_wake()) and, if it had been, we unlock the queue, drop a reference to aio_kiocb and bugger off - at that point it's a responsibility to aio_poll_wake() and the stuff called/scheduled by it. That code will drop the reference to file in req->file, along with the other reference to our aio_kiocb. 6) otherwise, we see whether we need to wait. If we do, we unlock the queue, drop one reference to aio_kiocb and go away - eventual wakeup (or cancel) will deal with the reference to file and with the other reference to aio_kiocb 7) otherwise we remove ourselves from waitqueue (still under the queue lock), so that wakeup won't get us. No async activity will be happening, so we can safely drop req->file and iocb ourselves. If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb won't get freed under us, so we can do all the checks and locking safely. And we don't touch ->file if we detect that case. However, vfs_poll() most certainly *does* touch the file it had been given. So wakeup coming while we are still in ->poll() might end up doing fput() on that file. That case is not too rare, and usually we are saved by the still present reference from descriptor table - that fput() is not the final one. But if another thread closes that descriptor right after our fget() and wakeup does happen before ->poll() returns, we are in trouble - final fput() done while we are in the middle of a method: Al also wrote a patch to take an extra reference to the file descriptor to fix this, but I instead suggested we just streamline the whole file pointer handling by submit_io() so that the generic aio submission code simply keeps the file pointer around until the aio has completed. Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL") Acked-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-039p: use inode->i_lock to protect i_size_write() under 32-bitHou Tao
Use inode->i_lock to protect i_size_write(), else i_size_read() in generic_fillattr() may loop infinitely in read_seqcount_begin() when multiple processes invoke v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under 32-bit SMP environment, and a soft lockup will be triggered as show below: watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217] Modules linked in: CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4 Hardware name: Generic DT based system PC is at generic_fillattr+0x104/0x108 LR is at 0xec497f00 pc : [<802b8898>] lr : [<ec497f00>] psr: 200c0013 sp : ec497e20 ip : ed608030 fp : ec497e3c r10: 00000000 r9 : ec497f00 r8 : ed608030 r7 : ec497ebc r6 : ec497f00 r5 : ee5c1550 r4 : ee005780 r3 : 0000052d r2 : 00000000 r1 : ec497f00 r0 : ed608030 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: ac48006a DAC: 00000051 CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4 Hardware name: Generic DT based system Backtrace: [<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24) [<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc) [<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20) [<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8) [<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380) [<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0) [<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64) [<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c) [<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc) [<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48) [<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240) [<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44) [<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4) [<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88) [<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98) [<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4) [<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c) [<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48) [<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec) [<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78) [<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28) [dominique.martinet@cea.fr: updated comment to not refer to a function in another subsystem] Link: http://lkml.kernel.org/r/20190124063514.8571-2-houtao1@huawei.com Cc: stable@vger.kernel.org Fixes: 7549ae3e81cc ("9p: Use the i_size_[read, write]() macros instead of using inode->i_size directly.") Reported-by: Xing Gaopeng <xingaopeng@huawei.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2019-03-02NFSv4.1: Bump the default callback session slot count to 16Trond Myklebust
Users can still control this value explicitly using the max_session_cb_slots module parameter, but let's bump the default up to 16 for now. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Clean up mirror DS initialisationTrond Myklebust
Get rid of the redundant parameter and rename the function ff_layout_mirror_valid() to ff_layout_init_mirror_ds() for clarity. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Remove dead code in ff_layout_mirror_valid()Trond Myklebust
nfs4_ff_alloc_deviceid_node() guarantees that if mirror->mirror_ds is a valid pointer, then so is mirror->mirror_ds->ds. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfile: Simplify nfs4_ff_layout_select_ds_stateid()Trond Myklebust
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfile: Simplify nfs4_ff_layout_ds_version()Trond Myklebust
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Simplify ff_layout_get_ds_cred()Trond Myklebust
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Simplify nfs4_ff_find_or_create_ds_client()Trond Myklebust
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Simplify nfs4_ff_layout_select_ds_fh()Trond Myklebust
Pass in a pointer to the mirror rather than having to retrieve it from the array and then verify the resulting pointer. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Speed up read failover when DSes are downTrond Myklebust
If we notice that a DS may be down, we should attempt to read from the other mirrors first before we go back to retry the dead DS. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Don't invalidate DS deviceids for being unresponsiveTrond Myklebust
If the DS is unresponsive, we want to just mark it as such, while reporting the errors. If the server later returns the same deviceid in a new layout, then we don't want to have to look it up again. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Remove bogus checks for invalid deviceidsTrond Myklebust
We already check the deviceids before we start the RPC call. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Avoid unnecessary layout invalidationsTrond Myklebust
In ff_layout_mirror_valid() we may not want to invalidate the layout segment despite the call to GETDEVICEINFO failing. The reason is that a read may still be able to make progress on another mirror. So instead we let the caller (in this case nfs4_ff_layout_prepare_ds()) decide whether or not it needs to invalidate. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: refactor calls to fs4_ff_layout_prepare_ds()Trond Myklebust
While we may want to skip attempting to connect to a downed mirror when we're deciding which mirror to select for a read, we do not want to do so once we've committed to attempting the I/O in ff_layout_read/write_pagelist(), or ff_layout_initiate_commit() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4: Handle early exit in layoutget by returning an errorTrond Myklebust
If the LAYOUTGET rpc call exits early without an error, convert it to EAGAIN. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Send LAYOUTERROR when failing over mirrored readsTrond Myklebust
When a read to the preferred mirror returns an error, the flexfiles driver records the error in the inode list and currently marks the layout for return before failing over the attempted read to the next mirror. What we actually want to do is fire off a LAYOUTERROR to notify the MDS that there is an issue with the preferred mirror, then we fail over. Only once we've failed to read from all mirrors should we return the layout. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01dax: Flush partial PMDs correctlyMatthew Wilcox
The radix tree would rewind the index in an iterator to the lowest index of a multi-slot entry. The XArray iterators instead leave the index unchanged, but I overlooked that when converting DAX from the radix tree to the XArray. Adjust the index that we use for flushing to the start of the PMD range. Fixes: c1901cd33cf4 ("page cache: Convert find_get_entries_tag to XArray") Cc: <stable@vger.kernel.org> Reported-by: Piotr Balcer <piotr.balcer@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-03-01NFSv4.2: Add client support for the generic 'layouterror' RPC callTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4/flexfiles: Abort I/O early if the layout segment was invalidatedTrond Myklebust
If a layout segment gets invalidated while a pNFS I/O operation is queued for transmission, then we ideally want to abort immediately. This is particularly the case when there is a large number of I/O related RPCs queued in the RPC layer, and the layout segment gets invalidated due to an ENOSPC error, or an EACCES (because the client was fenced). We may end up forced to spam the MDS with a lot of otherwise unnecessary LAYOUTERRORs after that I/O fails. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4/pnfs: Fix barriers in nfs4_mark_deviceid_unavailable()Trond Myklebust
Fix the memory barriers in nfs4_mark_deviceid_unavailable() and nfs4_test_deviceid_unavailable(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS/flexfiles: Fix up sparse RCU annotationsTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE()Trond Myklebust
If the attempt to instantiate the mirror's layout DS pointer failed, then that pointer may hold a value of type ERR_PTR(), so we need to check that before we dereference it. Fixes: 65990d1afbd2d ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFS: Add missing encode / decode sequence_maxsz to v4.2 operationsAnna Schumaker
These really should have been there from the beginning, but we never noticed because there was enough slack in the RPC request for the extra bytes. Chuck's recent patch to use au_cslack and au_rslack to compute buffer size shrunk the buffer enough that this was now a problem for SEEK operations on my test client. Fixes: f4ac1674f5da4 ("nfs: Add ALLOCATE support") Fixes: 2e72448b07dc3 ("NFS: Add COPY nfs operation") Fixes: cb95deea0b4aa ("NFS OFFLOAD_CANCEL xdr") Fixes: 624bd5b7b683c ("nfs: Add DEALLOCATE support") Fixes: 1c6dcbe5ceff8 ("NFS: Implement SEEK") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4.1: Don't process the sequence op more than once.Trond Myklebust
Ensure that if we call nfs41_sequence_process() a second time for the same rpc_task, then we only process the results once. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4.1: Reinitialise sequence results before retransmitting a requestTrond Myklebust
If we have to retransmit a request, we should ensure that we reinitialise the sequence results structure, since in the event of a signal we need to treat the request as if it had not been sent. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org
2019-03-01Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "2 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: hugetlbfs: fix races and page leaks during migration kasan: turn off asan-stack for clang-8 and earlier
2019-03-01hugetlbfs: fix races and page leaks during migrationMike Kravetz
hugetlb pages should only be migrated if they are 'active'. The routines set/clear_page_huge_active() modify the active state of hugetlb pages. When a new hugetlb page is allocated at fault time, set_page_huge_active is called before the page is locked. Therefore, another thread could race and migrate the page while it is being added to page table by the fault code. This race is somewhat hard to trigger, but can be seen by strategically adding udelay to simulate worst case scheduling behavior. Depending on 'how' the code races, various BUG()s could be triggered. To address this issue, simply delay the set_page_huge_active call until after the page is successfully added to the page table. Hugetlb pages can also be leaked at migration time if the pages are associated with a file in an explicitly mounted hugetlbfs filesystem. For example, consider a two node system with 4GB worth of huge pages available. A program mmaps a 2G file in a hugetlbfs filesystem. It then migrates the pages associated with the file from one node to another. When the program exits, huge page counts are as follows: node0 1024 free_hugepages 1024 nr_hugepages node1 0 free_hugepages 1024 nr_hugepages Filesystem Size Used Avail Use% Mounted on nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool That is as expected. 2G of huge pages are taken from the free_hugepages counts, and 2G is the size of the file in the explicitly mounted filesystem. If the file is then removed, the counts become: node0 1024 free_hugepages 1024 nr_hugepages node1 1024 free_hugepages 1024 nr_hugepages Filesystem Size Used Avail Use% Mounted on nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool Note that the filesystem still shows 2G of pages used, while there actually are no huge pages in use. The only way to 'fix' the filesystem accounting is to unmount the filesystem If a hugetlb page is associated with an explicitly mounted filesystem, this information in contained in the page_private field. At migration time, this information is not preserved. To fix, simply transfer page_private from old to new page at migration time if necessary. There is a related race with removing a huge page from a file and migration. When a huge page is removed from the pagecache, the page_mapping() field is cleared, yet page_private remains set until the page is actually freed by free_huge_page(). A page could be migrated while in this state. However, since page_mapping() is not set the hugetlbfs specific routine to transfer page_private is not called and we leak the page count in the filesystem. To fix that, check for this condition before migrating a huge page. If the condition is detected, return EBUSY for the page. Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> [mike.kravetz@oracle.com: v2] Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com [mike.kravetz@oracle.com: update comment and changelog] Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-01xfs: fix reporting supported extra file attributes for statx()Luis R. Rodriguez
statx(2) notes that any attribute that is not indicated as supported by stx_attributes_mask has no usable value. Commit 5f955f26f3d42d ("xfs: report crtime and attribute flags to statx") added support for informing userspace of extra file attributes but forgot to list these flags as supported making reporting them rather useless for the pedantic userspace author. $ git describe --contains 5f955f26f3d42d04aba65590a32eb70eedb7f37d v4.11-rc6~5^2^2~2 Fixes: 5f955f26f3d42d ("xfs: report crtime and attribute flags to statx") Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: add a comment reminding people to keep attributes_mask up to date] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-03-01jbd2: jbd2_get_transaction does not need to return a valueLiu Song
In jbd2_get_transaction, a new transaction is initialized, and set to the j_running_transaction. No need for a return value, so remove it. Also, adjust some comments to match the actual operation of this function. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-03-01jbd2: fix invalid descriptor block checksumluojiajun
In jbd2_journal_commit_transaction(), if we are in abort mode, we may flush the buffer without setting descriptor block checksum by goto start_journal_io. Then fs is mounted, jbd2_descriptor_block_csum_verify() failed. [ 271.379811] EXT4-fs (vdd): shut down requested (2) [ 271.381827] Aborting journal on device vdd-8. [ 271.597136] JBD2: Invalid checksum recovering block 22199 in log [ 271.598023] JBD2: recovery failed [ 271.598484] EXT4-fs (vdd): error loading journal Fix this problem by keep setting descriptor block checksum if the descriptor buffer is not NULL. This checksum problem can be reproduced by xfstests generic/388. Signed-off-by: luojiajun <luojiajun3@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-28ext4: fix bigalloc cluster freeing when hole punching under loadEric Whitney
Ext4 may not free clusters correctly when punching holes in bigalloc file systems under high load conditions. If it's not possible to extend and restart the journal in ext4_ext_rm_leaf() when preparing to remove blocks from a punched region, a retry of the entire punch operation is triggered in ext4_ext_remove_space(). This causes a partial cluster to be set to the first cluster in the extent found to the right of the punched region. However, if the punch operation prior to the retry had made enough progress to delete one or more extents and a partial cluster candidate for freeing had already been recorded, the retry would overwrite the partial cluster. The loss of this information makes it impossible to correctly free the original partial cluster in all cases. This bug can cause generic/476 to fail when run as part of xfstests-bld's bigalloc and bigalloc_1k test cases. The failure is reported when e2fsck detects bad iblocks counts greater than expected in units of whole clusters and also detects a number of negative block bitmap differences equal to the iblocks discrepancy in cluster units. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-28Merge tag 'for-linus-5.0-ofs1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fixlet from Mike Marshall: "Remove two un-needed BUG_ONs" * tag 'for-linus-5.0-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: remove two un-needed BUG_ONs...
2019-02-28fs: fix guard_bio_eod to check for real EOD errorsCarlos Maiolino
guard_bio_eod() can truncate a segment in bio to allow it to do IO on odd last sectors of a device. It already checks if the IO starts past EOD, but it does not consider the possibility of an IO request starting within device boundaries can contain more than one segment past EOD. In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will underflow bvec->bv_len. Fix this by checking if truncated_bytes is lower than PAGE_SIZE. This situation has been found on filesystems such as isofs and vfat, which doesn't check the device size before mount, if the device is smaller than the filesystem itself, a readahead on such filesystem, which spans EOD, can trigger this situation, leading a call to zero_user() with a wrong size possibly corrupting memory. I didn't see any crash, or didn't let the system run long enough to check if memory corruption will be hit somewhere, but adding instrumentation to guard_bio_end() to check truncated_bytes size, was enough to see the error. The following script can trigger the error. MNT=/mnt IMG=./DISK.img DEV=/dev/loop0 mkfs.vfat $IMG mount $IMG $MNT cp -R /etc $MNT &> /dev/null umount $MNT losetup -D losetup --find --show --sizelimit 16247280 $IMG mount $DEV $MNT find $MNT -type f -exec cat {} + >/dev/null Kudos to Eric Sandeen for coming up with the reproducer above Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add io_kiocb ref countJens Axboe
We'll use this for the POLL implementation. Regular requests will NOT be using references, so initialize it to 0. Any real use of the io_kiocb ref will initialize it to at least 2. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add submission pollingJens Axboe
This enables an application to do IO, without ever entering the kernel. By using the SQ ring to fill in new sqes and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. By default, we allow 1 second of active spinning. This can by changed by passing in a different grace period at io_uring_register(2) time. If the thread exceeds this idle time without having any work to do, it will set: sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard it's io_uring_enter(2) call with: read_barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP); instead of calling it unconditionally. It's mandatory to use fixed files with this feature. Failure to do so will result in the application getting an -EBADF CQ entry when submitting IO. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add file set registrationJens Axboe
We normally have to fget/fput for each IO we do on a file. Even with the batching we do, the cost of the atomic inc/dec of the file usage count adds up. This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes for the io_uring_register(2) system call. The arguments passed in must be an array of __s32 holding file descriptors, and nr_args should hold the number of file descriptors the application wishes to pin for the duration of the io_uring instance (or until IORING_UNREGISTER_FILES is called). When used, the application must set IOSQE_FIXED_FILE in the sqe->flags member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd to the index in the array passed in to IORING_REGISTER_FILES. Files are automatically unregistered when the io_uring instance is torn down. An application need only unregister if it wishes to register a new set of fds. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add support for pre-mapped user IO buffersJens Axboe
If we have fixed user buffers, we can map them into the kernel when we setup the io_uring. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having setup an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and the nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: batch io_kiocb allocationJens Axboe
Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: use fget/fput_many() for file referencesJens Axboe
Add a separate io_submit_state structure, to cache some of the things we need for IO submission. One such example is file reference batching. io_submit_state. We get as many references as the number of sqes we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28fs: add fget_many() and fput_many()Jens Axboe
Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: support for IO pollingJens Axboe
Add support for a polled io_uring instance. When a read or write is submitted to a polled io_uring, the application must poll for completions on the CQ ring through io_uring_enter(2). Polled IO may not generate IRQ completions, hence they need to be actively found by the application itself. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28io_uring: add fsync supportChristoph Hellwig
Add a new fsync opcode, which either syncs a range if one is passed, or the whole file if the offset and length fields are both cleared to zero. A flag is provided to use fdatasync semantics, that is only force out metadata which is required to retrieve the file data, but not others like metadata. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28Add io_uring IO interfaceJens Axboe
The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28afs: Use fs_context to pass parameters over automountDavid Howells
Alter the AFS automounting code to create and modify an fs_context struct when parameterising a new mount triggered by an AFS mountpoint rather than constructing device name and option strings. Also remove the cell=, vol= and rwpath options as they are then redundant. The reason they existed is because the 'device name' may be derived literally from a mountpoint object in the filesystem, so default cell and parent-type information needed to be passed in by some other method from the automount routines. The vol= option didn't end up being used. Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric W. Biederman <ebiederm@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28afs: Add fs_context supportDavid Howells
Add fs_context support to the AFS filesystem, converting the parameter parsing to store options there. This will form the basis for namespace propagation over mountpoints within the AFS model, thereby allowing AFS to be used in containers more easily. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>