summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-11-18NFS: remove duplicated include from nfs4file.cYueHaibing
Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18NFSv4: Make _nfs42_proc_copy_notify() staticYueHaibing
Fix sparse warning: fs/nfs/nfs42proc.c:527:5: warning: symbol '_nfs42_proc_copy_notify' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18NFS: Fallocate should use the nfs4_fattr_bitmapAnna Schumaker
Changing a sparse file could have an effect not only on the file size, but also on the number of blocks used by the file in the underlying filesystem. The server's cache_consistency_bitmap doesn't update the SPACE_USED attribute, so let's switch to the nfs4_fattr_bitmap to catch this update whenever we do an ALLOCATE or DEALLOCATE. This patch fixes xfstests generic/568, which tests that fallocating an unaligned range allocates all blocks touched by that range. Without this patch, `stat` reports 0 bytes used immediately after the fallocate. Adding a `sleep 5` to the test also catches the update, but it's better to do so when we know something has changed. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18NFS: Return -ETXTBSY when attempting to write to a swapfileAnna Schumaker
My understanding is that -EBUSY refers to the underlying device, and that -ETXTBSY is used when attempting to access a file in use by the kernel (like a swapfile). Changing this return code helps us pass xfstests generic/569 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18fs: nfs: sysfs: Remove NULL check before kfreeSaurav Girepunje
Remove NULL check before kfree, NULL check is taken care on kfree. Signed-off-by: Saurav Girepunje <saurav.girepunje@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18NFS: remove unneeded semicolonYueHaibing
remove unneeded semicolon. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-18NFSv4: add declaration of current_stateidBen Dooks
The current_stateid is exported from nfs4state.c but not declared in any of the headers. Add to nfs4_fs.h to remove the following warning: fs/nfs/nfs4state.c:80:20: warning: symbol 'current_stateid' was not declared. Should it be static? Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-11-17ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gapsZhihao Cheng
Running stress-test test_2 in mtd-utils on ubi device, sometimes we can get following oops message: BUG: unable to handle page fault for address: ffffffff00000140 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 280a067 P4D 280a067 PUD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 60 Comm: kworker/u16:1 Kdump: loaded Not tainted 5.2.0 #13 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0 -0-ga698c8995f-prebuilt.qemu.org 04/01/2014 Workqueue: writeback wb_workfn (flush-ubifs_0_0) RIP: 0010:rb_next_postorder+0x2e/0xb0 Code: 80 db 03 01 48 85 ff 0f 84 97 00 00 00 48 8b 17 48 83 05 bc 80 db 03 01 48 83 e2 fc 0f 84 82 00 00 00 48 83 05 b2 80 db 03 01 <48> 3b 7a 10 48 89 d0 74 02 f3 c3 48 8b 52 08 48 83 05 a3 80 db 03 RSP: 0018:ffffc90000887758 EFLAGS: 00010202 RAX: ffff888129ae4700 RBX: ffff888138b08400 RCX: 0000000080800001 RDX: ffffffff00000130 RSI: 0000000080800024 RDI: ffff888138b08400 RBP: ffff888138b08400 R08: ffffea0004a6b920 R09: 0000000000000000 R10: ffffc90000887740 R11: 0000000000000001 R12: ffff888128d48000 R13: 0000000000000800 R14: 000000000000011e R15: 00000000000007c8 FS: 0000000000000000(0000) GS:ffff88813ba00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff00000140 CR3: 000000013789d000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: destroy_old_idx+0x5d/0xa0 [ubifs] ubifs_tnc_start_commit+0x4fe/0x1380 [ubifs] do_commit+0x3eb/0x830 [ubifs] ubifs_run_commit+0xdc/0x1c0 [ubifs] Above Oops are due to the slab-out-of-bounds happened in do-while of function layout_in_gaps indirectly called by ubifs_tnc_start_commit. In function layout_in_gaps, there is a do-while loop placing index nodes into the gaps created by obsolete index nodes in non-empty index LEBs until rest index nodes can totally be placed into pre-allocated empty LEBs. @c->gap_lebs points to a memory area(integer array) which records LEB numbers used by 'in-the-gaps' method. Whenever a fitable index LEB is found, corresponding lnum will be incrementally written into the memory area pointed by @c->gap_lebs. The size ((@c->lst.idx_lebs + 1) * sizeof(int)) of memory area is allocated before do-while loop and can not be changed in the loop. But @c->lst.idx_lebs could be increased by function ubifs_change_lp (called by layout_leb_in_gaps->ubifs_find_dirty_idx_leb->get_idx_gc_leb) during the loop. So, sometimes oob happens when number of cycles in do-while loop exceeds the original value of @c->lst.idx_lebs. See detail in https://bugzilla.kernel.org/show_bug.cgi?id=204229. This patch fixes oob in layout_in_gaps. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17ubifs: do_kill_orphans: Fix a memory leak bugZhihao Cheng
If there are more than one valid snod on the sleb->nodes list, do_kill_orphans will malloc ino more than once without releasing previous ino's memory. Finally, it will trigger memory leak. Fixes: ee1438ce5dc4 ("ubifs: Check link count of inodes when...") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17Revert "ubifs: Fix memory leak bug in alloc_ubifs_info() error path"Richard Weinberger
This reverts commit 9163e0184bd7d5f779934d34581843f699ad2ffd. At the point when ubifs_fill_super() runs, we have already a reference to the super block. So upon deactivate_locked_super() c will get free()'ed via ->kill_sb(). Cc: Wenwen Wang <wenwen@cs.uga.edu> Fixes: 9163e0184bd7 ("ubifs: Fix memory leak bug in alloc_ubifs_info() error path") Reported-by: https://twitter.com/grsecurity/status/1180609139359277056 Signed-off-by: Richard Weinberger <richard@nod.at> Tested-by: Romain Izard <romain.izard.pro@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17ubifs: Fix type of sup->hash_algoBen Dooks (Codethink)
The sup->hash_algo is a __le16, and whilst 0xffff is the same in __le16 and u16, it would be better to use cpu_to_le16() anyway (which should deal with constants) and silence the following sparse warning: fs/ubifs/sb.c:187:32: warning: incorrect type in assignment (different base types) fs/ubifs/sb.c:187:32: expected restricted __le16 [usertype] hash_algo fs/ubifs/sb.c:187:32: got int Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17ubifs: Fixed missed le64_to_cpu() in journalBen Dooks (Codethink)
In the ubifs_jnl_write_inode() functon, it calls ubifs_iget() with xent->inum. The xent->inum is __le64, but the ubifs_iget() takes native cpu endian. I think that this should be changed to passing le64_to_cpu(xent->inum) to fix the following sparse warning: fs/ubifs/journal.c:902:58: warning: incorrect type in argument 2 (different base types) fs/ubifs/journal.c:902:58: expected unsigned long inum fs/ubifs/journal.c:902:58: got restricted __le64 [usertype] inum Fixes: 7959cf3a7506 ("ubifs: journal: Handle xattrs like files") Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17ubifs: Force prandom result to __le32Ben Dooks (Codethink)
In set_dent_cookie() the result of prandom_u32() is assinged to an __le32 type. Make this a forced conversion to remove the following sparse warning: fs/ubifs/journal.c:506:30: warning: incorrect type in assignment (different base types) fs/ubifs/journal.c:506:30: expected restricted __le32 [usertype] cookie fs/ubifs/journal.c:506:30: got unsigned int Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-17ubifs: Remove obsolete TODO from dfs_file_write()Richard Weinberger
AFAICT this kind of problems are no longer possible since debugfs gained file removal protection via e9117a5a4bf6 ("debugfs: implement per-file removal protection"). Cc: Christoph Hellwig <hch@lst.de> Cc: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-11-15xfs: fix attr leaf header freemap.size underflowBrian Foster
The leaf format xattr addition helper xfs_attr3_leaf_add_work() adjusts the block freemap in a couple places. The first update drops the size of the freemap that the caller had already selected to place the xattr name/value data. Before the function returns, it also checks whether the entries array has encroached on a freemap range by virtue of the new entry addition. This is necessary because the entries array grows from the start of the block (but end of the block header) towards the end of the block while the name/value data grows from the end of the block in the opposite direction. If the associated freemap is already empty, however, size is zero and the subtraction underflows the field and causes corruption. This is reproduced rarely by generic/070. The observed behavior is that a smaller sized freemap is aligned to the end of the entries list, several subsequent xattr additions land in larger freemaps and the entries list expands into the smaller freemap until it is fully consumed and then underflows. Note that it is not otherwise a corruption for the entries array to consume an empty freemap because the nameval list (i.e. the firstused pointer in the xattr header) starts beyond the end of the corrupted freemap. Update the freemap size modification to account for the fact that the freemap entry can be empty and thus stale. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-15xfs: fix some memory leaks in log recoveryDarrick J. Wong
Fix a few places where we xlog_alloc_buffer a buffer, hit an error, and then bail out without freeing the buffer. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-11-15Merge tag 'for-linus-20191115' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few fixes that should make it into this release. This contains: - io_uring: - The timeout command assumes sequence == 0 means that we want one completion, but this kind of overloading is unfortunate as it prevents users from doing a pure time based wait. Since this operation was introduced in this cycle, let's correct it now, while we can. (me) - One-liner to fix an issue with dependent links and fixed buffer reads. The actual IO completed fine, but the link got severed since we stored the wrong expected value. (me) - Add TIMEOUT to list of opcodes that don't need a file. (Pavel) - rsxx missing workqueue destry calls. Old bug. (Chuhong) - Fix blk-iocost active list check (Jiufei) - Fix impossible-to-hit overflow merge condition, that still hit some folks very rarely (Junichi) - Fix bfq hang issue from 5.3. This didn't get marked for stable, but will go into stable post this merge (Paolo)" * tag 'for-linus-20191115' of git://git.kernel.dk/linux-block: rsxx: add missed destroy_workqueue calls in remove iocost: check active_list of all the ancestors in iocg_activate() block, bfq: deschedule empty bfq_queues not referred by any process io_uring: ensure registered buffer import returns the IO length io_uring: Fix getting file for timeout block: check bi_size overflow before merge io_uring: make timeout sequence == 0 mean no sequence
2019-11-15fs/namei.c: fix missing barriers when checking positivityAl Viro
Pinned negative dentries can, generally, be made positive by another thread. Conditions that prevent that are * ->d_lock on dentry in question * parent directory held at least shared * nobody else could have observed the address of dentry Most of the places working with those fall into one of those categories; however, d_lookup() and friends need to be used with some care. Fortunately, there's not a lot of call sites, and with few exceptions all of those fall under one of the cases above. Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache() and mountpoint_last(). Another one is lookup_slow() - there dcache lookup is done with parent held shared, but the result is used after we'd drop the lock. The same happens in do_last() - the lookup (in lookup_one()) is done with parent locked, but result is used after unlocking. lookup_fast(), do_last() and mountpoint_last() flat-out reject negatives. Most of lookup_dcache() calls are made with parent locked at least shared; the only exception is lookup_one_len_unlocked(). It might return pinned negative, needs serious care from callers. Fortunately, almost nobody calls it directly anymore; all but two callers have converted to lookup_positive_unlocked(), which rejects negatives. lookup_slow() is called by the same lookup_one_len_unlocked() (see above), mountpoint_last() and walk_component(). In those two negatives are rejected. In other words, there is a small set of places where we need to check carefully if a pinned potentially negative dentry is, in fact, positive. After that check we want to be sure that both ->d_inode and type bits in ->d_flags are stable and observed. The set consists of follow_managed() (where the rejection happens for lookup_fast(), walk_component() and do_last()), last_mountpoint() and lookup_positive_unlocked(). Solution: 1) transition from negative to positive (in __d_set_inode_and_type()) stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits. 2) aforementioned 3 places in fs/namei.c fetch ->d_flags with smp_load_acquire() and bugger off if it type bits say "negative". That way anyone downstream of those checks has dentry know positive pinned, with ->d_inode and type bits of ->d_flags stable and observed. I considered splitting off d_lookup_positive(), so that the checks could be done right there, under ->d_lock. However, that leads to massive duplication of rather subtle code in fs/namei.c and fs/dcache.c. It's worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/ No matter what, autofs_d_manage()/autofs_d_automount() must live with the possibility of pinned negative dentry passed their way, becoming positive under them - that's the intended behaviour when lookup comes in the middle of automount in progress, so we can't keep them out of the area that has to deal with those, more's the pity... Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15fix dget_parent() fastpath raceAl Viro
We are overoptimistic about taking the fast path there; seeing the same value in ->d_parent after having grabbed a reference to that parent does *not* mean that it has remained our parent all along. That wouldn't be a big deal (in the end it is our parent and we have grabbed the reference we are about to return), but... the situation with barriers is messed up. We might have hit the following sequence: d is a dentry of /tmp/a/b CPU1: CPU2: parent = d->d_parent (i.e. dentry of /tmp/a) rename /tmp/a/b to /tmp/b rmdir /tmp/a, making its dentry negative grab reference to parent, end up with cached parent->d_inode (NULL) mkdir /tmp/a, rename /tmp/b to /tmp/a/b recheck d->d_parent, which is back to original decide that everything's fine and return the reference we'd got. The trouble is, caller (on CPU1) will observe dget_parent() returning an apparently negative dentry. It actually is positive, but CPU1 has stale ->d_inode cached. Use d->d_seq to see if it has been moved instead of rechecking ->d_parent. NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch; we just go into the slow path in such case. We don't wait for ->d_seq to become even either - again, if we are racing with renames, we can bloody well go to slow path anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15new helper: lookup_positive_unlocked()Al Viro
Most of the callers of lookup_one_len_unlocked() treat negatives are ERR_PTR(-ENOENT). Provide a helper that would do just that. Note that a pinned positive dentry remains positive - it's ->d_inode is stable, etc.; a pinned _negative_ dentry can become positive at any point as long as you are not holding its parent at least shared. So using lookup_one_len_unlocked() needs to be careful; lookup_positive_unlocked() is safer and that's what the callers end up open-coding anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15fs/namei.c: pull positivity check into follow_managed()Al Viro
There are 4 callers; two proceed to check if result is positive and fail with ENOENT if it isn't; one (in handle_lookup_down()) is guaranteed to yield positive and one (in lookup_fast()) is _preceded_ by positivity check. However, follow_managed() on a negative dentry is a (fairly cheap) no-op on anything other than autofs. And negative autofs dentries are never hashed, so lookup_fast() is not going to run into one of those. Moreover, successful follow_managed() on a _positive_ dentry never yields a negative one (and we significantly rely upon that in callers of lookup_fast()). In other words, we can easily transpose the positivity check and the call of follow_managed() in lookup_fast(). And that allows to fold the positivity check *into* follow_managed(), simplifying life for the code downstream of its calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15Merge tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Two fixes for the buffered reads and O_DIRECT writes serialization patch that went into -rc1 and a fixup for a bogus warning on older gcc versions" * tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client: rbd: silence bogus uninitialized warning in rbd_object_map_update_finish() ceph: increment/decrement dio counter on async requests ceph: take the inode lock before acquiring cap refs
2019-11-15afs: Fix race in commit bulk status fetchDavid Howells
When a lookup is done, the afs filesystem will perform a bulk status-fetch operation on the requested vnode (file) plus the next 49 other vnodes from the directory list (in AFS, directory contents are downloaded as blobs and parsed locally). When the results are received, it will speculatively populate the inode cache from the extra data. However, if the lookup races with another lookup on the same directory, but for a different file - one that's in the 49 extra fetches, then if the bulk status-fetch operation finishes first, it will try and update the inode from the other lookup. If this other inode is still in the throes of being created, however, this will cause an assertion failure in afs_apply_status(): BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags)); on or about fs/afs/inode.c:175 because it expects data to be there already that it can compare to. Fix this by skipping the update if the inode is being created as the creator will presumably set up the inode with the same information. Fixes: 39db9815da48 ("afs: Fix application of the results of a inline bulk status fetch") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESSBob Peterson
This patch closes a timing window in which two processes compete and overlap in the execution of do_xmote for the same glock: Process A Process B ------------------------------------ ----------------------------- 1. Grabs gl_lockref and calls do_xmote 2. Grabs gl_lockref but is blocked 3. Sets GLF_INVALIDATE_IN_PROGRESS 4. Unlocks gl_lockref 5. Calls do_xmote 6. Call glops->go_sync 7. test_and_clear_bit GLF_DIRTY 8. Call gfs2_log_flush Call glops->go_sync 9. (slow IO, so it blocks a long time) test_and_clear_bit GLF_DIRTY It's not dirty (step 7) returns 10. Tests GLF_INVALIDATE_IN_PROGRESS 11. Calls go_inval (rgrp_go_inval) 12. gfs2_rgrp_relse does brelse 13. truncate_inode_pages_range 14. Calls lm_lock UN In step 14 we've just told dlm to give the glock to another node when, in fact, process A has not finished the IO and synced all buffer_heads to disk and make sure their revokes are done. This patch fixes the problem by changing the GLF_INVALIDATE_IN_PROGRESS to use test_and_set_bit, and if the bit is already set, process B just ignores it and trusts that process A will do the do_xmote in the proper order. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-15gfs2: Abort gfs2_freeze if io error is seenBob Peterson
Before this patch, an io error, such as -EIO writing to the journal would cause function gfs2_freeze to go into an infinite loop, continuously retrying the freeze operation. But nothing ever clears the -EIO except unmount after withdraw, which is impossible if the freeze operation never ends (fails). Instead you get: [ 6499.767994] gfs2: fsid=dm-32.0: error freezing FS: -5 [ 6499.773058] gfs2: fsid=dm-32.0: retrying... [ 6500.791957] gfs2: fsid=dm-32.0: error freezing FS: -5 [ 6500.797015] gfs2: fsid=dm-32.0: retrying... This patch adds a check for -EIO in gfs2_freeze, and if seen, it dequeues the freeze glock, aborts the loop and returns the error. Also, there's no need to pass the freeze holder to function gfs2_lock_fs_check_clean since it's only called in one place and it's a well-known superblock pointer, so this simplifies that. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-15Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs fixes from Al Viro: "Assorted fixes all over the place; some of that is -stable fodder, some regressions from the last window" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable ecryptfs: fix unlink and rmdir in face of underlying fs modifications audit_get_nd(): don't unlock parent too early exportfs_decode_fh(): negative pinned may become positive without the parent locked cgroup: don't put ERR_PTR() into fc->root autofs: fix a leak in autofs_expire_indirect() aio: Fix io_pgetevents() struct __compat_aio_sigset layout fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15pipe: Remove sync on wake_upsDavid Howells
2019-11-15pipe: Increase the writer-wakeup threshold to reduce context-switch countDavid Howells
Increase the threshold at which the reader sends a wake event to the writers in the queue such that the queue must be half empty before the wake is issued rather than the wake being issued when just a single slot available. This reduces the number of context switches in the tests significantly, without altering the amount of work achieved. With my pipe-bench program, there's a 20% reduction versus an unpatched kernel. Suggested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Check for ring full inside of the spinlock in pipe_write()David Howells
Make pipe_write() check to see if the ring has become full between it taking the pipe mutex, checking the ring status and then taking the spinlock. This can happen if a notification is written into the pipe as that happens without the pipe mutex. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Remove redundant wakeup from pipe_write()David Howells
Remove a redundant wakeup from pipe_write(). Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Rearrange sequence in pipe_write() to preallocate slotDavid Howells
Rearrange the sequence in pipe_write() so that the allocation of the new buffer, the allocation of a ring slot and the attachment to the ring is done under the pipe wait spinlock and then the lock is dropped and the buffer can be filled. The data copy needs to be done with the spinlock unheld and irqs enabled, so the lock needs to be dropped first. However, the reader can't progress as we're holding pipe->mutex. We also need to drop the lock as that would impact others looking at the pipe waitqueue, such as poll(), the consumer and a future kernel message writer. We just abandon the preallocated slot if we get a copy error. Future writes may continue it and a future read will eventually recycle it. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Conditionalise wakeup in pipe_read()David Howells
Only do a wakeup in pipe_read() if we made space in a completely full buffer. The producer shouldn't be waiting on pipe->wait otherwise. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Advance tail pointer inside of wait spinlock in pipe_read()David Howells
Advance the pipe ring tail pointer inside of wait spinlock in pipe_read() so that the pipe can be written into with kernel notifications from contexts where pipe->mutex cannot be taken. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15pipe: Allow pipes to have kernel-reserved slotsDavid Howells
Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-15y2038: timerfd: Use timespec64 internallyArnd Bergmann
timerfd_show() uses a 'struct itimerspec' internally, but that is deprecated because of the time_t overflow and a conflict with the glibc type of the same name that is now incompatible in user space. Use a pair of timespec64 variables instead as a simple replacement. As this removes the last use of itimerspec from the kernel, allowing the removal of the definition from the uapi headers along with timespec and timeval later. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: elfcore: Use __kernel_old_timeval for process timesArnd Bergmann
We store elapsed time for a crashed process in struct elf_prstatus using 'timeval' structures. Once glibc starts using 64-bit time_t, this becomes incompatible with the kernel's idea of timeval since the structure layout no longer matches on 32-bit architectures. This changes the definition of the elf_prstatus structure to use __kernel_old_timeval instead, which is hardcoded to the currently used binary layout. There is no risk of overflow in y2038 though, because the time values are all relative times, and can store up to 68 years of process elapsed time. There is a risk of applications breaking at build time when they use the new kernel headers and expect the type to be exactly 'timeval' rather than a structure that has the same fields as before. Those applications have to be modified to deal with 64-bit time_t anyway. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: syscalls: change remaining timeval to __kernel_old_timevalArnd Bergmann
All of the remaining syscalls that pass a timeval (gettimeofday, utime, futimesat) can trivially be changed to pass a __kernel_old_timeval instead, which has a compatible layout, but avoids ambiguity with the timeval type in user space. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15y2038: remove CONFIG_64BIT_TIMEArnd Bergmann
The CONFIG_64BIT_TIME option is defined on all architectures, and can be removed for simplicity now. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-14ext4: fix a bug in ext4_wait_for_tail_page_commityangerkun
No need to wait for any commit once the page is fully truncated. Besides, it may confuse e.g. concurrent ext4_writepage() with the page still be dirty (will be cleared by truncate_pagecache() in ext4_setattr()) but buffers has been freed; and then trigger a bug show as below: [ 26.057508] ------------[ cut here ]------------ [ 26.058531] kernel BUG at fs/ext4/inode.c:2134! ... [ 26.088130] Call trace: [ 26.088695] ext4_writepage+0x914/0xb28 [ 26.089541] writeout.isra.4+0x1b4/0x2b8 [ 26.090409] move_to_new_page+0x3b0/0x568 [ 26.091338] __unmap_and_move+0x648/0x988 [ 26.092241] unmap_and_move+0x48c/0xbb8 [ 26.093096] migrate_pages+0x220/0xb28 [ 26.093945] kernel_mbind+0x828/0xa18 [ 26.094791] __arm64_sys_mbind+0xc8/0x138 [ 26.095716] el0_svc_common+0x190/0x490 [ 26.096571] el0_svc_handler+0x60/0xd0 [ 26.097423] el0_svc+0x8/0xc Run the procedure (generate by syzkaller) parallel with ext3. void main() { int fd, fd1, ret; void *addr; size_t length = 4096; int flags; off_t offset = 0; char *str = "12345"; fd = open("a", O_RDWR | O_CREAT); assert(fd >= 0); /* Truncate to 4k */ ret = ftruncate(fd, length); assert(ret == 0); /* Journal data mode */ flags = 0xc00f; ret = ioctl(fd, _IOW('f', 2, long), &flags); assert(ret == 0); /* Truncate to 0 */ fd1 = open("a", O_TRUNC | O_NOATIME); assert(fd1 >= 0); addr = mmap(NULL, length, PROT_WRITE | PROT_READ, MAP_SHARED, fd, offset); assert(addr != (void *)-1); memcpy(addr, str, 5); mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE); } And the bug will be triggered once we seen the below order. reproduce1 reproduce2 ... | ... truncate to 4k | change to journal data mode | | memcpy(set page dirty) truncate to 0: | ext4_setattr: | ... | ext4_wait_for_tail_page_commit | | mbind(trigger bug) truncate_pagecache(clean dirty)| ... ... | mbind will call ext4_writepage() since the page still be dirty, and then report the bug since the buffers has been free. Fix it by return directly once offset equals to 0 which means the page has been fully truncated. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14ext4: bio_alloc with __GFP_DIRECT_RECLAIM never failsGao Xiang
Similar to [1] [2], bio_alloc with __GFP_DIRECT_RECLAIM flags guarantees bio allocation under some given restrictions, as stated in block/bio.c and fs/direct-io.c So here it's ok to not check for NULL value from bio_alloc(). [1] https://lore.kernel.org/r/20191030035518.65477-1-gaoxiang25@huawei.com [2] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Link: https://lore.kernel.org/r/20191031092315.139267-1-gaoxiang25@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14ext4: code cleanup for get_next_idChengguang Xu
Now the checks in ext4_get_next_id() and dquot_get_next_id() are almost the same, so just call dquot_get_next_id() instead of ext4_get_next_id(). Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Link: https://lore.kernel.org/r/20191006103028.31299-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14ext4: fix leak of quota reservationsJan Kara
Commit 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") moved freeing of delayed allocation reservations from dirty page invalidation time to time when we evict corresponding status extent from extent status tree. For inodes which don't have any blocks allocated this may actually happen only in ext4_clear_blocks() which is after we've dropped references to quota structures from the inode. Thus reservation of quota leaked. Fix the problem by clearing quota information from the inode only after evicting extent status tree in ext4_clear_inode(). Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14ext4: remove unused variable warning in parse_options()Olof Johansson
Commit c33fbe8f673c5 ("ext4: Enable blocksize < pagesize for dioread_nolock") removed the only user of 'sbi' outside of the ifdef, so it caused a new warning: fs/ext4/super.c:2068:23: warning: unused variable 'sbi' [-Wunused-variable] Fixes: c33fbe8f673c5 ("ext4: Enable blocksize < pagesize for dioread_nolock") Signed-off-by: Olof Johansson <olof@lixom.net> Link: https://lore.kernel.org/r/20191111022523.34256-1-olof@lixom.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
2019-11-14ext4: Enable encryption for subpage-sized blocksChandan Rajendra
Now that we have the code to support encryption for subpage-sized blocks, this commit removes the conditional check in filesystem mount code. The commit also changes the support statement in Documentation/filesystems/fscrypt.rst to reflect the fact that encryption on filesystems with blocksize less than page size now works. [EB: Tested with 'gce-xfstests -c ext4/encrypt_1k -g auto', using the new "encrypt_1k" config I created. All tests pass except for those that already fail or are excluded with the encrypt or 1k configs, and 2 tests that try to create 1023-byte symlinks which fails since encrypted symlinks are limited to blocksize-3 bytes. Also ran the dedicated encryption tests using 'kvm-xfstests -c ext4/1k -g encrypt'; all pass, including the on-disk ciphertext verification tests.] Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191023033312.361355-3-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14fs/buffer.c: support fscrypt in block_read_full_page()Eric Biggers
After each filesystem block (as represented by a buffer_head) has been read from disk by block_read_full_page(), decrypt it if needed. The decryption is done on the fscrypt_read_workqueue. This is the final change needed to support ext4 encryption with blocksize != PAGE_SIZE, and it's a fairly small change now that CONFIG_FS_ENCRYPTION is a bool and fs/crypto/ exposes functions to decrypt individual blocks and to enqueue work on the fscrypt workqueue. Don't try to add fs-verity support yet, as the fs/verity/ support layer isn't ready for sub-page blocks yet. Just add fscrypt support for now. Almost all the new code is compiled away when CONFIG_FS_ENCRYPTION=n. Cc: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191023033312.361355-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14io_uring: make POLL_ADD/POLL_REMOVE scale betterJens Axboe
One of the obvious use cases for these commands is networking, where it's not uncommon to have tons of sockets open and polled for. The current implementation uses a list for insertion and lookup, which works fine for file based use cases where the count is usually low, it breaks down somewhat for higher number of files / sockets. A test case with 30k sockets being polled for and cancelled takes: real 0m6.968s user 0m0.002s sys 0m6.936s with the patch it takes: real 0m0.233s user 0m0.010s sys 0m0.176s If you go to 50k sockets, it gets even more abysmal with the current code: real 0m40.602s user 0m0.010s sys 0m40.555s with the patch it takes: real 0m0.398s user 0m0.000s sys 0m0.341s Change is pretty straight forward, just replace the cancel_list with a red/black tree instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14gfs2: Don't loop forever in gfs2_freeze if withdrawnBob Peterson
Before this patch, function gfs2_freeze would loop forever if the filesystem it tries to freeze is withdrawn. That's because function gfs2_lock_fs_check_clean tries to enqueue the glock of the journal and the gfs2_glock returns -EIO because you can't enqueue a journaled glock after a withdraw. Move the check for file system withdraw inside the loop so that the loop can end when withdraw occurs. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14gfs2: fix infinite loop in gfs2_ail1_flush on io errorBob Peterson
Before this patch, an IO error encountered in function gfs2_ail1_flush would cause a deadlock: because of the io error (and its resulting withdrawn state), buffers stopped being written to the journal. Buffers would remain on the ail1 list, so gfs2_ail1_start_one would return 1 to indicate dirty buffers were still on the ail1 list. However, when function gfs2_ail1_flush got a non-zero return code, it would goto restart to retry the writes, which meant it would never finish, and thus the infinite loop. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14gfs2: Introduce function gfs2_withdrawnBob Peterson
Add function gfs2_withdrawn and replace all checks for the SDF_WITHDRAWN bit to call it. This does not change the logic or function of gfs2, and it facilitates later improvements to the withdraw sequence. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-11-14ceph: increment/decrement dio counter on async requestsJeff Layton
Ceph can in some cases issue an async DIO request, in which case we can end up calling ceph_end_io_direct before the I/O is actually complete. That may allow buffered operations to proceed while DIO requests are still in flight. Fix this by incrementing the i_dio_count when issuing an async DIO request, and decrement it when tearing down the aio_req. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>