linux.git - Linus' kernel tree

Age	Commit message	Author
2019-02-21	xfs: introduce an always_cow mode	Christoph Hellwig
	Add a mode where XFS never overwrites existing blocks in place. This is to aid debugging our COW code, and also to put infrastructure in place for things like possible future support for zoned block devices, which can't support overwrites.
	This mode is enabled globally by doing a:
	    echo 1 > /sys/fs/xfs/debug/always_cow
	Note that the parameter is global to allow running all tests in xfstests easily in this mode, which would not easily be possible with a per-fs sysfs file.
	In always_cow mode persistent preallocations are disabled, and fallocate will fail when called with a 0 mode (with or without FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.
	There are a few interesting xfstests failures when run in always_cow mode:
	- generic/392 fails because the bytes used in the file used to test hole punch recovery are less after the log replay. This is because the blocks written and then punched out are only freed with a delay due to the logging mechanism.
	- xfs/170 will fail as the already fragile file streams mechanism doesn't seem to interact well with the COW allocator.
	- xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim the file system is badly fragmented, but there is not much we can do to avoid that when always writing out of place.
	- xfs/205 fails because overwriting a file in always_cow mode will require new space allocation and the assumptions in the test thus don't work anymore.
	- xfs/326 fails to modify the file at all in always_cow mode after injecting the refcount error, leading to an unexpected md5sum after the remount, but that again is expected.
	Signed-off-by: Christoph Hellwig <hch@lst.de>
	Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
	Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
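A minimal sketch of toggling the knob named in the message above from a test script. The sysfs path comes from the commit message; the helper names are hypothetical, and the assumptions that writing `0` turns the mode off again and that the file reads back as `0` or `1` are not stated in the message. The knob only exists on kernels built with XFS debugging, and writing to it needs root.

```python
from pathlib import Path

# Path taken from the commit message.
ALWAYS_COW_KNOB = Path("/sys/fs/xfs/debug/always_cow")


def set_always_cow(enabled, knob=ALWAYS_COW_KNOB):
    """Write 1 (or, by assumption, 0) to the global always_cow knob."""
    knob.write_text("1\n" if enabled else "0\n")


def always_cow_enabled(knob=ALWAYS_COW_KNOB):
    """Return True if the knob currently reads back as 1 (assumed format)."""
    return knob.read_text().strip() == "1"
```

The `knob` parameter exists only so the helpers can be pointed at an ordinary file when experimenting without a debug kernel.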
2019-02-21	xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay	Christoph Hellwig
	No user of it in the iomap code at the moment, but we should not actively report wrong information if we can trivially get it right. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21	xfs: make COW fork unwritten extent conversions more robust	Christoph Hellwig
	If we have racing buffered and direct I/O, COW fork extents under writeback can have been moved to the data fork by the time we call xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly harmless as the block numbers don't change by this move, except for the fact that xfs_bmapi_write will crash or trigger asserts when not finding existing extents, even despite trying to paper over this with the XFS_BMAPI_CONVERT_ONLY flag. Instead of special casing non-transaction conversions in the already way too complicated xfs_bmapi_write, just add a new helper for the much simpler non-transactional COW fork case, which simply ignores not found extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21	xfs: merge COW handling into xfs_file_iomap_begin_delay	Christoph Hellwig
	Besides simplifying the code a bit, this allows us to actually implement the behavior of using COW preallocation for non-COW data mentioned in the current comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21	xfs: also truncate holes covered by COW blocks	Christoph Hellwig
	This only matters if we want to write data through the COW fork that is not actually an overwrite of existing data. Reasons for that are speculative COW fork allocations using the cowextsize, or a mode where we always write through the COW fork. Currently both can't actually happen, but I plan to enable them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21	xfs: don't use delalloc extents for COW on files with extsize hints	Christoph Hellwig
	While using delalloc for extsize hints is generally a good idea, the current code that does so only for COW doesn't help us much and creates a lot of special cases. Switch it to use real allocations like we do for direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21	xfs: fix SEEK_DATA for speculative COW fork preallocation	Christoph Hellwig
	We speculatively allocate extents in the COW fork to reduce fragmentation. But when we write data into such COW fork blocks that do not shadow an allocation in the data fork, SEEK_DATA will not correctly report it, as it only looks at the data fork extents. The only reason why that hasn't been an issue so far is because we don't even use these speculative COW fork preallocations over holes in the data fork at all for buffered writes, and blocks in the COW fork that are written by direct writes are moved into the data fork immediately at I/O completion time. Add a new set of iomap_ops for SEEK_HOLE/SEEK_DATA which looks into both the COW and data fork, and reports all COW extents as unwritten to the iomap layer. While this isn't strictly true for COW fork extents that were already converted to real extents, the practical semantics that you can't read data from them until they are moved into the data fork are very similar, and this will force the iomap layer into probing the extents for actually present data. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
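For readers who want to observe SEEK_DATA/SEEK_HOLE from user space, the following is a rough sketch that walks a file and lists the ranges the filesystem reports as data. It uses only `os.lseek` with `os.SEEK_DATA` and `os.SEEK_HOLE` (Linux); the helper name `data_ranges` is made up for illustration and is not part of the patch.

```python
import errno
import os


def data_ranges(path):
    """Return [(start, end), ...] for each range reported as data."""
    ranges = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # no more data after this offset
                raise
            # The next hole (or EOF) ends the current data range.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            ranges.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return ranges
```

On a filesystem that misreports data written through the COW fork, a range that was just written would be missing from this list, which is the symptom the message describes.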
2019-02-21	xfs: make xfs_bmbt_to_iomap more useful	Christoph Hellwig
	Move checking for invalid zero blocks and setting of various iomap flags into this helper. Also make it deal with "raw" delalloc extents to avoid clutter in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21	ext4: annotate more implicit fall throughs	Mathieu Malaterre
	There is a plan to build the kernel with -Wimplicit-fallthrough and these places in the code produced warnings (W=1). Fix them up. This commit removes the following warnings:
	    fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
	    fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
	    fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
	    fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
	Signed-off-by: Mathieu Malaterre <malat@debian.org>
	Signed-off-by: Theodore Ts'o <tytso@mit.edu>
	Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-02-21	ext4: annotate implicit fall throughs	Mathieu Malaterre
	There is a plan to build the kernel with -Wimplicit-fallthrough and these places in the code produced warnings (W=1). Fix them up. This commit removes the following warnings:
	    fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
	    fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
	Signed-off-by: Mathieu Malaterre <malat@debian.org>
	Signed-off-by: Theodore Ts'o <tytso@mit.edu>
	Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-02-21	nfsd: fix performance-limiting session calculation	J. Bruce Fields
	We're unintentionally limiting the number of slots per nfsv4.1 session to 10. Often more than 10 simultaneous RPCs are needed for the best performance. This calculation was meant to prevent any one client from using up more than a third of the limit we set for total memory use across all clients and sessions. Instead, it's limiting the client to a third of the maximum for a single session. Fix this. Reported-by: Chris Tracy <ctracy@engr.scu.edu> Cc: stable@vger.kernel.org Fixes: de766e570413 "nfsd: give out fewer session slots as limit approaches" Signed-off-by: J. Bruce Fields <bfields@redhat.com>
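The mix-up described above can be modelled with a few lines of arithmetic. This is a simplified sketch, not the nfsd code: the function name, its parameters, and the example numbers (2 KiB per slot, a 64 KiB per-session maximum, a 16 MiB global limit) are assumptions chosen for illustration.

```python
def slots_for_session(requested, slot_size, per_session_max, total_limit, buggy):
    """Model how many slots a session is granted under each calculation."""
    if buggy:
        # Described bug: a third of the single-session maximum.
        cap = per_session_max // 3
    else:
        # Described intent: at most a third of the global limit,
        # still bounded by the single-session maximum.
        cap = min(per_session_max, total_limit // 3)
    cap = max(cap, slot_size)  # always leave room for one slot
    return min(requested, cap // slot_size)


# Hypothetical numbers, not taken from the commit message.
before = slots_for_session(64, 2048, 64 * 1024, 16 * 1024 * 1024, buggy=True)
after = slots_for_session(64, 2048, 64 * 1024, 16 * 1024 * 1024, buggy=False)
```

With these assumed inputs the first calculation grants 10 slots and the second grants 32, which shows how dividing the wrong quantity by three caps a session far below what the global limit would allow.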
2019-02-21	fanotify: Make waits for fanotify events only killable	Jan Kara
	Making waits for response to fanotify permission events interruptible can result in EINTR returns from open(2) or other syscalls when there's e.g. AV software that's monitoring the file. Orion reports that e.g. bash is complaining like: bash: /etc/bash_completion.d/itweb-settings.bash: Interrupted system call So for now convert the wait from interruptible to only killable one. That is mostly invisible to userspace. Sadly this breaks hibernation with fanotify permission events pending again but we have to put more thought into how to fix this without regressing userspace visible behavior. Reported-by: Orion Poplawski <orion@nwra.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-20	nfs: fix xfstest generic/099 failed on nfsv3	ZhangXiaoxu
	After setxattr, nfsv3 caches the ACL that was set by the user. But at the backend, the shared file system (e.g. ext4) checks the ACL and, if it can be merged with the mode, won't add the ACL to the file. So the ACL cached by nfsv3 is redundant. Don't call 'set_cached_acl' on setxattr. Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	pNFS: Avoid read/modify/write when it is not necessary	Kazuo Ito
	As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin)
	The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached portions of existing files, making such operations excessively slow even when they are block-aligned.
	The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones.
	Testing results:
	We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server.
	    pNFS clients ---1G Ethernet--- pNFS server
	    (HP DL360 G8)                  (HP DL360 G8)
	          |                              |
	          |                              |
	          +------8G Fiber Channel--------+
	                        |
	                 Storage Array
	                  (HP P6350)
	Throughput of overwrite (both buffered and O_SYNC) is noticeably improved.
	    Ops.     |block size|   Throughput   |
	             |  (KiB)   |    (MiB/s)     |
	             |          |  4.20 | patched|
	    ---------+----------+----------------+
	    buffered |         4|  21.3 |  232   |
	    overwrite|        32|  22.2 |  256   |
	             |       512|  22.4 |  260   |
	    ---------+----------+----------------+
	    O_SYNC   |         4|  3.84 |  4.77  |
	    overwrite|        32|  12.2 |  32.0  |
	             |       512|  18.5 |  152   |
	    ---------+----------+----------------+
	Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do.
	    Ops.     |block size|   Throughput   |
	             |  (KiB)   |    (MiB/s)     |
	             |          |  4.20 | patched|
	    ---------+----------+----------------+
	    read     |         4|  548  |  550   |
	             |        32|  547  |  551   |
	             |       512|  548  |  551   |
	    ---------+----------+----------------+
	    buffered |         4|  237  |  244   |
	    write    |        32|  261  |  268   |
	             |       512|  265  |  272   |
	    ---------+----------+----------------+
	    O_SYNC   |         4|  0.46 |  0.46  |
	    write    |        32|  3.60 |  3.57  |
	             |       512|  105  |  106   |
	    ---------+----------+----------------+
	Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
	Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp>
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
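The alignment rule stated at the top of this message (read-modify-write is needed when the data is not aligned to a block boundary or is smaller than the block size) can be written as a small predicate. This is only an illustrative sketch; the function name is hypothetical and it ignores everything else the kernel considers, such as page cache state or writes beyond end of file.

```python
def needs_read_modify_write(offset, length, block_size):
    """True if a write of `length` bytes at `offset` covers a partial block."""
    if length <= 0:
        return False
    starts_aligned = offset % block_size == 0
    ends_aligned = (offset + length) % block_size == 0
    # A write shorter than one block always fails one of the two checks.
    return not (starts_aligned and ends_aligned)
```

For example, with a 4096-byte block, a 4096-byte write at offset 0 covers whole blocks only, while a 512-byte write at offset 0 or a 4096-byte write at offset 512 touches a partial block.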
2019-02-20	pNFS: Fix potential corruption of page being written	Kazuo Ito
	nfs_want_read_modify_write() didn't check for !PagePrivate when pNFS block or SCSI layout was in use, therefore we could lose data forever if the page being written was filled by a read before completion. Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Fix typo in comments of nfs_readdir_alloc_pages()	zhangliguang
	This fixes the typo in the comments of nfs_readdir_alloc_pages(): nfs_readdir_large_page and nfs_readdir_free_pagearray had been renamed. Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Remove redundant semicolon	zhangliguang
	This removes a redundant semicolon at the end of a statement. Fixes: c7944ebb9ce9 ("NFSv4: Fix lookup revalidate of regular files") Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: readdirplus optimization by cache mechanism	luanshi
	When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved:
	First of all, ls and practically every other method of listing a directory, including python os.listdir and find, rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries.
	Secondly, libc readdir() reads 32K of directory entries at a time, and in kernel space the 32K buffer is split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls.
	Lastly, one NFS readdirplus rpc asks for 32K of data (filled by nfs_dentry) to fill one page (filled by dentry); we found that nearly one third of the data was wasted.
	To solve the above problems, a pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large amount of data (more than 32k), the data can fill more than one page, and the cached pages can be used for the next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance.
	TESTING: When listing very large directories (including 300 thousand files) via NFS:
	    time ls -l /nfs_mount | wc -l
	without the patch:
	    300001
	    real 1m53.524s
	    user 0m2.314s
	    sys 0m2.599s
	with the patch:
	    300001
	    real 0m23.487s
	    user 0m2.305s
	    sys 0m2.558s
	Improved performance: 79.6%
	readdirplus rpc calls decrease: 85%
	Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	fs/nfs: Fix nfs_parse_devname to not modify its argument	Eric W. Biederman
	In the rare and unsupported case of a hostname list nfs_parse_devname will modify dev_name. There is no need to modify dev_name as all that is being computed is the length of the hostname, so the computed length can just be shortened. Fixes: dc04589827f7 ("NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: drop useless LIST_HEAD	Julia Lawall
	Drop LIST_HEAD where the variable it declares has never been used. The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/)
	    // <smpl>
	    @@
	    identifier x;
	    @@
	    - LIST_HEAD(x);
	      ... when != x
	    // </smpl>
	Fixes: 0e20162ed1e9 ("NFSv4.1 Use MDS auth flavor for data server connection")
	Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Fix sparse annotations for nfs_set_open_stateid_locked()	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Fix up documentation warnings	Trond Myklebust
	Fix up some compiler warnings about function parameters, etc not being correctly described or formatted. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: ENOMEM should also be a fatal error.	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: EINTR is also a fatal error.	Trond Myklebust
	Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Ensure NFS writeback allocations don't recurse back into NFS.	Trond Myklebust
	All the allocations that we can hit in the NFS layer and sunrpc layers themselves are already marked as GFP_NOFS, but we need to ensure that any calls to generic kernel functionality do the right thing as well. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Pass error information to the pgio error cleanup routine	Trond Myklebust
	Allow the caller to pass error information when cleaning up a failed I/O request so that we can conditionally take action to cancel the request altogether if the error turned out to be fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Clean up list moves of struct nfs_page	Trond Myklebust
	In several places we're just moving the struct nfs_page from one list to another by first removing from the existing list, then adding to the new one. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20	NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()	Trond Myklebust
	If the I/O completion failed with a fatal error, then we should just exit nfs_pageio_complete_mirror() rather than try to recoalesce. Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.0+
2019-02-20	NFS: Fix an I/O request leakage in nfs_do_recoalesce	Trond Myklebust
	Whether we need to exit early, or just reprocess the list, we must not lose track of the request which failed to get recoalesced. Fixes: 03d5eb65b538 ("NFS: Fix a memory leak in nfs_do_recoalesce") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.0+
2019-02-20	NFS: Fix I/O request leakages	Trond Myklebust
	When we fail to add the request to the I/O queue, we currently leave it to the caller to free the failed request. However, since some of the requests that fail are actually created by nfs_pageio_add_request() itself, and are not passed back to the caller, this leads to a leakage issue, which can again cause page locks to leak. This commit addresses the leakage by freeing the created requests on error, using desc->pg_completion_ops->error_cleanup() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer") Cc: stable@vger.kernel.org # v4.0: c18b96a1b862: nfs: clean up rest of reqs Cc: stable@vger.kernel.org # v4.0: d600ad1f2bdb: NFS41: pop some layoutget Cc: stable@vger.kernel.org # v4.0+
2019-02-20	orangefs: remove two un-needed BUG_ONs...	Mike Marshall
	Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-02-20	Merge branch 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security	Linus Torvalds
	Pull keys fixes from James Morris:
	- Handle quotas better, allowing full quota to be reached.
	- Fix the creation of shortcuts in the assoc_array internal representation when the index key needs to be an exact multiple of the machine word size.
	- Fix a dependency loop between the request_key construction record and the request_key authentication key. The construction record isn't really necessary and can be dispensed with.
	- Set the timestamp on a new key rather than leaving it as 0. This would ordinarily be fine - provided the system clock is never set to a time before 1970.
	* 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
	    keys: Timestamp new keys
	    keys: Fix dependency loop between construction record and auth key
	    assoc_array: Fix shortcut creation
	    KEYS: allow reaching the keys quotas exactly
2019-02-20	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller
	Two easily resolvable overlapping change conflicts, one in TCP and one in the eBPF verifier. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-18	exec: Fix mem leak in kernel_read_file	YueHaibing
	syzkaller reports this:
	    BUG: memory leak
	    unreferenced object 0xffffc9000488d000 (size 9195520):
	      comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
	      hex dump (first 32 bytes):
	        ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00  ................
	        02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff  ..........z.....
	      backtrace:
	        [<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline]
	        [<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
	        [<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
	        [<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
	        [<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
	        [<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
	        [<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
	        [<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
	        [<00000000241f889b>] 0xffffffffffffffff
	It should goto the 'out_free' label to free the allocated buf when kernel_read fails.
	Fixes: 39d637af5aa7 ("vfs: forbid write access when reading a file into memory")
	Signed-off-by: YueHaibing <yuehaibing@huawei.com>
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-18	exec: load_script: Do not exec truncated interpreter path	Kees Cook
	Commit 8099b047ecc4 ("exec: load_script: don't blindly truncate shebang string") was trying to protect against a confused exec of a truncated interpreter path. However, it was overeager and also refused to truncate arguments as well, which broke userspace, and it was reverted. This attempts the protection again, but allows arguments to remain truncated. In an effort to improve readability, helper functions and comments have been added. Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Samuel Dionne-Riel <samuel@dionne-riel.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Graham Christensen <graham@grahamc.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
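The rule described above (refuse a truncated interpreter path, tolerate truncated arguments) can be modelled roughly as follows. This is a sketch of the idea and not the kernel's load_script(): the 128-byte buffer size, the function name, and the treatment of files shorter than the buffer are assumptions made for illustration.

```python
BUF_SIZE = 128  # assumed size of the buffer holding the start of the script


def should_refuse_exec(first_bytes):
    """Return True if the interpreter path may have been cut off."""
    buf = first_bytes[:BUF_SIZE]
    if not buf.startswith(b"#!"):
        raise ValueError("not a #! script")
    if b"\n" in buf:
        return False  # the whole #! line fits, nothing is truncated
    if len(buf) < BUF_SIZE:
        return False  # short file: the line simply ends at EOF
    rest = buf[2:].lstrip(b" \t")
    if not rest:
        return True  # only spaces/tabs after "#!": no path at all
    # A space, tab or NUL after the path proves the path is complete;
    # anything truncated beyond that point is only arguments.
    return not any(byte in b" \t\0" for byte in rest)
```

A very long path with no terminator inside the buffer is refused, while `#!/bin/sh` followed by an overlong argument string is accepted because only the arguments are cut short.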
2019-02-18	xfs: fix xfs_buf magic number endian checks	Darrick J. Wong
	Create a separate magic16 check function so that we don't run afoul of static checkers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-18	ceph: avoid repeatedly adding inode to mdsc->snap_flush_list	Yan, Zheng
	Otherwise, mdsc->snap_flush_list may get corrupted. Cc: stable@vger.kernel.org Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-02-18	ext2: support statx syscall	yangerkun
	Since statx, every filesystem should fill in attributes/attributes_mask in its getattr routine. But generic_fillattr does not fill them in, so add ext2_getattr to do this. This fixes generic/424 when testing ext2. Reviewed-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18	fanotify: Use interruptible wait when waiting for permission events	Jan Kara
	When waiting for response to fanotify permission events, we currently use uninterruptible waits. That makes code simple however it can cause lots of processes to end up in uninterruptible sleep with hard reboot being the only alternative in case fanotify listener process stops responding (e.g. due to a bug in its implementation). Uninterruptible sleep also makes system hibernation fail if the listener gets frozen before the process generating fanotify permission event. Fix these problems by using interruptible sleep for waiting for response to fanotify event. This is slightly tricky though - we have to detect when the event got already reported to userspace as in that case we must not free the event. Instead we push the responsibility for freeing the event to the process that will write response to the event. Reported-by: Orion Poplawski <orion@nwra.com> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18	fanotify: Track permission event state	Jan Kara
	Track whether permission event got already reported to userspace and whether userspace already answered to the permission event. Protect stores to this field together with updates to ->response field by group->notification_lock. This will allow aborting wait for reply to permission event from userspace. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18	fanotify: Simplify cleaning of access_list	Jan Kara
	Simplify iteration cleaning access_list in fanotify_release(). That will make following changes more obvious. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18	fsnotify: Create function to remove event from notification list	Jan Kara
	Create function to remove event from the notification list. Later it will be used from more places. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18	fanotify: Move locking inside get_one_event()	Jan Kara
	get_one_event() has a single caller and that just locks notification_lock around the call. Move locking inside get_one_event() as that will make using ->response field for permission event state easier. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18	fanotify: Fold dequeue_event() into process_access_response()	Jan Kara
	Fold dequeue_event() into process_access_response(). This will make changes to use of ->response field easier. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-17	xfs: retry COW fork delalloc conversion when no extent was found	Christoph Hellwig
	While we can only truncate a block under the page lock for the current page, there is no high-level synchronization for moving extents from the COW to the data fork. This means that, for example, we can have another thread doing a direct I/O completion that moves extents from the COW to the data fork race with writeback. While this race is very hard to hit, the always_cow mode seems to reproduce it reasonably well, and it also exists without that. Because of that there is a chance that a delalloc conversion for the COW fork might not find any extents to convert. In that case we should retry the whole block lookup and now find the blocks in the data fork. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17	xfs: remove the truncate short cut in xfs_map_blocks	Christoph Hellwig
	Now that we properly handle the race with truncate in the delalloc allocator there is no need to short cut this exceptional case earlier on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17	xfs: move xfs_iomap_write_allocate to xfs_aops.c	Christoph Hellwig
	This function is a small wrapper only used by the writeback code, so move it together with the writeback code and simplify it down to the glorified do { } while loop that it now is. A few bits intentionally got lost here: no need to call xfs_qm_dqattach because quotas are always attached when we create the delalloc reservation, and no need for the imap->br_startblock == 0 check given that xfs_bmapi_convert_delalloc already has a WARN_ON_ONCE for exactly that condition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17	xfs: move stat accounting to xfs_bmapi_convert_delalloc	Christoph Hellwig
	This way we can actually count how many bytes got converted and how many calls we need, unlike in the caller which doesn't have the detailed view. Note that this includes a slight change in behavior as the xs_xstrat_quick is now bumped for every allocation instead of just the one covering the requested writeback offset, which makes a lot more sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17	xfs: move transaction handling to xfs_bmapi_convert_delalloc	Christoph Hellwig
	No need to deal with the transaction and the inode locking in the caller. Note that we also switch to passing whichfork as the second parameter, matching what most related functions do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17	xfs: split XFS_BMAPI_DELALLOC handling from xfs_bmapi_write	Christoph Hellwig
	Delalloc conversion has traditionally been part of our function to allocate blocks on disk (first xfs_bmapi, then xfs_bmapi_write), but delalloc conversion is a little special as we really do not want to allocate blocks over holes, for which we don't have reservations. Split the delalloc conversions into a separate helper to keep the code simple and structured. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>