summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2019-05-20Merge tag 'for-5.2-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Notable highlights: - fixes for some long-standing bugs in fsync that were quite hard to catch but now finaly fixed - some fixups to error handling paths that did not properly clean up (locking, memory) - fix to space reservation for inheriting properties" * tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: tree-checker: detect file extent items with overlapping ranges Btrfs: fix race between ranged fsync and writeback of adjacent ranges Btrfs: avoid fallback to transaction commit during fsync of files with holes btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes btrfs: sysfs: don't leak memory when failing add fsid btrfs: sysfs: Fix error path kobject memory leak Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path btrfs: use the existing reserved items for our first prop for inheritance btrfs: don't double unlock on error in btrfs_punch_hole btrfs: Check the compression level before getting a workspace
2019-05-19Merge tag 'upstream-5.2-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBIFS fixes from Richard Weinberger: - build errors wrt xattrs - mismerge which lead to a wrong Kconfig ifdef - missing endianness conversion * tag 'upstream-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: Convert xattr inum to host order ubifs: Use correct config name for encryption ubifs: Fix build error without CONFIG_UBIFS_FS_XATTR
2019-05-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: "A few final bits: - large changes to vmalloc, yielding large performance benefits - tweak the console-flush-on-panic code - a few fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: panic: add an option to replay all the printk message in buffer initramfs: don't free a non-existent initrd fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro mm/vmalloc.c: keep track of free blocks for vmap allocation
2019-05-19Merge tag 'kbuild-v5.2-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - remove unneeded use of cc-option, cc-disable-warning, cc-ldoption - exclude tracked files from .gitignore - re-enable -Wint-in-bool-context warning - refactor samples/Makefile - stop building immediately if syncconfig fails - do not sprinkle error messages when $(CC) does not exist - move arch/alpha/defconfig to the configs subdirectory - remove crappy header search path manipulation - add comment lines to .config to clarify the end of menu blocks - check uniqueness of module names (adding new warnings intentionally) * tag 'kbuild-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (24 commits) kconfig: use 'else ifneq' for Makefile to improve readability kbuild: check uniqueness of module names kconfig: Terminate menu blocks with a comment in the generated config kbuild: add LICENSES to KBUILD_ALLDIRS kbuild: remove 'addtree' and 'flags' magic for header search paths treewide: prefix header search paths with $(srctree)/ media: prefix header search paths with $(srctree)/ media: remove unneeded header search paths alpha: move arch/alpha/defconfig to arch/alpha/configs/defconfig kbuild: terminate Kconfig when $(CC) or $(LD) is missing kbuild: turn auto.conf.cmd into a mandatory include file .gitignore: exclude .get_maintainer.ignore and .gitattributes kbuild: add all Clang-specific flags unconditionally kbuild: Don't try to add '-fcatch-undefined-behavior' flag kbuild: add some extra warning flags unconditionally kbuild: add -Wvla flag unconditionally arch: remove dangling asm-generic wrappers samples: guard sub-directories with CONFIG options kbuild: re-enable int-in-bool-context warning MAINTAINERS: kbuild: Add pattern for scripts/*vmlinux* ...
2019-05-19Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some bug fixes, and an update to the URL's for the final version of Unicode 12.1.0" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid panic during forced reboot due to aborted journal ext4: fix block validity checks for journal inodes using indirect blocks unicode: update to Unicode 12.1.0 final unicode: add missing check for an error return from utf8lookup() ext4: fix miscellaneous sparse warnings ext4: unsigned int compared against zero ext4: fix use-after-free in dx_release() ext4: fix data corruption caused by overlapping unaligned and aligned IO jbd2: fix potential double free ext4: zero out the unused memory region in the extent tree block
2019-05-19Merge tag '5.2-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Minor cleanup and fixes, one for stable, four rdma (smbdirect) related. Also adds SEEK_HOLE support" * tag '5.2-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: add support for SEEK_DATA and SEEK_HOLE Fixed https://bugzilla.kernel.org/show_bug.cgi?id=202935 allow write on the same file cifs: Allocate memory for all iovs in smb2_ioctl cifs: Don't match port on SMBDirect transport cifs:smbd Use the correct DMA direction when sending data cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called cifs: use the right include for signal_pending() smb3: trivial cleanup to smb2ops.c cifs: cleanup smb2ops.c and normalize strings smb3: display session id in debug data
2019-05-18fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going ↵Jiufei Xue
into workqueue when umount synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb switch may not go to the workqueue after synchronize_rcu(). Thus previous scheduled switches was not finished even flushing the workqueue, which will cause a NULL pointer dereferenced followed below. VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day... BUG: unable to handle kernel NULL pointer dereference at 0000000000000278 evict+0xb3/0x180 iput+0x1b0/0x230 inode_switch_wbs_work_fn+0x3c0/0x6a0 worker_thread+0x4e/0x490 ? process_one_work+0x410/0x410 kthread+0xe6/0x100 ret_from_fork+0x39/0x50 Replace the synchronize_rcu() call with a rcu_barrier() to wait for all pending callbacks to finish. And inc isw_nr_in_flight after call_rcu() in inode_switch_wbs() to make more sense. Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Suggested-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-18treewide: prefix header search paths with $(srctree)/Masahiro Yamada
Currently, the Kbuild core manipulates header search paths in a crazy way [1]. To fix this mess, I want all Makefiles to add explicit $(srctree)/ to the search paths in the srctree. Some Makefiles are already written in that way, but not all. The goal of this work is to make the notation consistent, and finally get rid of the gross hacks. Having whitespaces after -I does not matter since commit 48f6e3cf5bc6 ("kbuild: do not drop -I without parameter"). [1]: https://patchwork.kernel.org/patch/9632347/ Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-05-17ext4: avoid panic during forced reboot due to aborted journalJan Kara
Handling of aborted journal is a special code path different from standard ext4_error() one and it can call panic() as well. Commit 1dc1097ff60e ("ext4: avoid panic during forced reboot") forgot to update this path so fix that omission. Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org # 5.1
2019-05-17Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull more vfs mount updates from Al Viro: "Propagation of new syscalls to other architectures + cosmetic change from Christian (fscontext didn't follow the convention for anon inode names)" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: uapi: Wire up the mount API syscalls on non-x86 arches [ver #2] uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2] uapi, fsopen: use square brackets around "fscontext" [ver #2]
2019-05-16Merge tag 'for-linus-20190516' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "A small set of fixes for io_uring. This contains: - smp_rmb() cleanup for io_cqring_events() (Jackie) - io_cqring_wait() simplification (Jackie) - removal of dead 'ev_flags' passing (me) - SQ poll CPU affinity verification fix (me) - SQ poll wait fix (Roman) - SQE command prep cleanup and fix (Stefan)" * tag 'for-linus-20190516' of git://git.kernel.dk/linux-block: io_uring: use wait_event_interruptible for cq_wait conditional wait io_uring: adjust smp_rmb inside io_cqring_events io_uring: fix infinite wait in khread_park() on io_finish_async() io_uring: remove 'ev_flags' argument io_uring: fix failure to verify SQ_AFF cpu io_uring: fix race condition reading SQE data
2019-05-16Merge tag 'afs-fixes-b-20190516' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS callback promise fixes from David Howells: "This series fixes a bunch of problems in callback promise handling, where a callback promise indicates a promise on the part of the server to notify the client in the event of some sort of change to a file or volume. In the event of a break, the client has to go and refetch the client status from the server and discard any cached permission information as the ACL might have changed. The problem in the current code is that changes made by other clients aren't always noticed, primarily because the file status information and the callback information aren't updated in the same critical section, even if these are carried in the same reply from an RPC operation, and so the AFS_VNODE_CB_PROMISED flag is unreliable. Arranging for them to be done in the same critical section during reply decoding is tricky because of the FS.InlineBulkStatus op - which has all the statuses in the reply arriving and then all the callbacks, so they have to be buffered. It simplifies things a lot to move the critical section out of the decode phase and do it after the RPC function returns. Also new inodes (either newly fetched or newly created) aren't properly managed against a callback break happening before we get the local inode up and running. Fix this by: - There's now a combined file status and callback record (struct afs_status_cb) to carry both plus some flags. - Each operation wrapper function allocates sufficient afs_status_cb records for all the vnodes it is interested in and passes them into RPC operations to be filled in from the reply. - The FileStatus and CallBack record decoders no longer apply the new/revised status and callback information to the inode/vnode at the point of decoding and instead store the information into the record from (2). - afs_vnode_commit_status() then revises the file status, detects deletion and notes callback information inside of a single critical section. It also checks the callback break counters and cancels the callback promise if they changed during the operation. [*] Note that "callback break counters" are counters of server events that cancel one or more callback promises that the client thinks it has. The client counts the events and compares the counters before and after an operation to see if the callback promise it thinks it just got evaporated before it got recorded under lock. - Volume and server callback break counters are passed into afs_iget() allowing callback breaks concurrent with inode set up to be detected and the callback promise thence to be cancelled. - AFS validation checks are now done under RCU conditions using a read lock on cb_lock. This requires vnode->cb_interest to be made RCU safe. - If the checks in (6) fail, the callback breaker is then called under write lock on the cb_lock - but only if the callback break counter didn't change from the value read before the checks were made. - Results from FS.InlineBulkStatus that correspond to inodes we currently have in memory are now used to update those inodes' status and callback information rather than being discarded. This requires those inodes to be looked up before the RPC op is made and all their callback break values saved. To aid in this, the following changes have also been made: - Don't pass the vnode into the reply delivery functions or the decoders. The vnode shouldn't be altered anywhere in those paths. The only exception, for the moment, is for the call done hook for file lock ops that wants access to both the vnode and the call - this can be fixed at a later time. - Get rid of the call->reply[] void* array and replace it with named and typed members. This avoids confusion since different ops were mapping different reply[] members to different things. - Fix an order-1 kmalloc allocation in afs_do_lookup() and replace it with kvcalloc(). - Always get the reply time. Since callback, lock and fileserver record expiry times are calculated for several RPCs, make this mandatory. - Call afs_pages_written_back() from the operation wrapper rather than from the delivery function. - Don't store the version and type from a callback promise in a reply as the information in them is of very limited use" * tag 'afs-fixes-b-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix application of the results of a inline bulk status fetch afs: Pass pre-fetch server and volume break counts into afs_iget5_set() afs: Fix unlink to handle YFS.RemoveFile2 better afs: Clear AFS_VNODE_CB_PROMISED if we detect callback expiry afs: Make vnode->cb_interest RCU safe afs: Split afs_validate() so first part can be used under LOOKUP_RCU afs: Don't save callback version and type fields afs: Fix application of status and callback to be under same lock afs: Always get the reply time afs: Fix order-1 allocation in afs_do_lookup() afs: Get rid of afs_call::reply[] afs: Don't pass the vnode pointer through into the inline bulk status op
2019-05-16Merge tag 'afs-fixes-20190516' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull misc AFS fixes from David Howells: "This fixes a set of miscellaneous issues in the afs filesystem, including: - leak of keys on file close. - broken error handling in xattr functions. - missing locking when updating VL server list. - volume location server DNS lookup whereby preloaded cells may not ever get a lookup and regular DNS lookups to maintain server lists consume power unnecessarily. - incorrect error propagation and handling in the fileserver iteration code causes operations to sometimes apparently succeed. - interruption of server record check/update side op during fileserver iteration causes uninterruptible main operations to fail unexpectedly. - callback promise expiry time miscalculation. - over invalidation of the callback promise on directories. - double locking on callback break waking up file locking waiters. - double increment of the vnode callback break counter. Note that it makes some changes outside of the afs code, including: - an extra parameter to dns_query() to allow the dns_resolver key just accessed to be immediately invalidated. AFS is caching the results itself, so the key can be discarded. - an interruptible version of wait_var_event(). - an rxrpc function to allow the maximum lifespan to be set on a call. - a way for an rxrpc call to be marked as non-interruptible" * tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix double inc of vnode->cb_break afs: Fix lock-wait/callback-break double locking afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set afs: Fix calculation of callback expiry time afs: Make dynamic root population wait uninterruptibly for proc_cells_lock afs: Make some RPC operations non-interruptible rxrpc: Allow the kernel to mark a call as being non-interruptible afs: Fix error propagation from server record check/update afs: Fix the maximum lifespan of VL and probe calls rxrpc: Provide kernel interface to set max lifespan on a call afs: Fix "kAFS: AFS vnode with undefined type 0" afs: Fix cell DNS lookup Add wait_var_event_interruptible() dns_resolver: Allow used keys to be invalidated afs: Fix afs_cell records to always have a VL server list record afs: Fix missing lock when replacing VL server list afs: Fix afs_xattr_get_yfs() to not try freeing an error value afs: Fix incorrect error handling in afs_xattr_get_acl() afs: Fix key leak in afs_release() and afs_evict_inode()
2019-05-16Merge tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "On the filesystem side we have: - a fix to enforce quotas set above the mount point (Luis Henriques) - support for exporting snapshots through NFS (Zheng Yan) - proper statx implementation (Jeff Layton). statx flags are mapped to MDS caps, with AT_STATX_{DONT,FORCE}_SYNC taken into account. - some follow-up dentry name handling fixes, in particular elimination of our hand-rolled helper and the switch to __getname() as suggested by Al (Jeff Layton) - a set of MDS client cleanups in preparation for async MDS requests in the future (Jeff Layton) - a fix to sync the filesystem before remounting (Jeff Layton) On the rbd side, work is on-going on object-map and fast-diff image features" * tag 'ceph-for-5.2-rc1' of git://github.com/ceph/ceph-client: (29 commits) ceph: flush dirty inodes before proceeding with remount ceph: fix unaligned access in ceph_send_cap_releases libceph: make ceph_pr_addr take an struct ceph_entity_addr pointer libceph: fix unaligned accesses in ceph_entity_addr handling rbd: don't assert on writes to snapshots rbd: client_mutex is never nested ceph: print inode number in __caps_issued_mask debugging messages ceph: just call get_session in __ceph_lookup_mds_session ceph: simplify arguments and return semantics of try_get_cap_refs ceph: fix comment over ceph_drop_caps_for_unlink ceph: move wait for mds request into helper function ceph: have ceph_mdsc_do_request call ceph_mdsc_submit_request ceph: after an MDS request, do callback and completions ceph: use pathlen values returned by set_request_path_attr ceph: use __getname/__putname in ceph_mdsc_build_path ceph: use ceph_mdsc_build_path instead of clone_dentry_name ceph: fix potential use-after-free in ceph_mdsc_build_path ceph: dump granular cap info in "caps" debugfs file ceph: make iterate_session_caps a public symbol ceph: fix NULL pointer deref when debugging is enabled ...
2019-05-16afs: Fix application of the results of a inline bulk status fetchDavid Howells
Fix afs_do_lookup() such that when it does an inline bulk status fetch op, it will update inodes that are already extant (something that afs_iget() doesn't do) and to cache permits for each inode created (thereby avoiding a follow up FS.FetchStatus call to determine this). Extant inodes need looking up in advance so that their cb_break counters before and after the operation can be compared. To this end, the inode pointers are cached so that they don't need looking up again after the op. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Pass pre-fetch server and volume break counts into afs_iget5_set()David Howells
Pass the server and volume break counts from before the status fetch operation that queried the attributes of a file into afs_iget5_set() so that the new vnode's break counters can be initialised appropriately. This allows detection of a volume or server break that happened whilst we were fetching the status or setting up the vnode. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix unlink to handle YFS.RemoveFile2 betterDavid Howells
Make use of the status update for the target file that the YFS.RemoveFile2 RPC op returns to correctly update the vnode as to whether the file was actually deleted or just had nlink reduced. Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Clear AFS_VNODE_CB_PROMISED if we detect callback expiryDavid Howells
Fix afs_validate() to clear AFS_VNODE_CB_PROMISED on a vnode if we detect any condition that causes the callback promise to be broken implicitly, including server break (cb_s_break), volume break (cb_v_break) or callback expiry. Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make vnode->cb_interest RCU safeDavid Howells
Use RCU-based freeing for afs_cb_interest struct objects and use RCU on vnode->cb_interest. Use that change to allow afs_check_validity() to use read_seqbegin_or_lock() instead of read_seqlock_excl(). This also requires the caller of afs_check_validity() to hold the RCU read lock across the call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Split afs_validate() so first part can be used under LOOKUP_RCUDavid Howells
Split afs_validate() so that the part that decides if the vnode is still valid can be used under LOOKUP_RCU conditions from afs_d_revalidate(). Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't save callback version and type fieldsDavid Howells
Don't save callback version and type fields as the version is about the format of the callback information and the type is relative to the particular RPC call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16Merge tag 'configfs-for-5.2' of git://git.infradead.org/users/hch/configfsLinus Torvalds
Pull configfs update from Christoph Hellwig: - a fix for an error path use after free (YueHaibing) * tag 'configfs-for-5.2' of git://git.infradead.org/users/hch/configfs: configfs: fix possible use-after-free in configfs_register_group
2019-05-16uapi, fsopen: use square brackets around "fscontext" [ver #2]Christian Brauner
Make the name of the anon inode fd "[fscontext]" instead of "fscontext". This is minor but most core-kernel anon inode fds already carry square brackets around their name: [eventfd] [eventpoll] [fanotify] [io_uring] [pidfd] [signalfd] [timerfd] [userfaultfd] For the sake of consistency lets do the same for the fscontext anon inode fd that comes with the new mount api. Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-16afs: Fix double inc of vnode->cb_breakDavid Howells
When __afs_break_callback() clears the CB_PROMISED flag, it increments vnode->cb_break to trigger a future refetch of the status and callback - however it also calls afs_clear_permits(), which also increments vnode->cb_break. Fix this by removing the increment from afs_clear_permits(). Whilst we're at it, fix the conditional call to afs_put_permits() as the function checks to see if the argument is NULL, so the check is redundant. Fixes: be080a6f43c4 ("afs: Overhaul permit caching"); Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix application of status and callback to be under same lockDavid Howells
When applying the status and callback in the response of an operation, apply them in the same critical section so that there's no race between checking the callback state and checking status-dependent state (such as the data version). Fix this by: (1) Allocating a joint {status,callback} record (afs_status_cb) before calling the RPC function for each vnode for which the RPC reply contains a status or a status plus a callback. A flag is set in the record to indicate if a callback was actually received. (2) These records are passed into the RPC functions to be filled in. The afs_decode_status() and yfs_decode_status() functions are removed and the cb_lock is no longer taken. (3) xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() no longer update the vnode. (4) xdr_decode_AFSCallBack() and xdr_decode_YFSCallBack() no longer update the vnode. (5) vnodes, expected data-version numbers and callback break counters (cb_break) no longer need to be passed to the reply delivery functions. Note that, for the moment, the file locking functions still need access to both the call and the vnode at the same time. (6) afs_vnode_commit_status() is now given the cb_break value and the expected data_version and the task of applying the status and the callback to the vnode are now done here. This is done under a single taking of vnode->cb_lock. (7) afs_pages_written_back() is now called by afs_store_data() rather than by the reply delivery function. afs_pages_written_back() has been moved to before the call point and is now given the first and last page numbers rather than a pointer to the call. (8) The indicator from YFS.RemoveFile2 as to whether the target file actually got removed (status.abort_code == VNOVNODE) rather than merely dropping a link is now checked in afs_unlink rather than in xdr_decode_YFSFetchStatus(). Supplementary fixes: (*) afs_cache_permit() now gets the caller_access mask from the afs_status_cb object rather than picking it out of the vnode's status record. afs_fetch_status() returns caller_access through its argument list for this purpose also. (*) afs_inode_init_from_status() now uses a write lock on cb_lock rather than a read lock and now sets the callback inside the same critical section. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix lock-wait/callback-break double lockingDavid Howells
__afs_break_callback() holds vnode->lock around its call of afs_lock_may_be_available() - which also takes that lock. Fix this by not taking the lock in __afs_break_callback(). Also, there's no point checking the granted_locks and pending_locks queues; it's sufficient to check lock_state, so move that check out of afs_lock_may_be_available() into __afs_break_callback() to replace the queue checks. Fixes: e8d6c554126b ("AFS: implement file locking") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Always get the reply timeDavid Howells
Always ask for the reply time from AF_RXRPC as it's used to calculate the callback expiry time and lock expiry times, so it's needed by most FS operations. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not setDavid Howells
Don't invalidate the callback promise on a directory if the AFS_VNODE_DIR_VALID flag is not set (which indicates that the directory contents are invalid, due to edit failure, callback break, page reclaim). The directory will be reloaded next time the directory is accessed, so clearing the callback flag at this point may race with a reload of the directory and cancel it's recorded callback promise. Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix order-1 allocation in afs_do_lookup()David Howells
afs_do_lookup() will do an order-1 allocation to allocate status records if there are more than 39 vnodes to stat. Fix this by allocating an array of {status,callback} records for each vnode we want to examine using vmalloc() if larger than a page. This not only gets rid of the order-1 allocation, but makes it easier to grow beyond 50 records for YFS servers. It also allows us to move to {status,callback} tuples for other calls too and makes it easier to lock across the application of the status and the callback to the vnode. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix calculation of callback expiry timeDavid Howells
Fix the calculation of the expiry time of a callback promise, as obtained from operations like FS.FetchStatus and FS.FetchData. The time should be based on the timestamp of the first DATA packet in the reply and the calculation needs to turn the ktime_t timestamp into a time64_t. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Get rid of afs_call::reply[]David Howells
Replace the afs_call::reply[] array with a bunch of typed members so that the compiler can use type-checking on them. It's also easier for the eye to see what's going on. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make dynamic root population wait uninterruptibly for proc_cells_lockDavid Howells
Make dynamic root population wait uninterruptibly for proc_cells_lock. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't pass the vnode pointer through into the inline bulk status opDavid Howells
Don't pass the vnode pointer through into the inline bulk status op. We want to process the status records outside of it anyway. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make some RPC operations non-interruptibleDavid Howells
Make certain RPC operations non-interruptible, including: (*) Set attributes (*) Store data We don't want to get interrupted during a flush on close, flush on unlock, writeback or an inode update, leaving us in a state where we still need to do the writeback or update. (*) Extend lock (*) Release lock We don't want to get lock extension interrupted as the file locks on the server are time-limited. Interruption during lock release is less of an issue since the lock is time-limited, but it's better to complete the release to avoid a several-minute wait to recover it. *Setting* the lock isn't a problem if it's interrupted since we can just return to the user and tell them they were interrupted - at which point they can elect to retry. (*) Silly unlink We want to remove silly unlink files if we can, rather than leaving them for the salvager to clear up. Note that whilst these calls are no longer interruptible, they do have timeouts on them, so if the server stops responding the call will fail with something like ETIME or ECONNRESET. Without this, the following: kAFS: Unexpected error from FS.StoreData -512 appears in dmesg when a pending store data gets interrupted and some processes may just hang. Additionally, make the code that checks/updates the server record ignore failure due to interruption if the main call is uninterruptible and if the server has an address list. The next op will check it again since the expiration time on the old list has past. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16rxrpc: Allow the kernel to mark a call as being non-interruptibleDavid Howells
Allow kernel services using AF_RXRPC to indicate that a call should be non-interruptible. This allows kafs to make things like lock-extension and writeback data storage calls non-interruptible. If this is set, signals will be ignored for operations on that call where possible - such as waiting to get a call channel on an rxrpc connection. It doesn't prevent UDP sendmsg from being interrupted, but that will be handled by packet retransmission. rxrpc_kernel_recv_data() isn't affected by this since that never waits, preferring instead to return -EAGAIN and leave the waiting to the caller. Userspace initiated calls can't be set to be uninterruptible at this time. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix error propagation from server record check/updateDavid Howells
afs_check/update_server_record() should be setting fc->error rather than fc->ac.error as they're called from within the cursor iteration function. afs_fs_cursor::error is where the error code of the attempt to call the operation on multiple servers is integrated and is the final result, whereas afs_addr_cursor::error is used to hold the error from individual iterations of the call loop. (Note there's also an afs_vl_cursor which also wraps afs_addr_cursor for accessing VL servers rather than file servers). Fix this by setting fc->error in the afs_check/update_server_record() so that any error incurred whilst talking to the VL server correctly propagates to the final result. This results in: kAFS: Unexpected error from FS.StoreData -512 being seen, even though the store-data op is non-interruptible. The error is actually coming from the server record update getting interrupted. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix the maximum lifespan of VL and probe callsDavid Howells
If an older AFS server doesn't support an operation, it may accept the call and then sit on it forever, happily responding to pings that make kafs think that the call is still alive. Fix this by setting the maximum lifespan of Volume Location service calls in particular and probe calls in general so that they don't run on endlessly if they're not supported. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix "kAFS: AFS vnode with undefined type 0"David Howells
Under some circumstances afs_select_fileserver() can return without setting an error in fc->error. The problem is in the no_more_servers segment where the accumulated errors from attempts to contact various servers are integrated into an afs_error-type variable 'e'. The resultant error code is, however, then abandoned. Fix this by getting the error out of e.error and putting it in 'error' so that the next part will store it into fc->error. Not doing this causes a report like the following: kAFS: AFS vnode with undefined type 0 kAFS: A=0 m=0 s=0 v=0 kAFS: vnode 20000025:1:1 because the code following the server selection loop then sees what it thinks is a successful invocation because fc.error is 0. However, it can't apply the status record because it's all zeros. The report is followed on the first instance with a trace looking something like: dump_stack+0x67/0x8e afs_inode_init_from_status.isra.2+0x21b/0x487 afs_fetch_status+0x119/0x1df afs_iget+0x130/0x295 afs_get_tree+0x31d/0x595 vfs_get_tree+0x1f/0xe8 fc_mount+0xe/0x36 afs_d_automount+0x328/0x3c3 follow_managed+0x109/0x20a lookup_fast+0x3bf/0x3f8 do_last+0xc3/0x6a4 path_openat+0x1af/0x236 do_filp_open+0x51/0xae ? _raw_spin_unlock+0x24/0x2d ? __alloc_fd+0x1a5/0x1b7 do_sys_open+0x13b/0x1e8 do_syscall_64+0x7d/0x1b3 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 4584ae96ae30 ("afs: Fix missing net error handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16io_uring: use wait_event_interruptible for cq_wait conditional waitJackie Liu
The previous patch has ensured that io_cqring_events contain smp_rmb memory barriers, Now we can use wait_event_interruptible to keep the code simple. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16io_uring: adjust smp_rmb inside io_cqring_eventsJackie Liu
Whenever smp_rmb is required to use io_cqring_events, keep smp_rmb inside the function io_cqring_events. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16io_uring: fix infinite wait in khread_park() on io_finish_async()Roman Penyaev
This fixes couple of races which lead to infinite wait of park completion with the following backtraces: [20801.303319] Call Trace: [20801.303321] ? __schedule+0x284/0x650 [20801.303323] schedule+0x33/0xc0 [20801.303324] schedule_timeout+0x1bc/0x210 [20801.303326] ? schedule+0x3d/0xc0 [20801.303327] ? schedule_timeout+0x1bc/0x210 [20801.303329] ? preempt_count_add+0x79/0xb0 [20801.303330] wait_for_completion+0xa5/0x120 [20801.303331] ? wake_up_q+0x70/0x70 [20801.303333] kthread_park+0x48/0x80 [20801.303335] io_finish_async+0x2c/0x70 [20801.303336] io_ring_ctx_wait_and_kill+0x95/0x180 [20801.303338] io_uring_release+0x1c/0x20 [20801.303339] __fput+0xad/0x210 [20801.303341] task_work_run+0x8f/0xb0 [20801.303342] exit_to_usermode_loop+0xa0/0xb0 [20801.303343] do_syscall_64+0xe0/0x100 [20801.303349] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [20801.303380] Call Trace: [20801.303383] ? __schedule+0x284/0x650 [20801.303384] schedule+0x33/0xc0 [20801.303386] io_sq_thread+0x38a/0x410 [20801.303388] ? __switch_to_asm+0x40/0x70 [20801.303390] ? wait_woken+0x80/0x80 [20801.303392] ? _raw_spin_lock_irqsave+0x17/0x40 [20801.303394] ? io_submit_sqes+0x120/0x120 [20801.303395] kthread+0x112/0x130 [20801.303396] ? kthread_create_on_node+0x60/0x60 [20801.303398] ret_from_fork+0x35/0x40 o kthread_park() waits for park completion, so io_sq_thread() loop should check kthread_should_park() along with khread_should_stop(), otherwise if kthread_park() is called before prepare_to_wait() the following schedule() never returns: CPU#0 CPU#1 io_sq_thread_stop(): io_sq_thread(): while(!kthread_should_stop() && !ctx->sqo_stop) { ctx->sqo_stop = 1; kthread_park() prepare_to_wait(); if (kthread_should_stop() { } schedule(); <<< nobody checks park flag, <<< so schedule and never return o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop it is quite possible, that kthread_should_park() check and the following kthread_parkme() is never called, because kthread_park() has not been yet called, but few moments later is is called and waits there for park completion, which never happens, because kthread has already exited: CPU#0 CPU#1 io_sq_thread_stop(): io_sq_thread(): ctx->sqo_stop = 1; while(!kthread_should_stop() && !ctx->sqo_stop) { <<< observe sqo_stop and exit the loop } if (kthread_should_park()) kthread_parkme(); <<< never called, since was <<< never parked kthread_park() <<< waits forever for park completion In the current patch we quit the loop by only kthread_should_park() check (kthread_park() is synchronous, so kthread_should_stop() is never observed), and we abandon ->sqo_stop flag, since it is racy. At the end of the io_sq_thread() we unconditionally call parmke(), since we've exited the loop by the park flag. Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16Btrfs: tree-checker: detect file extent items with overlapping rangesFilipe Manana
Having file extent items with ranges that overlap each other is a serious issue that leads to all sorts of corruptions and crashes (like a BUG_ON() during the course of __btrfs_drop_extents() when it traims file extent items). Therefore teach the tree checker to detect such cases. This is motivated by a recently fixed bug (race between ranged full fsync and writeback or adjacent ranges). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16Btrfs: fix race between ranged fsync and writeback of adjacent rangesFilipe Manana
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. This is an issue that exists for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent() made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16Btrfs: avoid fallback to transaction commit during fsync of files with holesFilipe Manana
When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytesQu Wenruo
Commit ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()") refactored add_pinned_bytes(), but during that refactor, there are two callers which add the pinned bytes instead of subtracting. That refactor misses those two caller, causing incorrect pinned bytes calculation and resulting unexpected ENOSPC error. Fix it by adding a new parameter @sign to restore the original behavior. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16btrfs: sysfs: don't leak memory when failing add fsidTobin C. Harding
A failed call to kobject_init_and_add() must be followed by a call to kobject_put(). Currently in the error path when adding fs_devices we are missing this call. This could be fixed by calling btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid(). Here we choose the second option because it prevents the slightly unusual error path handling requirements of kobject from leaking out into btrfs functions. Add a call to kobject_put() in the error path of kobject_add_and_init(). This causes the release method to be called if kobject_init_and_add() fails. open_tree() is the function that calls btrfs_sysfs_add_fsid() and the error code in this function is already written with the assumption that the release method is called during the error path of open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the fail_fsdev_sysfs label). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16btrfs: sysfs: Fix error path kobject memory leakTobin C. Harding
If a call to kobject_init_and_add() fails we must call kobject_put() otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree). Add call to kobject_put() in the error path of call to kobject_init_and_add(). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16afs: Fix cell DNS lookupDavid Howells
Currently, once configured, AFS cells are looked up in the DNS at regular intervals - which is a waste of resources if those cells aren't being used. It also leads to a problem where cells preloaded, but not configured, before the network is brought up end up effectively statically configured with no VL servers and are unable to get any. Fix this by not doing the DNS lookup until the first time a cell is touched. It is waited for if we don't have any cached records yet, otherwise the DNS lookup to maintain the record is done in the background. This has the downside that the first time you touch a cell, you now have to wait for the upcall to do the required DNS lookups rather than them already being cached. Further, the record is not replaced if the old record has at least one server in it and the new record doesn't have any. Fixes: 0a5143f2f89c ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15cifs: add support for SEEK_DATA and SEEK_HOLERonnie Sahlberg
Add llseek op for SEEK_DATA and SEEK_HOLE. Improves xfstests/285,286,436,445,448 and 490 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-05-15Fixed https://bugzilla.kernel.org/show_bug.cgi?id=202935 allow write on the ↵Kovtunenko Oleksandr
same file Copychunk allows source and target to be on the same file. For details on restrictions see MS-SMB2 3.3.5.15.6 Signed-off-by: Kovtunenko Oleksandr <alexander198961@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>