summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-06-19Btrfs: new helper btrfs_bio_clone_partialLiu Bo
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned bio that only owns a part of the original bio's data. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: use bio_clone_fast to clone our bioLiu Bo
For raid1 and raid10, we clone the original bio to the bios which are then sent to different disks. Right now we use bio_clone_bioset to create a clone bio with iterating bi_io_vec to initialize it. This changes it to use bio_clone_fast() which creates a clone bio but only copies the bi_io_vec pointer instead of iterating bi_io_vec. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: don't pass the inode through clean_io_failureJosef Bacik
Instead pass around the failure tree and the io tree. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: remove inode argument from repair_io_failureJosef Bacik
Once we remove the btree_inode we won't have an inode to pass anymore, just pass the fs_info directly and the inum since we use that to print out the repair message. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: replace tree->mapping with tree->private_dataJosef Bacik
For extent_io tree's we have carried the address_mapping of the inode around in the io tree in order to pull the inode back out for calling into various tree ops hooks. This works fine when everything that has an extent_io_tree has an inode. But we are going to remove the btree_inode, so we need to change this. Instead just have a generic void * for private data that we can initialize with, and have all the tree ops use that instead. This had a lot of cascading changes but should be relatively straightforward. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor reordering of the callback prototypes ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: Add quota_override knob into sysfsSargun Dhillon
This patch adds the read-write attribute quota_override into sysfs. Any process which has CAP_SYS_RESOURCE can set this flag to on, and once it is set to true, processes with CAP_SYS_RESOURCE can exceed the quota. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com> [ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCESargun Dhillon
This patch introduces the quota override flag to btrfs_fs_info, and a change to quota limit checking code to temporarily allow for quota to be overridden for processes with CAP_SYS_RESOURCE. It's useful for administrative programs, such as log rotation, that may need to temporarily use more disk space in order to free up a greater amount of overall disk space without yielding more disk space to the rest of userland. Eventually, we may want to add the idea of an operator-specific quota, operator reserved space, or something else to allow for administrative override, but this is perhaps the simplest solution. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: David Sterba <dsterba@suse.com> [ minor changelog edits ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: Convert fs_info->free_chunk_space to atomic64_tNikolay Borisov
The ->free_chunk_space variable is used to track the unallocated space and access to it is protected by a spinlock, which is not used for anything else. Make the code a bit self-explanatory by switching the variable to an atomic64_t type and kill the spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: add framework to handle device flush error as a volumeAnand Jain
This adds comments to the flush error handling part of the code, and hopes to maintain the same logic with a framework which can be used to handle the errors at the volume level. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: remove obsolete FIXMEs in qgroup ioctlsDaichou
These FIXMEs were already addressed in 2013. All functions check for qgroup existence: * btrfs_add_qgroup_relation * btrfs_ioctl_qgroup_create * btrfs_limit_qgroup * btrfs_del_qgroup_relation Signed-off-by: Daichou <tommy0705c@gmail.com> [ enhance and reformat changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: remove an unused variableDan Carpenter
"item" is never used. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: kmap() can't failFabian Frederick
Remove NULL test on kmap() as it will always return a valid pointer. Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19xfs: remove bli from AIL before release on transaction abortBrian Foster
When a buffer is modified, logged and committed, it ultimately ends up sitting on the AIL with a dirty bli waiting for metadata writeback. If another transaction locks and invalidates the buffer (freeing an inode chunk, for example) in the meantime, the bli is flagged as stale, the dirty state is cleared and the bli remains in the AIL. If a shutdown occurs before the transaction that has invalidated the buffer is committed, the transaction is ultimately aborted. The log items are flagged as such and ->iop_unlock() handles the aborted items. Because the bli is clean (due to the invalidation), ->iop_unlock() unconditionally releases it. The log item may still reside in the AIL, however, which means the I/O completion handler may still run and attempt to access it. This results in assert failure due to the release of the bli while still present in the AIL and a subsequent NULL dereference and panic in the buffer I/O completion handling. This can be reproduced by running generic/388 in repetition. To avoid this problem, update xfs_buf_item_unlock() to first check whether the bli is aborted and if so, remove it from the AIL before it is released. This ensures that the bli is no longer accessed during the shutdown sequence after it has been freed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: release bli from transaction properly on fs shutdownBrian Foster
If a filesystem shutdown occurs with a buffer log item in the CIL and a log force occurs, the ->iop_unpin() handler is generally expected to tear down the bli properly. This entails freeing the bli memory and releasing the associated hold on the buffer so it can be released and the filesystem unmounted. If this sequence occurs while ->bli_refcount is elevated (i.e., another transaction is open and attempting to modify the buffer), however, ->iop_unpin() may not be responsible for releasing the bli. Instead, the transaction may release the final ->bli_refcount reference and thus xfs_trans_brelse() is responsible for tearing down the bli. While xfs_trans_brelse() does drop the reference count, it only attempts to release the bli if it is clean (i.e., not in the CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty in the CIL as noted above, this ends up skipping the last opportunity to release the bli. In turn, this leaves the hold on the buffer and causes an unmount hang. This can be reproduced by running generic/388 in repetition. Update xfs_trans_brelse() to handle this shutdown corner case correctly. If the final bli reference is dropped and the filesystem is shutdown, remove the bli from the AIL (if necessary) and release the bli to drop the buffer hold and ensure an unmount does not hang. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: avoid harmless gcc-7 warningsArnd Bergmann
gcc-7 flags the use of integer math inside of a condition as a potential bug: fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format': fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context] fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context] There is already a helper function for testing the di_forkoff field for zero, so let's use that instead to shut up the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: remove lsn relevant fields from xfs_trans structure and its usersShan Hai
The t_lsn is not used anymore and the t_commit_lsn is used as a tmp storage for the checkpoint sequence number only in the current code. And the start/commit lsn are tracked as a transaction group tag in the xfs_cil_ctx instead of a single transaction, so remove them from the xfs_trans structure and their users to match with the design. Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: remove XFS_HSIZEChristoph Hellwig
XFS_HSIZE is an extremly confusing way to calculate the size of handle_t. Given that handle_t always only had two sizes, and one of them isn't even covered by XFS_HSIZE to start with just remove the macro and use a constant sizeof expression. Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: dump transaction usage details on log reservation overrunBrian Foster
If a transaction log reservation overrun occurs, the ticket data associated with the reservation is dumped in xfs_log_commit_cil(). This occurs long after the transaction items and details have been removed from the transaction and effectively lost. This limited set of ticket data provides very little information to support debugging transaction overruns based on the typical report. To improve transaction log reservation overrun reporting, create a helper to dump transaction details such as log items, log vector data, etc., as well as the underlying ticket data for the transaction. Move the overrun detection from xfs_log_commit_cil() to xlog_cil_insert_items() so it occurs prior to migration of the logged items to the CIL. Call the new helper such that it is able to dump this transaction data before it is lost. Also, warn on overrun to provide callstack context for the offending transaction and include a few additional messages from xlog_cil_insert_items() to display the reservation consumed locally for overhead such as log vector headers, split region headers and the context ticket. This provides a complete general breakdown of the reservation consumption of a transaction when/if it happens to overrun the reservation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: refactor xlog_cil_insert_items() to facilitate transaction dumpBrian Foster
Transaction reservation overrun detection currently occurs too late to print useful information about the offending transaction. Ideally, the transaction data is printed before the associated log items are moved from the transaction to the CIL, which occurs in xlog_cil_insert_items(), such that details of the items logged by the transaction are available for analysis. Refactor xlog_cil_insert_items() to facilitate moving tx overrun detection to this function. Update the function to track each bit of extra log reservation stolen from the transaction (i.e., such as for the CIL context ticket) and perform the log item migration as the last operation before the CIL lock is released. This creates a context where the transaction reservation consumption has been fully calculated when the log items are moved to the CIL. This patch makes no functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: separate shutdown from ticket reservation print helperBrian Foster
xlog_print_tic_res() pre-dates delayed logging and the committed items list (CIL) and thus retains some factoring warts, such as hard coded function names in the output and the fact that it induces a shutdown. In preparation for more detailed logging of regular transaction overrun situations, refactor xlog_print_tic_res() to be slightly more generic. Reword some of the warning messages and pull the shutdown into the callers. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: define fatal assert build time tunableBrian Foster
While configurable at runtime, the DEBUG mode assert failure behavior is usually either desired or not for a particular situation. For example, developers using kernel modules may prefer for fatal asserts to remain disabled across module reloads while QE engineers doing broad regression testing may prefer to have fatal asserts enabled on boot to facilitate data collection for bug reports. To provide a compromise/convenience for developers, create a Kconfig option that sets the default value of the DEBUG mode 'bug_on_assert' sysfs tunable. The default behavior remains to trigger kernel BUGs on assert failures to preserve existing behavior across kernel configuration updates with DEBUG mode enabled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: define bug_on_assert debug mode sysfs tunableBrian Foster
In DEBUG mode, assert failures unconditionally trigger a kernel BUG. This is useful in diagnostic situations to panic a system and collect detailed state information at the time of a failure. This can also cause problems in cases where DEBUG mode code is desired but it is preferable not trigger kernel BUGs on assert failure. For example, during development of new code or during certain xfstests tests that intentionally cause corruption and test the kernel for survival (but otherwise may expect to trigger assert failures). To provide additional flexibility, create the <sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert failure behavior at runtime. This tunable is only available in DEBUG mode and is enabled by default to preserve existing default behavior. When disabled, assert failures in DEBUG mode result in kernel warnings. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: try to avoid blowing out the transaction reservation when bunmaping a ↵Darrick J. Wong
shared extent In a pathological scenario where we are trying to bunmapi a single extent in which every other block is shared, it's possible that trying to unmap the entire large extent in a single transaction can generate so many EFIs that we overflow the transaction reservation. Therefore, use a heuristic to guess at the number of blocks we can safely unmap from a reflink file's data fork in an single transaction. This should prevent problems such as the log head slamming into the tail and ASSERTs that trigger because we've exceeded the transaction reservation. Note that since bunmapi can fail to unmap the entire range, we must also teach the deferred unmap code to roll into a new transaction whenever we get low on reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: random edits, all bugs are my fault] Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19xfs: refactor dir2 leaf readahead shadow buffer clevernessDarrick J. Wong
Currently, the dir2 leaf block getdents function uses a complex state tracking mechanism to create a shadow copy of the block mappings and then uses the shadow copy to schedule readahead. Since the read and readahead functions are perfectly capable of reading the mappings themselves, we can tear all that out in favor of a simpler function that simply keeps pushing the readahead window further out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19xfs: push buffer of flush locked dquot to avoid quotacheck deadlockBrian Foster
Reclaim during quotacheck can lead to deadlocks on the dquot flush lock: - Quotacheck populates a local delwri queue with the physical dquot buffers. - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties all of the dquots. - Reclaim kicks in and attempts to flush a dquot whose buffer is already queud on the quotacheck queue. The flush succeeds but queueing to the reclaim delwri queue fails as the backing buffer is already queued. The flush unlock is now deferred to I/O completion of the buffer from the quotacheck queue. - The dqadjust bulkstat continues and dirties the recently flushed dquot once again. - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the flush lock to update the backing buffers with the in-core recalculated values. It deadlocks on the redirtied dquot as the flush lock was already acquired by reclaim, but the buffer resides on the local delwri queue which isn't submitted until the end of quotacheck. This is reproduced by running quotacheck on a filesystem with a couple million inodes in low memory (512MB-1GB) situations. This is a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists"), which removed a trylock and buffer I/O submission from the quotacheck dquot flush sequence. Quotacheck first resets and collects the physical dquot buffers in a delwri queue. Then, it traverses the filesystem inodes via bulkstat, updates the in-core dquots, flushes the corrected dquots to the backing buffers and finally submits the delwri queue for I/O. Since the backing buffers are queued across the entire quotacheck operation, dquot reclaim cannot possibly complete a dquot flush before quotacheck completes. Therefore, quotacheck must submit the buffer for I/O in order to cycle the flush lock and flush the dirty in-core dquot to the buffer. Add a delwri queue buffer push mechanism to submit an individual buffer for I/O without losing the delwri queue status and use it from quotacheck to avoid the deadlock. This restores quotacheck behavior to as before the regression was introduced. Reported-by: Martin Svec <martin.svec@zoner.cz> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19mm: larger stack guard gap, between vmasHugh Dickins
Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-18blk: replace bioset_create_nobvec() with a flags arg to bioset_create()NeilBrown
"flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18Merge tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A fix for an old ceph ->fh_to_* bug from Luis and two timestamp fixups from Zheng, prompted by the ongoing y2038 work" * tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client: ceph: unify inode i_ctime update ceph: use current_kernel_time() to get request time stamp ceph: check i_nlink while converting a file handle to dentry
2017-06-17ufs: fix the logics for tail relocationAl Viro
* original hysteresis loop got broken by typo back in 2002; now it never switches out of OPTTIME state. Fixed. * critical levels for switching from OPTTIME to OPTSPACE and back ought to be calculated once, at mount time. * we should use mul_u64_u32_div() for those calculations, now that ->s_dsize is 64bit. * to quote Kirk McKusick (in 1995 FreeBSD commit message): The threshold for switching from time-space and space-time is too small when minfree is 5%...so make it stay at space in this case. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17ufs_iget(): fail with -ESTALE on deleted inodeAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17fix signedness of timestamps on ufs1Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-17Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "One more bugfix for you for 4.12-rc6 to fix something that came up in an earlier rc: - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y" * tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-17Merge branch 'ufs-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ufs fixes from Al Viro: "Fix assorted ufs bugs: a couple of deadlocks, fs corruption in truncate(), oopsen on tail unpacking and truncate when racing with vmscan, mild fs corruption (free blocks stats summary buggered, *BSD fsck would complain and fix), several instances of broken logics around reserved blocks (starting with "check almost never triggers when it should" and then there are issues with sufficiently large UFS2)" [ Note: ufs hasn't gotten any loving in a long time, because nobody really seems to use it. These ufs fixes are triggered by people actually caring now, not some sudden influx of new bugs. - Linus ] * 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ufs_truncate_blocks(): fix the case when size is in the last direct block ufs: more deadlock prevention on tail unpacking ufs: avoid grabbing ->truncate_mutex if possible ufs_get_locked_page(): make sure we have buffer_heads ufs: fix s_size/s_dsize users ufs: fix reserved blocks check ufs: make ufs_freespace() return signed ufs: fix logics in "ufs: make fsck -f happy"
2017-06-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes; a leak in mntns_install() caught by Andrei (this cycle regression) + d_invalidate() softlockup fix - that had been reported by a bunch of people lately, but the problem is pretty old" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: don't forget to put old mntns in mntns_install Hang/soft lockup in d_invalidate with simultaneous calls
2017-06-17userfaultfd: shmem: handle coredumping in handle_userfault()Andrea Arcangeli
Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to __get_user_pages(). shmem as opposed has no special FOLL_DUMP handling there so handle_mm_fault() is invoked without mmap_sem and ends up calling handle_userfault() that isn't expecting to be invoked without mmap_sem held. This makes handle_userfault() fail immediately if invoked through shmem_vm_ops->fault during coredumping and solves the problem. The side effect is a BUG_ON with no lock held triggered by the coredumping process which exits. Only 4.11 is affected, pre-4.11 anon memory holes are skipped in __get_user_pages by checking FOLL_DUMP explicitly against empty pagetables (mm/gup.c:no_page_table()). It's zero cost as we already had a check for current->flags to prevent futex to trigger userfaults during exit (PF_EXITING). Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-16Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfsLinus Torvalds
Pull configfs updates from Christoph Hellwig: "A fix from Nic for a race seen in production (including a stable tag). And while I'm sending you this I'm also sneaking in a trivial new helper from Bart so that we don't need inter-tree dependencies for the next merge window" * tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs: configfs: Introduce config_item_get_unless_zero() configfs: Fix race between create_link and configfs_rmdir
2017-06-16fs: pass on flags in compat_writevChristoph Hellwig
Fixes: 793b80ef14af ("vfs: pass a flags argument to vfs_readv/vfs_writev") Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-15x86, dax: replace clear_pmem() with open coded memset + dax_ops->flushDan Williams
The clear_pmem() helper simply combines a memset() plus a cache flush. Now that the flush routine is optionally provided by the dax device driver we can avoid unnecessary cache management on dax devices fronting volatile memory. With clear_pmem() gone we can follow on with a patch to make pmem cache management completely defined within the pmem driver. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15filesystem-dax: convert to dax_flush()Dan Williams
Filesystem-DAX flushes caches whenever it writes to the address returned through dax_direct_access() and when writing back dirty radix entries. That flushing is only required in the pmem case, so the dax_flush() helper skips cache management work when the underlying driver does not specify a flush method. We still do all the dirty tracking since the radix entry will already be there for locking purposes. However, the work to clean the entry will be a nop for some dax drivers. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15filesystem-dax: convert to dax_copy_from_iter()Dan Williams
Now that all possible providers of the dax_operations copy_from_iter method are implemented, switch filesytem-dax to call the driver rather than copy_to_iter_pmem. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15fs: don't forget to put old mntns in mntns_installAndrei Vagin
Fixes: 4f757f3cbf54 ("make sure that mntns_install() doesn't end up with referral for root") Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15Hang/soft lockup in d_invalidate with simultaneous callsAl Viro
It's not hard to trigger a bunch of d_invalidate() on the same dentry in parallel. They end up fighting each other - any dentry picked for removal by one will be skipped by the rest and we'll go for the next iteration through the entire subtree, even if everything is being skipped. Morevoer, we immediately go back to scanning the subtree. The only thing we really need is to dissolve all mounts in the subtree and as soon as we've nothing left to do, we can just unhash the dentry and bugger off. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "This fixes a bug on sparc where we may dereference freed stack memory" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15ufs_truncate_blocks(): fix the case when size is in the last direct blockAl Viro
The logics when deciding whether we need to do anything with direct blocks is broken when new size is within the last direct block. It's better to find the path to the last byte _not_ to be removed and use that instead of the path to the beginning of the first block to be freed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15ufs: more deadlock prevention on tail unpackingAl Viro
->s_lock is not needed for ufs_change_blocknr() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15ufs: avoid grabbing ->truncate_mutex if possibleAl Viro
tail unpacking is done in a wrong place; the deadlocks galore is best dealt with by doing that in ->write_iter() (and switching to iomap, while we are at it), but that's rather painful to backport. The trouble comes from grabbing pages that cover the beginning of tail from inside of ufs_new_fragments(); ongoing pageout of any of those is going to deadlock on ->truncate_mutex with process that got around to extending the tail holding that and waiting for page to get unlocked, while ->writepage() on that page is waiting on ->truncate_mutex. The thing is, we don't need ->truncate_mutex when the fragment we are trying to map is within the tail - the damn thing is allocated (tail can't contain holes). Let's do a plain lookup and if the fragment is present, we can just pretend that we'd won the race in almost all cases. The only exception is a fragment between the end of tail and the end of block containing tail. Protect ->i_lastfrag with ->meta_lock - read_seqlock_excl() is sufficient. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14ufs_get_locked_page(): make sure we have buffer_headsAl Viro
callers rely upon that, but find_lock_page() racing with attempt of page eviction by memory pressure might have left us with * try_to_free_buffers() successfully done * __remove_mapping() failed, leaving the page in our mapping * find_lock_page() returning an uptodate page with no buffer_heads attached. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14ufs: fix s_size/s_dsize usersAl Viro
For UFS2 we need 64bit variants; we even store them in uspi, but use 32bit ones instead. One wrinkle is in handling of reserved space - recalculating it every time had been stupid all along, but now it would become really ugly. Just calculate it once... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14ufs: fix reserved blocks checkAl Viro
a) honour ->s_minfree; don't just go with default (5) b) don't bother with capability checks until we know we'll need them Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>