summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-05-10xfs: add bmapi nodiscard flagBrian Foster
Freed extents are unconditionally discarded when online discard is enabled. Define XFS_BMAPI_NODISCARD to allow callers to bypass discards when unnecessary. For example, this will be useful for eofblocks trimming. This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: get rid of the log item descriptorDave Chinner
It's just a connector between a transaction and a log item. There's a 1:1 relationship between a log item descriptor and a log item, and a 1:1 relationship between a log item descriptor and a transaction. Both relationships are created and terminated at the same time, so why do we even have the descriptor? Replace it with a specific list_head in the log item and a new log item dirtied flag to replace the XFS_LID_DIRTY flag. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [darrick: fix up deferred agfl intent finish_item use of LID_DIRTY] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: add some more debug checks to buffer log item reuseDave Chinner
Just to make sure the item isn't associated with another transaction when we try to reuse it. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: fix double ijoin in xfs_reflink_clear_inode_flag()Dave Chinner
xfs_reflink_clear_inode_flag double-joins an inode to a transaction, which is not allowed. Fix that and document that the caller must have already joined it. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> [darrick: edit out trace for nonexistent ASSERT] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: fix double ijoin in xfs_reflink_cancel_cow_rangeDave Chinner
xfs_reflink_cancel_cow_range joins an inode twice to the same transaction. This is not allowed, so fix it and document that the callers of xfs_reflink_cancel_cow_blocks() must have already joined the inode to the permanent transaction passed in. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> [darrick: edited the commit log to remove trace for nonexistent ASSERT] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: fix double ijoin in xfs_inactive_symlink_rmt()Dave Chinner
xfs_inactive_symlink_rmt() does something nasty - it joins an inode into a transaction it is already joined to. This means the inode can have multiple log item descriptors attached to the transaction for it. This breaks teh 1:1 mapping that is supposed to exist between the log item and log item descriptor. This results in the log item being processed twice during transaction commit and CIL formatting, and there are lots of other potential issues tha arise from double processing of log items in the transaction commit state machine. In this case, the inode is already held by the rolling transaction returned from xfs_defer_finish(), so there's no need to join it again. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: don't assert fail with AIL lock heldDave Chinner
Been hitting AIL ordering assert failures recently, but been unable to trace them down because the system immediately hangs up onteh spinlock that was held when this assert fires: XFS: Assertion failed: XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) <= 0, file: fs/xfs/xfs_trans_ail.c, line: 52 Move the assertions outside of the spinlock so the corpse can be dissected. Thanks to Brian Foster for supplying a clean way of doing this. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: adder caller IP to xfs_defer* tracepointsDave Chinner
So it's clear in the trace where they are being called from. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: add tracing to high level transaction operationsDave Chinner
Because currently we have no idea what the transaction context we are operating in is, and I need to know that information to track down bugs in multiple log item joins to transactions. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: log item flags are racyDave Chinner
The log item flags contain a field that is protected by the AIL lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to set and clear these flags, but most of the updates and checks are not done with the AIL lock held and so are susceptible to update races. Fix this by changing the log item flags to use atomic bitops rather than be reliant on the AIL lock for update serialisation. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-10xfs: add missing rmap error returnDarrick J. Wong
xfs_rmap_lookup_le_range can return errors, so we need to check for them and bail out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-10ext4: use raw i_version value for ea_inodeEryu Guan
Currently, creating large xattr (e.g. 2k) in ea_inode would cause ea_inode refcount corruption, e.g. Pass 4: Checking reference counts Extended attribute inode 13 ref count is 0, should be 1. Fix? no This is because that we save the lower 32bit of refcount in inode->i_version and store it in raw_inode->i_disk_version on disk. But since commit ee73f9a52a34 ("ext4: convert to new i_version API"), we load/store modified i_disk_version from/to disk instead of raw value, which causes on-disk ea_inode refcount corruption. Fix it by loading/storing raw i_version/i_disk_version, because it's a self-managed value in this case. Fixes: ee73f9a52a34 ("ext4: convert to new i_version API") Cc: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-10ext4: use XATTR_CREATE in ext4_initxattrs()Eryu Guan
I hit ENOSPC error when creating new file in a newly created ext4 with ea_inode feature enabled, if selinux is enabled and ext4 is mounted without any selinux context. e.g. mkfs -t ext4 -O ea_inode -F /dev/sda5 mount /dev/sda5 /mnt/ext4 touch /mnt/ext4/testfile # got ENOSPC here It turns out that we run out of journal credits in ext4_xattr_set_handle() when creating new selinux label for the newly created inode. This is because that in __ext4_new_inode() we use __ext4_xattr_set_credits() to calculate the reserved credits for new xattr, with the 'is_create' argument being true, which implies less credits in the ea_inode case. But we calculate the required credits in ext4_xattr_set_handle() with 'is_create' being false, which means we need more credits if ea_inode feature is enabled. So we don't have enough credits and error out with ENOSPC. Fix it by simply calling ext4_xattr_set_handle() with XATTR_CREATE flag in ext4_initxattrs(), so we end up with requiring less credits than reserved. The semantic of XATTR_CREATE is "Perform a pure create, which fails if the named attribute exists already." (from setxattr(2)), which is fine in this case, because we only call ext4_initxattrs() on newly created inode. Fixes: af65207c76ce ("ext4: fix __ext4_new_inode() journal credits calculation") Cc: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-10ext4: make function ‘ext4_getfsmap_find_fixed_metadata’ staticMathieu Malaterre
Since function ‘ext4_getfsmap_find_fixed_metadata’ can be made static, make it so. Remove the following gcc warning (W=1): fs/ext4/fsmap.c:405:5: warning: no previous prototype for ‘ext4_getfsmap_find_fixed_metadata’ [-Wmissing-prototypes] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-10ceph: fix iov_iter issues in ceph_direct_read_write()Ilya Dryomov
dio_get_pagev_size() and dio_get_pages_alloc() introduced in commit b5b98989dc7e ("ceph: combine as many iovec as possile into one OSD request") assume that the passed iov_iter is ITER_IOVEC. This isn't the case with splice where it ends up poking into the guts of ITER_BVEC or ITER_PIPE iterators, causing lockups and crashes easily reproduced with generic/095. Rather than trying to figure out gap alignment and stuff pages into a page vector, add a helper for going from iov_iter to a bio_vec array and make use of the new CEPH_OSD_DATA_TYPE_BVECS code. Fixes: b5b98989dc7e ("ceph: combine as many iovec as possile into one OSD request") Link: http://tracker.ceph.com/issues/18130 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Tested-by: Luis Henriques <lhenriques@suse.com>
2018-05-10ceph: fix rsize/wsize capping in ceph_direct_read_write()Ilya Dryomov
rsize/wsize cap should be applied before ceph_osdc_new_request() is called. Otherwise, if the size is limited by the cap instead of the stripe unit, ceph_osdc_new_request() would setup an extent op that is bigger than what dio_get_pages_alloc() would pin and add to the page vector, triggering asserts in the messenger. Cc: stable@vger.kernel.org Fixes: 95cca2b44e54 ("ceph: limit osd write size") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-05-09proc: Use underscores for SSBD in 'status'Konrad Rzeszutek Wilk
The style for the 'status' file is CamelCase or this. _. Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations") Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-09xfs: bmap debugging should never panic the systemDarrick J. Wong
Don't panic() the system if the bmap records are garbage, just call ASSERT which gives us the same backtrace but enables developers to control if the system goes down or not. This makes debugging with generic/388 much easier because it won't reboot the machine midway through a run just because btree_read_bufl returns EIO when the fs has already shut down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-09xfs: defer agfl frees from directory op transactionsBrian Foster
Directory operations can perform block allocations as entries are added/removed from directories. Defer AGFL block frees from the remaining directory operation transactions. This covers the hard link, remove and rename operations. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: defer frees from common inode allocation pathsBrian Foster
Inode allocation can require block allocation for physical inode chunk allocation, inode btree record insertion, and/or directory block allocation for entry insertion. Any of these block allocation requests can require AGFL fixups prior to the actual allocation. Update the common file creation transacions to defer AGFL frees from these contexts to avoid too much log reservation consumption per-transaction. Since these transactions are already passed down through the btree cursors and da_args structure, this simply requires to attach dfops to the transaction. Note that this covers tr_create, tr_mkdir and tr_symlink. Other transactions such as tr_create_tmpfile do not already make use of deferred operations and so are left alone for the time being. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: defer agfl frees from inode inactivationBrian Foster
XFS inode chunks are already freed via deferred operations (which now also defer AGFL block frees), but inode btree blocks are freed directly in the associated context. This has been known to lead to log reservation overruns in particular workloads where an inobt block free may require several AGFL block frees (and thus several allocation btree modifications) before the inobt block itself is actually freed. To avoid this problem, defer the frees of any AGFL blocks before the inobt block free takes place. This requires passing the dfops from xfs_inactive_ifree() down through the inobt ->[alloc|free]_block() callouts, which essentially only requires to attach the dfops to the transaction since it is already carried all the way through to the inobt update and allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: defer agfl block frees from deferred ops processing contextBrian Foster
Now that AGFL block frees are deferred when dfops is set in the transaction, start deferring AGFL block frees from contexts that are known to push the limits of existing log reservations. The first such context is deferred operation processing itself. This primarily targets deferred extent frees (such as file extents and inode chunks), but in doing so covers all allocation operations that occur in deferred operation processing context. Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the processing sequence. This means that any AGFL block frees due to allocation events result in the addition of new EFIs to the dfops rather than being processed immediately. xfs_defer_finish() rolls the transaction at least once more to process the frees of the AGFL blocks back to the allocation btrees and returns once the AGFL is rectified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: defer agfl block frees when dfops is availableBrian Foster
The AGFL fixup code executes before every block allocation/free and rectifies the AGFL based on the current, dynamic allocation requirements of the fs. The AGFL must hold a minimum number of blocks to satisfy a worst case split of the free space btrees caused by the impending allocation operation. The AGFL is also updated to maintain the implicit requirement for a minimum number of free slots to satisfy a worst case join of the free space btrees. Since the AGFL caches individual blocks, AGFL reduction typically involves multiple, single block frees. We've had reports of transaction overrun problems during certain workloads that boil down to AGFL reduction freeing multiple blocks and consuming more space in the log than was reserved for the transaction. Since the objective of freeing AGFL blocks is to ensure free AGFL free slots are available for the upcoming allocation, one way to address this problem is to release surplus blocks from the AGFL immediately but defer the free of those blocks (similar to how file-mapped blocks are unmapped from the file in one transaction and freed via a deferred operation) until the transaction is rolled. This turns AGFL reduction into an operation with predictable log reservation consumption. Add the capability to defer AGFL block frees when a deferred ops list is available to the AGFL fixup code. Add a dfops pointer to the transaction to carry dfops through various contexts to the allocator context. Deferring AGFL frees is conditional behavior based on whether the transaction pointer is populated. The long term objective is to reuse the transaction pointer to clean up all unrelated callchains that pass dfops on the stack along with a transaction and in doing so, consistently defer AGFL blocks from the allocator. A bit of customization is required to handle deferred completion processing because AGFL blocks are accounted against a per-ag reservation pool and AGFL blocks are not inserted into the extent busy list when freed (they are inserted when used and released back to the AGFL). Reuse the majority of the existing deferred extent free infrastructure and customize it appropriately to handle AGFL blocks. Note that this patch only adds infrastructure. It does not change behavior because no callers have been updated to pass ->t_agfl_dfops into the allocation code. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: create agfl block free helper functionBrian Foster
Refactor the AGFL block free code into a new helper such that it can be invoked from deferred context. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: print specific dqblk that failed verifiersEric Sandeen
Rather than printing the top of the buffer that held a corrupted dqblk, restructure things to print out the specific one that failed by pushing the calls to the verifier_error function down into the verifier which iterates over the buffer and detects the error. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: add full xfs_dqblk verifierEric Sandeen
Add an xfs_dqblk verifier so that it can check the uuid on V5 filesystems; it calls the existing xfs_dquot_verify verifier to validate the xfs_disk_dquot_t contained inside it. This lets us move the uuid verification out of the crc verifier, which makes little sense. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: pass full xfs_dqblk to repair during quotacheckEric Sandeen
It's a bit dicey to pass in the smaller xfs_disk_dquot and then cast it to something larger; pass in the full xfs_dqblk so we know the caller has sent us the right thing. Rename the function to xfs_dqblk_repair for clarity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: check type in quota verifier during quotacheckEric Sandeen
During quotacheck we send in the quota type, so verify that as well. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: remove unused flags arg from xfs_dquot_verifyEric Sandeen
Long ago the flags argument was used to determine whether to issue warnings about corruptions, but that's done elsewhere now and the flag is unused here, so remove it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: clean up locking in xfs_file_iomap_beginDave Chinner
Rather than checking what kind of locking is needed in a helper function and then jumping through hoops to do the locking in line, move the locking to the helper function that does all the checks and rename it to xfs_ilock_for_iomap(). This also allows us to hoist all the nonblocking checks up into the locking helper, further simplifier the code flow in xfs_file_iomap_begin() and making it easier to understand. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: simplify xfs_file_iomap_begin() logicDave Chinner
The current logic that determines whether allocation should be done has grown somewhat spaghetti like with the addition of IOMAP_NOWAIT functionality. Separate out each of the different cases into single, obvious checks to get rid most of the nested IOMAP_NOWAIT checks in the allocation logic. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09iomap: Use FUA for pure data O_DSYNC DIO writesDave Chinner
If we are doing direct IO writes with datasync semantics, we often have to flush metadata changes along with the data write. However, if we are overwriting existing data, there are no metadata changes that we need to flush. In this case, optimising the IO by using FUA write makes sense. We know from the IOMAP_F_DIRTY flag as to whether a specific inode requires a metadata flush - this is currently used by DAX to ensure extent modification as stable in page fault operations. For direct IO writes, we can use it to determine if we need to flush metadata or not once the data is on disk. Hence if we have been returned a mapped extent that is not new and the IO mapping is not dirty, then we can use a FUA write to provide datasync semantics. This allows us to short-cut the generic_write_sync() call in IO completion and hence avoid unnecessary operations. This makes pure direct IO data write behaviour identical to the way block devices use REQ_FUA to provide datasync semantics. On a FUA enabled device, a synchronous direct IO write workload (sequential 4k overwrites in 32MB file) had the following results: # xfs_io -fd -c "pwrite -V 1 -D 0 32m" /mnt/scratch/boo kernel time write()s write iops Write b/w ------ ---- -------- ---------- --------- (no dsync) 4s 2173/s 2173 8.5MB/s vanilla 22s 370/s 750 1.4MB/s patched 19s 420/s 420 1.6MB/s The patched code clearly doesn't send cache flushes anymore, but instead uses FUA (confirmed via blktrace), and performance improves a bit as a result. However, the benefits will be higher on workloads that mix O_DSYNC overwrites with other write IO as we won't be flushing the entire device cache on every DSYNC overwrite IO anymore. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09iomap: iomap_dio_rw() handles all sync writesDave Chinner
Currently iomap_dio_rw() only handles (data)sync write completions for AIO. This means we can't optimised non-AIO IO to minimise device flushes as we can't tell the caller whether a flush is required or not. To solve this problem and enable further optimisations, make iomap_dio_rw responsible for data sync behaviour for all IO, not just AIO. In doing so, the sync operation is now accounted as part of the DIO IO by inode_dio_end(), hence post-IO data stability updates will no long race against operations that serialise via inode_dio_wait() such as truncate or hole punch. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: move generic_write_sync calls inwardsDave Chinner
To prepare for iomap iinfrastructure based DSYNC optimisations. While moving the code araound, move the XFS write bytes metric update for direct IO into xfs_dio_write_end_io callback so that we always capture the amount of data written via AIO+DIO. This fixes the problem where queued AIO+DIO writes are not accounted to this metric. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: don't retry xfs_buf_find on XBF_TRYLOCK failureDave Chinner
When looking at an event trace recently, I noticed that non-blocking buffer lookup attempts would fail on cached locked buffers and then run the slow cache-miss path. This means we are doing an xfs_buf allocation, lookup and free unnecessarily every time we avoid blocking on a locked buffer. Fix this by changing _xfs_buf_find() to return an error status to the caller to indicate that we failed the lock attempt rather than just returning a NULL. This allows the higher level code to discriminate between a cache miss and an cache hit that we failed to lock. This also allows us to return a -EFSCORRUPTED state if we are asked to look up a block number outside the range of the filesystem in _xfs_buf_find(), which moves us one step closer to being able to handle such errors in a more graceful manner at the higher levels. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: make xfs_buf_incore out of lineDave Chinner
Move xfs_buf_incore out of line and make it the only way to look up a buffer in the buffer cache from outside the buffer cache. Convert the external users of _xfs_buf_find() to xfs_buf_incore() and make _xfs_buf_find() static. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: actually rename xfs_incore -> xfs_buf_incore] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: trace ATTR flags in xattr tracepointsEric Sandeen
This will trace i.e. the ATTR_SECURE/ATTR_CREATE/ATTR_REPLACE flags as well as the OP_FLAGS. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: validate allocated inode numberDave Chinner
When we have corrupted free inode btrees, we can attempt to allocate inodes that we know are already allocated. Catch allocation of these inodes and report corruption as early as possible to prevent corruption propagation or deadlocks. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09xfs: validate cached inodes are free when allocatedDave Chinner
A recent fuzzed filesystem image cached random dcache corruption when the reproducer was run. This often showed up as panics in lookup_slow() on a null inode->i_ops pointer when doing pathwalks. BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 .... Call Trace: lookup_slow+0x44/0x60 walk_component+0x3dd/0x9f0 link_path_walk+0x4a7/0x830 path_lookupat+0xc1/0x470 filename_lookup+0x129/0x270 user_path_at_empty+0x36/0x40 path_listxattr+0x98/0x110 SyS_listxattr+0x13/0x20 do_syscall_64+0xf5/0x280 entry_SYSCALL_64_after_hwframe+0x42/0xb7 but had many different failure modes including deadlocks trying to lock the inode that was just allocated or KASAN reports of use-after-free violations. The cause of the problem was a corrupt INOBT on a v4 fs where the root inode was marked as free in the inobt record. Hence when we allocated an inode, it chose the root inode to allocate, found it in the cache and re-initialised it. We recently fixed a similar inode allocation issue caused by inobt record corruption problem in xfs_iget_cache_miss() in commit ee457001ed6c ("xfs: catch inode allocation state mismatch corruption"). This change adds similar checks to the cache-hit path to catch it, and turns the reproducer into a corruption shutdown situation. Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix typos in comment] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-09cifs: smb2ops: Fix listxattr() when there are no EAsPaulo Alcantara
As per listxattr(2): On success, a nonnegative number is returned indicating the size of the extended attribute name list. On failure, -1 is returned and errno is set appropriately. In SMB1, when the server returns an empty EA list through a listxattr(), it will correctly return 0 as there are no EAs for the given file. However, in SMB2+, it returns -ENODATA in listxattr() which is wrong since the request and response were sent successfully, although there's no actual EA for the given file. This patch fixes listxattr() for SMB2+ by returning 0 in cifs_listxattr() when the server returns an empty list of EAs. Signed-off-by: Paulo Alcantara <palcantara@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-09cifs: smbd: Enable signing with smbdirectLong Li
Now signing is supported with RDMA transport. Remove the code that disabled it. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-09cifs: Allocate validate negotiation request through kmallocLong Li
The data buffer allocated on the stack can't be DMA'ed, ib_dma_map_page will return an invalid DMA address for a buffer on stack. Even worse, this incorrect address can't be detected by ib_dma_mapping_error. Sending data from this address to hardware will not fail, but the remote peer will get junk data. Fix this by allocating the request on the heap in smb3_validate_negotiate. Changes in v2: Removed duplicated code on freeing buffers on function exit. (Thanks to Parav Pandit <parav@mellanox.com>) Fixed typo in the patch title. Changes in v3: Added "Fixes" to the patch. Changed several sizeof() to use *pointer in place of struct. Changes in v4: Added detailed comments on the failure through RDMA. Allocate request buffer using GPF_NOFS. Fixed possible memory leak. Changes in v5: Removed variable ret for checking return value. Changed to use pneg_inbuf->Dialects[0] to calculate unused space in pneg_inbuf. Fixes: ff1c038addc4 ("Check SMB3 dialects against downgrade attacks") Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Tom Talpey <ttalpey@microsoft.com>
2018-05-09IB: remove redundant INFINIBAND kconfig dependenciesGreg Thelen
INFINIBAND_ADDR_TRANS depends on INFINIBAND. So there's no need for options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND. Remove the unnecessary INFINIBAND depends. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09mm/pkeys, x86, powerpc: Display pkey in smaps if arch supports pkeysRam Pai
Currently the architecture specific code is expected to display the protection keys in smap for a given vma. This can lead to redundant code and possibly to divergent formats in which the key gets displayed. This patch changes the implementation. It displays the pkey only if the architecture support pkeys, i.e arch_pkeys_enabled() returns true. x86 arch_show_smap() function is not needed anymore, delete it. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> [mpe: Split out from larger patch, rebased on header changes] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
2018-05-09mm, powerpc, x86: introduce an additional vma bit for powerpc pkeyRam Pai
Currently only 4bits are allocated in the vma flags to hold 16 keys. This is sufficient for x86. PowerPC supports 32 keys, which needs 5bits. This patch allocates an additional bit. Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> [mpe: Fold in #if VM_PKEY_BIT4 as noticed by Dave Hansen] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-09mm, powerpc, x86: define VM_PKEY_BITx bits if CONFIG_ARCH_HAS_PKEYS is enabledRam Pai
VM_PKEY_BITx are defined only if CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is enabled. Powerpc also needs these bits. Hence lets define the VM_PKEY_BITx bits for any architecture that enables CONFIG_ARCH_HAS_PKEYS. Reviewed-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-07nfsd: restrict rd_maxcount to svc_max_payload in nfsd_encode_readdirScott Mayhew
nfsd4_readdir_rsize restricts rd_maxcount to svc_max_payload when estimating the size of the readdir reply, but nfsd_encode_readdir restricts it to INT_MAX when encoding the reply. This can result in log messages like "kernel: RPC request reserved 32896 but used 1049444". Restrict rd_dircount similarly (no reason it should be larger than svc_max_payload). Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Doug Ledford: "This is our first pull request of the rc cycle. It's not that it's been overly quiet, we were just waiting on a few things before sending this off. For instance, the 6 patch series from Intel for the hfi1 driver had actually been pulled in on Tuesday for a Wednesday pull request, only to have Jason notice something I missed, so we held off for some testing, and then on Thursday had to respin the series because the very first patch needed a minor fix (unnecessary cast is all). There is a sizable hns patch series in here, as well as a reasonably largish hfi1 patch series, then all of the lines of uapi updates are just the change to the new official Linux-OpenIB SPDX tag (a bunch of our files had what amounts to a BSD-2-Clause + MIT Warranty statement as their license as a result of the initial code submission years ago, and the SPDX folks decided it was unique enough to warrant a unique tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4 and core/cache/cma updates to round out the bunch. None of it was overly large by itself, but in the 2 1/2 weeks we've been collecting patches, it has added up :-/. As best I can tell, it's been through 0day (I got a notice about my last for-next push, but not for my for-rc push, but Jason seems to think that failure messages are prioritized and success messages not so much). It's also been through linux-next. And yes, we did notice in the context portion of the CMA query gid fix patch that there is a dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage and remove it anywhere we can. Summary: - Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off) - SPDX license tag cleanups (new tag Linux-OpenIB) - RoCE GID fixes related to default GIDs - Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch), mlx4, mlx5, and hfi1 (medium batch)" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits) RDMA/cma: Do not query GID during QP state transition to RTR IB/mlx4: Fix integer overflow when calculating optimal MTT size IB/hfi1: Fix memory leak in exception path in get_irq_affinity() IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used IB/hfi1: Fix loss of BECN with AHG IB/hfi1 Use correct type for num_user_context IB/hfi1: Fix handling of FECN marked multicast packet IB/core: Make ib_mad_client_id atomic iw_cxgb4: Atomically flush per QP HW CQEs IB/uverbs: Fix kernel crash during MR deregistration flow IB/uverbs: Prevent reregistration of DM_MR to regular MR RDMA/mlx4: Add missed RSS hash inner header flag RDMA/hns: Fix a couple misspellings RDMA/hns: Submit bad wr RDMA/hns: Update assignment method for owner field of send wqe RDMA/hns: Adjust the order of cleanup hem table RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set RDMA/hns: Remove some unnecessary attr_mask judgement RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set ...
2018-05-04Merge tag 'for-linus-20180504' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A collection of fixes that should to into this release. This contains: - Set of bcache fixes from Coly, fixing regression in patches that went into this series. - Set of NVMe fixes by way of Keith. - Set of bdi related fixes, one from Jan and two from Tetsuo Handa, fixing various issues around device addition/removal. - Two block inflight fixes from Omar, fixing issues around the transition to using tags for blk-mq inflight accounting that we did a few releases ago" * tag 'for-linus-20180504' of git://git.kernel.dk/linux-block: bdi: Fix oops in wb_workfn() nvmet: switch loopback target state to connecting when resetting nvme/multipath: Fix multipath disabled naming collisions nvme/multipath: Disable runtime writable enabling parameter nvme: Set integrity flag for user passthrough commands nvme: fix potential memory leak in option parsing bdi: Fix use after free bug in debugfs_remove() bdi: wake up concurrent wb_shutdown() callers. bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set bcache: set dc->io_disable to true in conditional_stop_bcache_device() bcache: add wait_for_kthread_stop() in bch_allocator_thread() bcache: count backing device I/O error for writeback I/O bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error() bcache: store disk name in struct cache and struct cached_dev blk-mq: fix sysfs inflight counter blk-mq: count allocated but not started requests in iostats inflight
2018-05-04Merge tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "I've got one more bug fix for xfs for 4.17-rc4, which caps the amount of data we try to handle in one dedupe request so that userspace can't livelock the kernel. This series has been run through a full xfstests run during the week and through a quick xfstests run against this morning's master, with no ajor failures reported. Summary: - Cap the maximum length of a deduplication request at MAX_RW_COUNT/2 to avoid kernel livelock due to excessively large IO requests" * tag 'xfs-4.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: cap the length of deduplication requests