summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-05-27xfs: remove flags argument from xfs_inode_ag_walkDarrick J. Wong
The incore inode walk code passes a flags argument and a pointer from the xfs_inode_ag_iterator caller all the way to the iteration function. We can reduce the function complexity by passing flags through the private pointer. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: remove xfs_inode_ag_iterator_flagsDarrick J. Wong
Combine xfs_inode_ag_iterator_flags and xfs_inode_ag_iterator_tag into a single wrapper function since there's only one caller of the _flags variant. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: remove unused xfs_inode_ag_iterator functionDarrick J. Wong
Not used by anyone, so get rid of it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: replace open-coded XFS_ICI_NO_TAGDarrick J. Wong
Use XFS_ICI_NO_TAG instead of -1 when appropriate. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: move eofblocks conversion function to xfs_ioctl.cDarrick J. Wong
Move xfs_fs_eofblocks_from_user into the only file that actually uses it, so that we don't have this function cluttering up the header file. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-05-27xfs: allow individual quota grace period extensionEric Sandeen
The only grace period which can be set in the kernel today is for id 0, i.e. the default grace period for all users. However, setting an individual grace period is useful; for example: Alice has a soft quota of 100 inodes, and a hard quota of 200 inodes Alice uses 150 inodes, and enters a short grace period Alice really needs to use those 150 inodes past the grace period The administrator extends Alice's grace period until next Monday vfs quota users such as ext4 can do this today, with setquota -T To enable this for XFS, we simply move the timelimit assignment out from under the (id == 0) test. Default setting remains under (id == 0). Note that this now is consistent with how we set warnings. (Userspace requires updates to enable this as well; xfs_quota needs to parse new options, and setquota needs to set appropriate field flags.) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: per-type quota timers and warn limitsEric Sandeen
Move timers and warnings out of xfs_quotainfo and into xfs_def_quota so that we can utilize them on a per-type basis, rather than enforcing them based on the values found in the first enabled quota type. Signed-off-by: Eric Sandeen <sandeen@redhat.com> [zlang: new way to get defquota in xfs_qm_init_timelimits] [zlang: remove redundant defq assign] Signed-off-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: switch xfs_get_defquota to take explicit typeEric Sandeen
xfs_get_defquota() currently takes an xfs_dquot, and from that obtains the type of default quota we should get (user/group/project). But early in init, we don't have access to a fully set up quota, so that's not possible. The next patch needs go set up default quota timers early, so switch xfs_get_defquota to take an explicit type and add a helper function to obtain that type from an xfs_dquot for the existing callers. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: pass xfs_dquot to xfs_qm_adjust_dqtimersEric Sandeen
Pass xfs_dquot rather than xfs_disk_dquot to xfs_qm_adjust_dqtimers; this makes it symmetric with xfs_qm_adjust_dqlimits and will help the next patch. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: fix up some whitespace in quota codeEric Sandeen
There is a fair bit of whitespace damage in the quota code, so fix up enough of it that subsequent patches are restricted to functional change to aid review. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: always return -ENOSPC on project quota reservation failureEric Sandeen
XFS project quota treats project hierarchies as "mini filesysems" and so rather than -EDQUOT, the intent is to return -ENOSPC when a quota reservation fails, but this behavior is not consistent. The only place we make a decision between -EDQUOT and -ENOSPC returns based on quota type is in xfs_trans_dqresv(). This behavior is currently controlled by whether or not the XFS_QMOPT_ENOSPC flag gets passed into the quota reservation. However, its use is not consistent; paths such as xfs_create() and xfs_symlink() don't set the flag, so a reservation failure will return -EDQUOT for project quota reservation failures rather than -ENOSPC for these sorts of operations, even for project quota: # mkdir mnt/project # xfs_quota -x -c "project -s -p mnt/project 42" mnt # xfs_quota -x -c 'limit -p isoft=2 ihard=3 42' mnt # touch mnt/project/file{1,2,3} touch: cannot touch ‘mnt/project/file3’: Disk quota exceeded We can make this consistent by not requiring the flag to be set at the top of the callchain; instead we can simply test whether we are reserving a project quota with XFS_QM_ISPDQ in xfs_trans_dqresv and if so, return -ENOSPC for that failure. This removes the need for the XFS_QMOPT_ENOSPC altogether and simplifies the code a fair bit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: group quota should return EDQUOT when prj quota enabledEric Sandeen
Long ago, group & project quota were mutually exclusive, and so when we turned on XFS_QMOPT_ENOSPC ("return ENOSPC if project quota is exceeded") when project quota was enabled, we only needed to disable it again for user quota. When group & project quota got separated, this got missed, and as a result if project quota is enabled and group quota is exceeded, the error code returned is incorrectly returned as ENOSPC not EDQUOT. Fix this by stripping XFS_QMOPT_ENOSPC out of flags for group quota when we try to reserve the space. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: remove the m_active_trans counterDave Chinner
It's a global atomic counter, and we are hitting it at a rate of half a million transactions a second, so it's bouncing the counter cacheline all over the place on large machines. We don't actually need it anymore - it used to be required because the VFS freeze code could not track/prevent filesystem transactions that were running, but that problem no longer exists. Hence to remove the counter, we simply have to ensure that nothing calls xfs_sync_sb() while we are trying to quiesce the filesytem. That only happens if the log worker is still running when we call xfs_quiesce_attr(). The log worker is cancelled at the end of xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it early here and then we can remove the counter altogether. Concurrent create, 50 million inodes, identical 16p/16GB virtual machines on different physical hosts. Machine A has twice the CPU cores per socket of machine B: unpatched patched machine A: 3m16s 2m00s machine B: 4m04s 4m05s Create rates: unpatched patched machine A: 282k+/-31k 468k+/-21k machine B: 231k+/-8k 233k+/-11k Concurrent rm of same 50 million inodes: unpatched patched machine A: 6m42s 2m33s machine B: 4m47s 4m47s The transaction rate on the fast machine went from just under 300k/sec to 700k/sec, which indicates just how much of a bottleneck this atomic counter was. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: separate read-only variables in struct xfs_mountDave Chinner
Seeing massive cpu usage from xfs_agino_range() on one machine; instruction level profiles look similar to another machine running the same workload, only one machine is consuming 10x as much CPU as the other and going much slower. The only real difference between the two machines is core count per socket. Both are running identical 16p/16GB virtual machine configurations Machine A: 25.83% [k] xfs_agino_range 12.68% [k] __xfs_dir3_data_check 6.95% [k] xfs_verify_ino 6.78% [k] xfs_dir2_data_entry_tag_p 3.56% [k] xfs_buf_find 2.31% [k] xfs_verify_dir_ino 2.02% [k] xfs_dabuf_map.constprop.0 1.65% [k] xfs_ag_block_count And takes around 13 minutes to remove 50 million inodes. Machine B: 13.90% [k] __pv_queued_spin_lock_slowpath 3.76% [k] do_raw_spin_lock 2.83% [k] xfs_dir3_leaf_check_int 2.75% [k] xfs_agino_range 2.51% [k] __raw_callee_save___pv_queued_spin_unlock 2.18% [k] __xfs_dir3_data_check 2.02% [k] xfs_log_commit_cil And takes around 5m30s to remove 50 million inodes. Suspect is cacheline contention on m_sectbb_log which is used in one of the macros in xfs_agino_range. This is a read-only variable but shares a cacheline with m_active_trans which is a global atomic that gets bounced all around the machine. The workload is trying to run hundreds of thousands of transactions per second and hence cacheline contention will be occurring on this atomic counter. Hence xfs_agino_range() is likely just be an innocent bystander as the cache coherency protocol fights over the cacheline between CPU cores and sockets. On machine A, this rearrangement of the struct xfs_mount results in the profile changing to: 9.77% [kernel] [k] xfs_agino_range 6.27% [kernel] [k] __xfs_dir3_data_check 5.31% [kernel] [k] __pv_queued_spin_lock_slowpath 4.54% [kernel] [k] xfs_buf_find 3.79% [kernel] [k] do_raw_spin_lock 3.39% [kernel] [k] xfs_verify_ino 2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock Vastly less CPU usage in xfs_agino_range(), but still 3x the amount of machine B and still runs substantially slower than it should. Current rm -rf of 50 million files: vanilla patched machine A 13m20s 6m42s machine B 5m30s 5m02s It's an improvement, hence indicating that separation and further optimisation of read-only global filesystem data is worthwhile, but it clearly isn't the underlying issue causing this specific performance degradation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: reduce free inode accounting overheadDave Chinner
Shaokun Zhang reported that XFS was using substantial CPU time in percpu_count_sum() when running a single threaded benchmark on a high CPU count (128p) machine from xfs_mod_ifree(). The issue is that the filesystem is empty when the benchmark runs, so inode allocation is running with a very low inode free count. With the percpu counter batching, this means comparisons when the counter is less that 128 * 256 = 32768 use the slow path of adding up all the counters across the CPUs, and this is expensive on high CPU count machines. The summing in xfs_mod_ifree() is only used to fire an assert if an underrun occurs. The error is ignored by the higher level code. Hence this is really just debug code and we don't need to run it on production kernels, nor do we need such debug checks to return error values just to trigger an assert. Finally, xfs_mod_icount/xfs_mod_ifree are only called from xfs_trans_unreserve_and_mod_sb(), so get rid of them and just directly call the percpu_counter_add/percpu_counter_compare functions. The compare functions are now run only on debug builds as they are internal to ASSERT() checks and so only compiled in when ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y). Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: gut error handling in xfs_trans_unreserve_and_mod_sb()Dave Chinner
xfs: gut error handling in xfs_trans_unreserve_and_mod_sb() From: Dave Chinner <dchinner@redhat.com> The error handling in xfs_trans_unreserve_and_mod_sb() is largely incorrect - rolling back the changes in the transaction if only one counter underruns makes all the other counters incorrect. We still allow the change to proceed and committing the transaction, except now we have multiple incorrect counters instead of a single underflow. Further, we don't actually report the error to the caller, so this is completely silent except on debug kernels that will assert on failure before we even get to the rollback code. Hence this error handling is broken, untested, and largely unnecessary complexity. Just remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27x86: Hide the archdata.iommu field behind generic IOMMU_APIKrzysztof Kozlowski
There is a generic, kernel wide configuration symbol for enabling the IOMMU specific bits: CONFIG_IOMMU_API. Implementations (including INTEL_IOMMU and AMD_IOMMU driver) select it so use it here as well. This makes the conditional archdata.iommu field consistent with other platforms and also fixes any compile test builds of other IOMMU drivers, when INTEL_IOMMU or AMD_IOMMU are not selected). For the case when INTEL_IOMMU/AMD_IOMMU and COMPILE_TEST are not selected, this should create functionally equivalent code/choice. With COMPILE_TEST this field could appear if other IOMMU drivers are chosen but neither INTEL_IOMMU nor AMD_IOMMU are not. Reported-by: kbuild test robot <lkp@intel.com> Fixes: e93a1695d7fb ("iommu: Enable compile testing for some of drivers") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200518120855.27822-2-krzk@kernel.org Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-05-27ia64: Hide the archdata.iommu field behind generic IOMMU_APIKrzysztof Kozlowski
There is a generic, kernel wide configuration symbol for enabling the IOMMU specific bits: CONFIG_IOMMU_API. Implementations (including INTEL_IOMMU driver) select it so use it here as well. This makes the conditional archdata.iommu field consistent with other platforms and also fixes any compile test builds of other IOMMU drivers, when INTEL_IOMMU is not selected). For the case when INTEL_IOMMU and COMPILE_TEST are not selected, this should create functionally equivalent code/choice. With COMPILE_TEST this field could appear if other IOMMU drivers are chosen but INTEL_IOMMU not. Reported-by: kbuild test robot <lkp@intel.com> Fixes: e93a1695d7fb ("iommu: Enable compile testing for some of drivers") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Link: https://lore.kernel.org/r/20200518120855.27822-1-krzk@kernel.org Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-05-27Merge tag 'gpio-fixes-for-v5.7' of ↵Linus Walleij
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into fixes gpio fixes for v5.7 - fix mutex and spinlock ordering in gpio-mlxbf2 - fix the return value checks on devm_platform_ioremap_resource in gpio-pxa and gpio-bcm-kona
2020-05-27MIPS: Loongson64: select NO_EXCEPT_FILLJiaxun Yang
Loongson64 load kernel at 0x82000000 and allocate exception vectors by ebase. So we don't need to reserve space for exception vectors at head of kernel. Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-27arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()Anshuman Khandual
There is no way to proceed when requested register could not be searched in arm64_ftr_reg[]. Requesting for a non present register would be an error as well. Hence lets just WARN_ON() when search fails in get_arm64_ftr_reg() rather than checking for return value and doing a BUG_ON() instead in some individual callers. But there are also caller instances that dont error out when register search fails. Add a new helper get_arm64_ftr_reg_nowarn() for such cases. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Link: https://lore.kernel.org/r/1590573876-19120-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2020-05-27block: blk-crypto-fallback: remove redundant initialization of variable errColin Ian King
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Satya Tangirala <satyat@google.com> Addresses-Coverity: ("Unused value") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27netfilter: nf_conntrack_pptp: fix compilation warning with W=1 buildPablo Neira Ayuso
>> include/linux/netfilter/nf_conntrack_pptp.h:13:20: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers] extern const char *const pptp_msg_name(u_int16_t msg); ^~~~~~ Reported-by: kbuild test robot <lkp@intel.com> Fixes: 4c559f15efcc ("netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27netfilter: conntrack: comparison of unsigned in cthelper confirmationPablo Neira Ayuso
net/netfilter/nf_conntrack_core.c: In function nf_confirm_cthelper: net/netfilter/nf_conntrack_core.c:2117:15: warning: comparison of unsigned expression in < 0 is always false [-Wtype-limits] 2117 | if (protoff < 0 || (frag_off & htons(~0x7)) != 0) | ^ ipv6_skip_exthdr() returns a signed integer. Reported-by: Colin Ian King <colin.king@canonical.com> Fixes: 703acd70f249 ("netfilter: nfnetlink_cthelper: unbreak userspace helper support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27netfilter: conntrack: Pass value of ctinfo to __nf_conntrack_updateNathan Chancellor
Clang warns: net/netfilter/nf_conntrack_core.c:2068:21: warning: variable 'ctinfo' is uninitialized when used here [-Wuninitialized] nf_ct_set(skb, ct, ctinfo); ^~~~~~ net/netfilter/nf_conntrack_core.c:2024:2: note: variable 'ctinfo' is declared here enum ip_conntrack_info ctinfo; ^ 1 warning generated. nf_conntrack_update was split up into nf_conntrack_update and __nf_conntrack_update, where the assignment of ctinfo is in nf_conntrack_update but it is used in __nf_conntrack_update. Pass the value of ctinfo from nf_conntrack_update to __nf_conntrack_update so that uninitialized memory is not used and everything works properly. Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again") Link: https://github.com/ClangBuiltLinux/linux/issues/1039 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27block: reduce part_stat_lock() scopeChristoph Hellwig
We only need the stats lock (aka preempt_disable()) for updating the states, not for looking up or dropping the hd_struct reference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: use __this_cpu_add() instead of access by smp_processor_id()Konstantin Khlebnikov
Most architectures have fast path to access percpu for current cpu. The required preempt_disable() is provided by part_stat_lock(). [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: remove rcu_read_lock() from part_stat_lock()Konstantin Khlebnikov
The RCU lock is required only in disk_map_sector_rcu() to lookup the partition. After that request holds reference to related hd_struct. Replace get_cpu() with preempt_disable() - returned cpu index is unused. [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: add a blk_account_io_merge_bio helperKonstantin Khlebnikov
Move the non-"new_io" branch of blk_account_io_start() into separate function. Fix merge accounting for discards (they were counted as write merges). The new blk_account_io_merge_bio() doesn't call update_io_ticks() unlike blk_account_io_start(), as there is no reason for that. [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: account merge of two requestsKonstantin Khlebnikov
Also rename blk_account_io_merge() into blk_account_io_merge_request() to distinguish it from merging request and bio. [hch: rebased] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: always use a percpu variable for disk statsChristoph Hellwig
percpu variables have a perfectly fine working stub implementation for UP kernels, so use that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: move update_io_ticks to blk-core.cChristoph Hellwig
All callers are in blk-core.c, so move update_io_ticks over. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: remove generic_{start,end}_io_acctChristoph Hellwig
Remove these now unused functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27zram: nvdimm: use bio_{start,end}_io_acct and disk_{start,end}_io_acctChristoph Hellwig
Switch zram to use the nicer bio accounting helpers, and as part of that ensure each bio is counted as a single I/O request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27nvdimm: use bio_{start,end}_io_acctChristoph Hellwig
Switch dm to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27dm: use bio_{start,end}_io_acctChristoph Hellwig
Switch dm to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27bcache: use bio_{start,end}_io_acctChristoph Hellwig
Switch bcache to use the nicer bio accounting helpers, and call the routines where we also sample the start time to give coherent accounting results. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27lightnvm/pblk: use bio_{start,end}_io_acctChristoph Hellwig
Switch rsxx to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27rsxx: use bio_{start,end}_io_acctChristoph Hellwig
Switch rsxx to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27drbd: use bio_{start,end}_io_acctChristoph Hellwig
Switch drbd to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27block: add disk/bio-based accounting helpersChristoph Hellwig
Add two new helpers to simplify I/O accounting for bio based drivers. Currently these drivers use the generic_start_io_acct and generic_end_io_acct helpers which have very cumbersome calling conventions, don't actually return the time they started accounting, and try to deal with accounting for partitions, which can't happen for bio based drivers. The new helpers will be used to subsequently replace uses of the old helpers. The main API is the bio based wrappes in blkdev.h, but for zram which wants to account rw_page based I/O lower level routines are provided as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27bcache: configure the asynchronous registertion to be experimentalColy Li
In order to avoid the experimental async registration interface to be treated as new kernel ABI for common users, this patch makes it as an experimental kernel configure BCACHE_ASYNC_REGISTRAION. This interface is for extreme large cached data situation, to make sure the bcache device can always created without the udev timeout issue. For normal users the async or sync registration does not make difference. In future when we decide to use the asynchronous registration as default behavior, this experimental interface may be removed. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27bcache: asynchronous devices registrationColy Li
When there is a lot of data cached on cache device, the bcach internal btree can take a very long to validate during the backing device and cache device registration. In my test, it may takes 55+ minutes to check all the internal btree nodes. The problem is that the registration is invoked by udev rules and the udevd has 180 seconds timeout by default. If the btree node checking time is longer than udevd timeout, the registering process will be killed by udevd with SIGKILL. If the registering process has pending sigal, creating kthread for bcache will fail and the device registration will fail. The result is, for bcache device which cached a lot of data on cache device, the bcache device node like /dev/bcache<N> won't create always due to the very long btree checking time. A solution to avoid the udevd 180 seconds timeout is to register devices in an asynchronous way. Which is, after writing cache or backing device path into /sys/fs/bcache/register_async, the kernel code will create a kworker and move all the btree node checking (for cache device) or dirty data counting (for cached device) in the kwork context. Then the kworder is scheduled on system_wq and the registration code just returned to user space udev rule task. By this asynchronous way, the udev task for bcache rule will complete in seconds, no matter how long time spent in the kworker context, it won't be killed by udevd for a timeout. After all the checking and counting are done asynchronously in the kworker, the bcache device will eventually be created successfully. This patch does the above chagne and add a register sysfs file /sys/fs/bcache/register_async. Writing the registering device path into this sysfs file will do the asynchronous registration. The register_async interface is for very rare condition and won't be used for common users. In future I plan to make the asynchronous registration as default behavior, which depends on feedback for this patch. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27bcache: fix refcount underflow in bcache_device_free()Coly Li
The problematic code piece in bcache_device_free() is, 785 static void bcache_device_free(struct bcache_device *d) 786 { 787 struct gendisk *disk = d->disk; [snipped] 799 if (disk) { 800 if (disk->flags & GENHD_FL_UP) 801 del_gendisk(disk); 802 803 if (disk->queue) 804 blk_cleanup_queue(disk->queue); 805 806 ida_simple_remove(&bcache_device_idx, 807 first_minor_to_idx(disk->first_minor)); 808 put_disk(disk); 809 } [snipped] 816 } At line 808, put_disk(disk) may encounter kobject refcount of 'disk' being underflow. Here is how to reproduce the issue, - Attche the backing device to a cache device and do random write to make the cache being dirty. - Stop the bcache device while the cache device has dirty data of the backing device. - Only register the backing device back, NOT register cache device. - The bcache device node /dev/bcache0 won't show up, because backing device waits for the cache device shows up for the missing dirty data. - Now echo 1 into /sys/fs/bcache/pendings_cleanup, to stop the pending backing device. - After the pending backing device stopped, use 'dmesg' to check kernel message, a use-after-free warning from KASA reported the refcount of kobject linked to the 'disk' is underflow. The dropping refcount at line 808 in the above code piece is added by add_disk(d->disk) in bch_cached_dev_run(). But in the above condition the cache device is not registered, bch_cached_dev_run() has no chance to be called and the refcount is not added. The put_disk() for a non- added refcount of gendisk kobject triggers a underflow warning. This patch checks whether GENHD_FL_UP is set in disk->flags, if it is not set then the bcache device was not added, don't call put_disk() and the the underflow issue can be avoided. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27bcache: Convert pr_<level> uses to a more typical styleJoe Perches
Remove the trailing newline from the define of pr_fmt and add newlines to the uses. Miscellanea: o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont as the earlier conversion was inappropriate done causing multiple lines to be emitted where only a single output line was desired o Use vsprintf extension %pV in bch_cache_set_error to avoid multiple line output where only a single line output was desired o Coalesce formats Fixes: 6ae63e3501c4 ("bcache: replace printk() by pr_*() routines") Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27bcache: remove redundant variables i and nColin Ian King
Variables i and n are being assigned but are never used. They are redundant and can be removed. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Addresses-Coverity: ("Unused value") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27Merge branch 'nvme-5.8' of git://git.infradead.org/nvme into for-5.8/driversJens Axboe
Pull NVMe updates from Christoph: "The second large batch of nvme updates: - t10 protection information support for nvme-rdma and nvmet-rdma (Israel Rukshin and Max Gurtovoy) - target side AEN improvements (Chaitanya Kulkarni) - various fixes and minor improvements all over, icluding the nvme part of the lpfc driver" * 'nvme-5.8' of git://git.infradead.org/nvme: (38 commits) lpfc: Fix return value in __lpfc_nvme_ls_abort lpfc: fix axchg pointer reference after free and double frees lpfc: Fix pointer checks and comments in LS receive refactoring nvme: set dma alignment to qword nvmet: cleanups the loop in nvmet_async_events_process nvmet: fix memory leak when removing namespaces and controllers concurrently nvmet-rdma: add metadata/T10-PI support nvmet: add metadata support for block devices nvmet: add metadata/T10-PI support nvme: add Metadata Capabilities enumerations nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len nvmet: rename nvmet_rw_len to nvmet_rw_data_len nvmet: add metadata characteristics for a namespace nvme-rdma: add metadata/T10-PI support nvme-rdma: introduce nvme_rdma_sgl structure nvme: introduce NVME_INLINE_METADATA_SG_CNT nvme: enforce extended LBA format for fabrics metadata nvme: introduce max_integrity_segments ctrl attribute nvme: make nvme_ns_has_pi accessible to transports nvme: introduce NVME_NS_METADATA_SUPPORTED flag ...
2020-05-27x86/apb_timer: Drop unused declaration and macroJohan Hovold
Drop an extern declaration that has never been used and a no longer needed macro. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200513100944.9171-2-johan@kernel.org
2020-05-27MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()YuanJunQing
Register "a1" is unsaved in this function, when CONFIG_TRACE_IRQFLAGS is enabled, the TRACE_IRQS_OFF macro will call trace_hardirqs_off(), and this may change register "a1". The changed register "a1" as argument will be send to do_fpe() and do_msa_fpe(). Signed-off-by: YuanJunQing <yuanjunqing66@163.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-27net/mlx5: Add ability to read and write ECE optionsLeon Romanovsky
The end result of RDMA-CM ECE handshake is ECE options, which is needed to be used while configuring data QPs. Such options can come in any QP state, so add in/out fields to set and query ECE options. OUT fields: * create_qp() - default ECE options for that type of QP. * modify_qp() - enabled ECE options after QP state transition. IN fields: * create_qp() - create QP with this ECE option. * modify_qp() - requested options. For unconnected QPs, the FW will return an error if ECE is already configured with any options that not equal to previously set. Reviewed-by: Mark Zhang <markz@mellanox.com> Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>