summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2020-09-21dm: fix comment in dm_process_bio()Mike Snitzer
Refer to the correct function (->submit_bio instead of ->queue_bio). Also, add details about why using blk_queue_split() isn't needed for dm_wq_work()'s call to dm_process_bio(). Fixes: c62b37d96b6eb ("block: move ->make_request_fn to struct block_device_operations") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-21dm: fix bio splitting and its bio completion order for regular IOMike Snitzer
dm_queue_split() is removed because __split_and_process_bio() _must_ handle splitting bios to ensure proper bio submission and completion ordering as a bio is split. Otherwise, multiple recursive calls to ->submit_bio will cause multiple split bios to be allocated from the same ->bio_split mempool at the same time. This would result in deadlock in low memory conditions because no progress could be made (only one bio is available in ->bio_split mempool). This fix has been verified to still fix the loss of performance, due to excess splitting, that commit 120c9257f5f1 provided. Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"") Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-20dm: Call proper helper to determine dax supportJan Kara
DM was calling generic_fsdax_supported() to determine whether a device referenced in the DM table supports DAX. However this is a helper for "leaf" device drivers so that they don't have to duplicate common generic checks. High level code should call dax_supported() helper which that calls into appropriate helper for the particular device. This problem manifested itself as kernel messages: dm-3: error: dax access failed (-95) when lvm2-testsuite run in cases where a DM device was stacked on top of another DM device. Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices") Cc: <stable@vger.kernel.org> Tested-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mike Snitzer <snitzer@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-09-20dm/dax: Fix table reference countsDan Williams
A recent fix to the dm_dax_supported() flow uncovered a latent bug. When dm_get_live_table() fails it is still required to drop the srcu_read_lock(). Without this change the lvm2 test-suite triggers this warning: # lvm2-testsuite --only pvmove-abort-all.sh WARNING: lock held when returning to user space! 5.9.0-rc5+ #251 Tainted: G OE ------------------------------------------------ lvm/1318 is leaving the kernel with locks still held! 1 lock held by lvm/1318: #0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod] ...and later on this hang signature: INFO: task lvm:1344 blocked for more than 122 seconds. Tainted: G OE 5.9.0-rc5+ #251 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:lvm state:D stack: 0 pid: 1344 ppid: 1 flags:0x00004000 Call Trace: __schedule+0x45f/0xa80 ? finish_task_switch+0x249/0x2c0 ? wait_for_completion+0x86/0x110 schedule+0x5f/0xd0 schedule_timeout+0x212/0x2a0 ? __schedule+0x467/0xa80 ? wait_for_completion+0x86/0x110 wait_for_completion+0xb0/0x110 __synchronize_srcu+0xd1/0x160 ? __bpf_trace_rcu_utilization+0x10/0x10 __dm_suspend+0x6d/0x210 [dm_mod] dm_suspend+0xf6/0x140 [dm_mod] Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices") Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Reported-by: Adrian Huang <ahuang12@lenovo.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Adrian Huang <ahuang12@lenovo.com> Link: https://lore.kernel.org/r/160045867590.25663.7548541079217827340.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-09-02dm thin metadata: Fix use-after-free in dm_bm_set_read_onlyYe Bin
The following error ocurred when testing disk online/offline: [ 301.798344] device-mapper: thin: 253:5: aborting current metadata transaction [ 301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction [ 301.849206] Aborting journal on device dm-26-8. [ 301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted [ 301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30 [ 301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data] Reason is: metadata_operation_failed abort_transaction dm_pool_abort_metadata __create_persistent_data_objects r = __open_or_format_metadata if (r) --> If failed will free pmd->bm but pmd->bm not set NULL dm_block_manager_destroy(pmd->bm); set_pool_mode dm_pool_metadata_read_only(pool->pmd); dm_bm_set_read_only(pmd->bm); --> use-after-free Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and dm_bm_set_read_write functions. If bm is NULL it means creating the bm failed and so dm_bm_is_read_only must return true. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-02dm thin metadata: Avoid returning cmd->bm wild pointer on errorYe Bin
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a pointer, it will lead to some strange things. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-02dm cache metadata: Avoid returning cmd->bm wild pointer on errorYe Bin
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a pointer, it will lead to some strange things. Signed-off-by: Ye Bin <yebin10@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01dm integrity: fix error reporting in bitmap mode after creationMikulas Patocka
The dm-integrity target did not report errors in bitmap mode just after creation. The reason is that the function integrity_recalc didn't clean up ic->recalc_bitmap as it proceeded with recalculation. Fix this by updating the bitmap accordingly -- the double shift serves to rounddown. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01dm crypt: Initialize crypto wait structuresDamien Le Moal
Use the DECLARE_CRYPTO_WAIT() macro to properly initialize the crypto wait structures declared on stack before their use with crypto_wait_req(). Fixes: 39d13a1ac41d ("dm crypt: reuse eboiv skcipher for IV generation") Fixes: bbb1658461ac ("dm crypt: Implement Elephant diffuser for Bitlocker compatibility") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01dm mpath: fix racey management of PG initializationMike Snitzer
Commit 935fcc56abc3 ("dm mpath: only flush workqueue when needed") changed flush_multipath_work() to avoid needless workqueue flushing (of a multipath global workqueue). But that change didn't realize the surrounding flush_multipath_work() code should also only run if 'pg_init_in_progress' is set. Fix this by only doing all of flush_multipath_work()'s PG init related work if 'pg_init_in_progress' is set. Otherwise multipath_wait_for_pg_init_completion() will run unconditionally but the preceeding flush_workqueue(kmpath_handlerd) may not. This could lead to deadlock (though only if kmpath_handlerd never runs a corresponding work to decrement 'pg_init_in_progress'). It could also be, though highly unlikely, that the kmpath_handlerd work that does PG init completes before 'pg_init_in_progress' is set, and then an intervening DM table reload's multipath_postsuspend() triggers flush_multipath_work(). Fixes: 935fcc56abc3 ("dm mpath: only flush workqueue when needed") Cc: stable@vger.kernel.org Reported-by: Ben Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-01dm writecache: handle DAX to partitions on persistent memory correctlyMikulas Patocka
The function dax_direct_access doesn't take partitions into account, it always maps pages from the beginning of the device. Therefore, persistent_memory_claim() must get the partition offset using get_start_sect() and add it to the page offsets passed to dax_direct_access(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 48debafe4f2f ("dm: add writecache target") Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-28Merge tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - nbd timeout fix (Hou) - device size fix for loop LOOP_CONFIGURE (Martijn) - MD pull from Song with raid5 stripe size fix (Yufen) * tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block: md/raid5: make sure stripe_size as power of two loop: Set correct device size when using LOOP_CONFIGURE nbd: restore default timeout when setting it to zero
2020-08-27md/raid5: make sure stripe_size as power of twoYufen Yu
Commit 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs entry") make stripe_size as a configurable value. It just requires stripe_size as multiple of 4KB. In fact, we should make sure stripe_size as power of two. Otherwise, stripe_shift which is the result of ilog2 can not represent the real stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may get unexpected value. Fixes: 3b5408b98e4d ("md/raid5: support config stripe_size by sysfs entry") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-15Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few fixes on the block side of things: - Discard granularity fix (Coly) - rnbd cleanups (Guoqing) - md error handling fix (Dan) - md sysfs fix (Junxiao) - Fix flush request accounting, which caused an IO slowdown for some configurations (Ming) - Properly propagate loop flag for partition scanning (Lennart)" * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block: block: fix double account of flush request's driver tag loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE rnbd: no need to set bi_end_io in rnbd_bio_map_kern rnbd: remove rnbd_dev_submit_io md-cluster: Fix potential error pointer dereference in resize_bitmaps() block: check queue's limits.discard_granularity in __blkdev_issue_discard() md: get sysfs entry after redundancy attr group create
2020-08-10Merge tag 'locking-urgent-2020-08-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Thomas Gleixner: "A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generally useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers" * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) locking/seqlock, headers: Untangle the spaghetti monster locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header x86/headers: Remove APIC headers from <asm/smp.h> seqcount: More consistent seqprop names seqcount: Compress SEQCNT_LOCKNAME_ZERO() seqlock: Fold seqcount_LOCKNAME_init() definition seqlock: Fold seqcount_LOCKNAME_t definition seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g hrtimer: Use sequence counter with associated raw spinlock kvm/eventfd: Use sequence counter with associated spinlock userfaultfd: Use sequence counter with associated spinlock NFSv4: Use sequence counter with associated spinlock iocost: Use sequence counter with associated spinlock raid5: Use sequence counter with associated spinlock vfs: Use sequence counter with associated spinlock timekeeping: Use sequence counter with associated raw spinlock xfrm: policy: Use sequence counters with associated lock netfilter: nft_set_rbtree: Use sequence counter with associated rwlock netfilter: conntrack: Use sequence counter with associated spinlock sched: tasks: Use sequence counter with associated spinlock ...
2020-08-07Merge tag 'for-5.9/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - DM multipath locking fixes around m->flags tests and improvements to bio-based code so that it follows patterns established by request-based code. - Request-based DM core improvement to eliminate unnecessary call to blk_mq_queue_stopped(). - Add "panic_on_corruption" error handling mode to DM verity target. - DM bufio fix to to perform buffer cleanup from a workqueue rather than wait for IO in reclaim context from shrinker. - DM crypt improvement to optionally avoid async processing via workqueues for reads and/or writes -- via "no_read_workqueue" and "no_write_workqueue" features. This more direct IO processing improves latency and throughput with faster storage. Avoiding workqueue IO submission for writes (DM_CRYPT_NO_WRITE_WORKQUEUE) is a requirement for adding zoned block device support to DM crypt. - Add zoned block device support to DM crypt. Makes use of DM_CRYPT_NO_WRITE_WORKQUEUE and a new optional feature (DM_CRYPT_WRITE_INLINE) that allows write completion to wait for encryption to complete. This allows write ordering to be preserved, which is needed for zoned block devices. - Fix DM ebs target's check for REQ_OP_FLUSH. - Fix DM core's report zones support to not report more zones than were requested. - A few small compiler warning fixes. - DM dust improvements to return output directly to the user rather than require they scrape the system log for output. * tag 'for-5.9/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: don't call report zones for more than the user requested dm ebs: Fix incorrect checking for REQ_OP_FLUSH dm init: Set file local variable static dm ioctl: Fix compilation warning dm raid: Remove empty if statement dm verity: Fix compilation warning dm crypt: Enable zoned block device support dm crypt: add flags to optionally bypass kcryptd workqueues dm bufio: do buffer cleanup from a workqueue dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue() dm dust: add interface to list all badblocks dm dust: report some message results directly back to user dm verity: add "panic_on_corruption" error handling mode dm mpath: use double checked locking in fast path dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctl dm mpath: rework __map_bio() dm mpath: factor out multipath_queue_bio dm mpath: push locking down to must_push_back_rq() dm mpath: take m->lock spinlock when testing QUEUE_IF_NO_PATH dm mpath: changes from initial m->flags locking audit
2020-08-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: - a few MM hotfixes - kthread, tools, scripts, ntfs and ocfs2 - some of MM Subsystems affected by this patch series: kthread, tools, scripts, ntfs, ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan, debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore, sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) mm: vmscan: consistent update to pgrefill mm/vmscan.c: fix typo khugepaged: khugepaged_test_exit() check mmget_still_valid() khugepaged: retract_page_tables() remember to test exit khugepaged: collapse_pte_mapped_thp() protect the pmd lock khugepaged: collapse_pte_mapped_thp() flush the right range mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible mm: thp: replace HTTP links with HTTPS ones mm/page_alloc: fix memalloc_nocma_{save/restore} APIs mm/page_alloc.c: skip setting nodemask when we are in interrupt mm/page_alloc: fallbacks at most has 3 elements mm/page_alloc: silence a KASAN false positive mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask() mm/page_alloc.c: simplify pageblock bitmap access mm/page_alloc.c: extract the common part in pfn_to_bitidx() mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits mm/shuffle: remove dynamic reconfiguration mm/memory_hotplug: document why shuffle_zone() is relevant mm/page_alloc: remove nr_free_pagecache_pages() mm: remove vm_total_pages ...
2020-08-07mm, treewide: rename kzfree() to kfree_sensitive()Waiman Long
As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/kzfree/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and adding a kzfree backward compatibility macro in slab.h. [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h] [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more] Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07Merge tag 'powerpc-5.9-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Add support for (optionally) using queued spinlocks & rwlocks. - Support for a new faster system call ABI using the scv instruction on Power9 or later. - Drop support for the PROT_SAO mmap/mprotect flag as it will be unsupported on Power10 and future processors, leaving us with no way to implement the functionality it requests. This risks breaking userspace, though we believe it is unused in practice. - A bug fix for, and then the removal of, our custom stack expansion checking. We now allow stack expansion up to the rlimit, like other architectures. - Remove the remnants of our (previously disabled) topology update code, which tried to react to NUMA layout changes on virtualised systems, but was prone to crashes and other problems. - Add PMU support for Power10 CPUs. - A change to our signal trampoline so that we don't unbalance the link stack (branch return predictor) in the signal delivery path. - Lots of other cleanups, refactorings, smaller features and so on as usual. Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand, Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran, Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud, Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov, Wei Yongjun, Wen Xiong, YueHaibing. * tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits) selftests/powerpc: Fix pkey syscall redefinitions powerpc: Fix circular dependency between percpu.h and mmu.h powerpc/powernv/sriov: Fix use of uninitialised variable selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs powerpc/40x: Fix assembler warning about r0 powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric powerpc/papr_scm: Fetch nvdimm performance stats from PHYP cpuidle: pseries: Fixup exit latency for CEDE(0) cpuidle: pseries: Add function to parse extended CEDE records cpuidle: pseries: Set the latency-hint before entering CEDE selftests/powerpc: Fix online CPU selection powerpc/perf: Consolidate perf_callchain_user_[64|32]() powerpc/pseries/hotplug-cpu: Remove double free in error path powerpc/pseries/mobility: Add pr_debug() for device tree changes powerpc/pseries/mobility: Set pr_fmt() powerpc/cacheinfo: Warn if cache object chain becomes unordered powerpc/cacheinfo: Improve diagnostics about malformed cache lists powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages powerpc/cacheinfo: Set pr_fmt() powerpc: fix function annotations to avoid section mismatch warnings with gcc-10 ...
2020-08-07Merge branch 'hch.init_path' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull init and set_fs() cleanups from Al Viro: "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series" * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits) init: add an init_dup helper init: add an init_utimes helper init: add an init_stat helper init: add an init_mknod helper init: add an init_mkdir helper init: add an init_symlink helper init: add an init_link helper init: add an init_eaccess helper init: add an init_chmod helper init: add an init_chown helper init: add an init_chroot helper init: add an init_chdir helper init: add an init_rmdir helper init: add an init_unlink helper init: add an init_umount helper init: add an init_mount helper init: mark create_dev as __init init: mark console_on_rootfs as __init init: initialize ramdisk_execute_command at compile time devtmpfs: refactor devtmpfsd() ...
2020-08-06Merge branch 'md-next' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9 Pull MD fixes from Song. * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md-cluster: Fix potential error pointer dereference in resize_bitmaps() md: get sysfs entry after redundancy attr group create
2020-08-06Merge branch 'WIP.locking/seqlocks' into locking/urgentIngo Molnar
Pick up the full seqlock series PeterZ is working on. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-08-05md-cluster: Fix potential error pointer dereference in resize_bitmaps()Dan Carpenter
The error handling calls md_bitmap_free(bitmap) which checks for NULL but will Oops if we pass an error pointer. Let's set "bitmap" to NULL on this error path. Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-05md: get sysfs entry after redundancy attr group createJunxiao Bi
"sync_completed" and "degraded" belongs to redundancy attr group, it was not exist yet when md device was created. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: e1a86dbbbd6a ("md: fix deadlock causing by sysfs_notify") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-05Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block stacking updates from Jens Axboe: "The stacking related fixes depended on both the core block and drivers branches, so here's a topic branch with that change. Outside of that, a late fix from Johannes for zone revalidation" * tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block: block: don't do revalidate zones on invalid devices block: remove blk_queue_stack_limits block: remove bdev_stack_limits block: inherit the zoned characteristics in blk_stack_limits
2020-08-05Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - NVMe: - ZNS support (Aravind, Keith, Matias, Niklas) - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David, Dongli, Max, Sagi) - null_blk zone capacity support (Aravind) - MD: - raid5/6 fixes (ChangSyun) - Warning fixes (Damien) - raid5 stripe fixes (Guoqing, Song, Yufen) - sysfs deadlock fix (Junxiao) - raid10 deadlock fix (Vitaly) - struct_size conversions (Gustavo) - Set of bcache updates/fixes (Coly) * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits) md/raid5: Allow degraded raid6 to do rmw md/raid5: Fix Force reconstruct-write io stuck in degraded raid5 raid5: don't duplicate code for different paths in handle_stripe raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show md: print errno in super_written md/raid5: remove the redundant setting of STRIPE_HANDLE md: register new md sysfs file 'uuid' read-only md: fix max sectors calculation for super 1.0 nvme-loop: remove extra variable in create ctrl nvme-loop: set ctrl state connecting after init nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths nvme-multipath: fix logic for non-optimized paths nvme-rdma: fix controller reset hang during traffic nvme-tcp: fix controller reset hang during traffic nvmet: introduce the passthru Kconfig option nvmet: introduce the passthru configfs interface nvmet: Add passthru enable/disable helpers nvmet: add passthru code to process commands nvme: export nvme_find_get_ns() and nvme_put_ns() nvme: introduce nvme_ctrl_get_by_path() ...
2020-08-04Merge tag 'docs-5.9' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "It's been a busy cycle for documentation - hopefully the busiest for a while to come. Changes include: - Some new Chinese translations - Progress on the battle against double words words and non-HTTPS URLs - Some block-mq documentation - More RST conversions from Mauro. At this point, that task is essentially complete, so we shouldn't see this kind of churn again for a while. Unless we decide to switch to asciidoc or something...:) - Lots of typo fixes, warning fixes, and more" * tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits) scripts/kernel-doc: optionally treat warnings as errors docs: ia64: correct typo mailmap: add entry for <alobakin@marvell.com> doc/zh_CN: add cpu-load Chinese version Documentation/admin-guide: tainted-kernels: fix spelling mistake MAINTAINERS: adjust kprobes.rst entry to new location devices.txt: document rfkill allocation PCI: correct flag name docs: filesystems: vfs: correct flag name docs: filesystems: vfs: correct sync_mode flag names docs: path-lookup: markup fixes for emphasis docs: path-lookup: more markup fixes docs: path-lookup: fix HTML entity mojibake CREDITS: Replace HTTP links with HTTPS ones docs: process: Add an example for creating a fixes tag doc/zh_CN: add Chinese translation prefer section doc/zh_CN: add clearing-warn-once Chinese version doc/zh_CN: add admin-guide index doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label futex: MAINTAINERS: Re-add selftests directory ...
2020-08-04Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-08-04dm: don't call report zones for more than the user requestedJohannes Thumshirn
Don't call report zones for more zones than the user actually requested, otherwise this can lead to out-of-bounds accesses in the callback functions. Such a situation can happen if the target's ->report_zones() callback function returns 0 because we've reached the end of the target and then restart the report zones on the second target. We're again calling into ->report_zones() and ultimately into the user supplied callback function but when we're not subtracting the number of zones already processed this may lead to out-of-bounds accesses in the user callbacks. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: d41003513e61 ("block: rework zone reporting") Cc: stable@vger.kernel.org # v5.5+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04dm ebs: Fix incorrect checking for REQ_OP_FLUSHJohn Dorminy
REQ_OP_FLUSH was being treated as a flag, but the operation part of bio->bi_opf must be treated as a whole. Change to accessing the operation part via bio_op(bio) and checking for equality. Signed-off-by: John Dorminy <jdorminy@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Fixes: d3c7b35c20d60 ("dm: add emulated block size target") Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04dm init: Set file local variable staticDamien Le Moal
Declare dm_allowed_targets as static to avoid the warning: drivers/md/dm-init.c:39:12: warning: symbol 'dm_allowed_targets' was not declared. Should it be static? when compiling with C=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04dm ioctl: Fix compilation warningDamien Le Moal
In retrieve_status(), when copying the target type name in the target_type string field of struct dm_target_spec, copy at most DM_MAX_TYPE_NAME - 1 character to avoid the compilation warning: warning: ‘__builtin_strncpy’ specified bound 16 equals destination size [-Wstringop-truncation] when compiling with W-1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04dm raid: Remove empty if statementDamien Le Moal
In super_init_validation(), remove a body-less if statement testing only variables to avoid a compilation warning when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-04dm verity: Fix compilation warningDamien Le Moal
For the case !CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG, declare the functions verity_verify_root_hash(), verity_verify_is_sig_opt_arg(), verity_verify_sig_parse_opt_args() and verity_verify_sig_opts_cleanup() as inline to avoid a "no previous prototype for xxx" compilation warning when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-08-03Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "Good amount of cleanups and tech debt removals in here, and as a result, the diffstat shows a nice net reduction in code. - Softirq completion cleanups (Christoph) - Stop using ->queuedata (Christoph) - Cleanup bd claiming (Christoph) - Use check_events, moving away from the legacy media change (Christoph) - Use inode i_blkbits consistently (Christoph) - Remove old unused writeback congestion bits (Christoph) - Cleanup/unify submission path (Christoph) - Use bio_uninit consistently, instead of bio_disassociate_blkg (Christoph) - sbitmap cleared bits handling (John) - Request merging blktrace event addition (Jan) - sysfs add/remove race fixes (Luis) - blk-mq tag fixes/optimizations (Ming) - Duplicate words in comments (Randy) - Flush deferral cleanup (Yufen) - IO context locking/retry fixes (John) - struct_size() usage (Gustavo) - blk-iocost fixes (Chengming) - blk-cgroup IO stats fixes (Boris) - Various little fixes" * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits) block: blk-timeout: delete duplicated word block: blk-mq-sched: delete duplicated word block: blk-mq: delete duplicated word block: genhd: delete duplicated words block: elevator: delete duplicated word and fix typos block: bio: delete duplicated words block: bfq-iosched: fix duplicated word iocost_monitor: start from the oldest usage index iocost: Fix check condition of iocg abs_vdebt block: Remove callback typedefs for blk_mq_ops block: Use non _rcu version of list functions for tag_set_list blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get ...
2020-08-02md/raid5: Allow degraded raid6 to do rmwChangSyun Peng
Degraded raid6 always do reconstruct-write now. With raid6 xor supported, we can do rmw in degraded raid6. This patch can reduce many read IOs to improve performance. If the failed disk is P, Q or the disk we want to write to, we may need to do reconstruct-write in max degraded raid6. In this situation we can not read enough data from handle_stripe_dirtying() so we have to set force_rcw in handle_stripe_fill() to read all data. Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: Danny Shih <dannyshih@synology.com> Signed-off-by: ChangSyun Peng <allenpeng@synology.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md/raid5: Fix Force reconstruct-write io stuck in degraded raid5ChangSyun Peng
In degraded raid5, we need to read parity to do reconstruct-write when data disks fail. However, we can not read parity from handle_stripe_dirtying() in force reconstruct-write mode. Reproducible Steps: 1. Create degraded raid5 mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing 2. Set rmw_level to 0 echo 0 > /sys/block/md2/md/rmw_level 3. IO to raid5 Now some io may be stuck in raid5. We can use handle_stripe_fill() to read the parity in this situation. Cc: <stable@vger.kernel.org> # v4.4+ Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: Danny Shih <dannyshih@synology.com> Signed-off-by: ChangSyun Peng <allenpeng@synology.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02raid5: don't duplicate code for different paths in handle_stripeGuoqing Jiang
As we can see, R5_LOCKED is set and s.locked is increased whether R5_ReWrite is set or not, so move it to common path. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_showGuoqing Jiang
Replace mddev_lock with spin_lock to align with other show methods in raid5_attrs. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md: print errno in super_writtenGuoqing Jiang
It is better to print errno instead of bi_status. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md/raid5: remove the redundant setting of STRIPE_HANDLEGuoqing Jiang
The flag is already set before compare rcw with rmw, so it is not necessary to do it again. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md: register new md sysfs file 'uuid' read-onlySebastian Parschauer
Report the UUID of the MD array in the following format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx This is useful if you don't want to wait for udev to identify array. And it is also easy for script to monitor it with the format. Signed-off-by: Sebastian Parschauer <s.parschauer@gmx.de> [Guoqing: mention the change in md.rst] Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md: fix max sectors calculation for super 1.0Xiao Ni
To grow size of super 1.0 raid array, it is necessary to check the device max usable size. Now it uses rdev->sectors for max usable size. If one disk is 500G and the raid device only uses the 100GB of this disk. rdev->sectors can't tell the real max usable size. The max usable size should be dev_size-(superblock_size+bitmap_size+badblock_size). Also, remove unnecessary sb_start update in super_1_rdev_size_change(). Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-31init: add an init_stat helperChristoph Hellwig
Add a simple helper to stat with a kernel space file name and switch the early init code over to it. Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29raid5: Use sequence counter with associated spinlockAhmed S. Darwish
A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Song Liu <song@kernel.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-20-a.darwish@linutronix.de
2020-07-28bcache: use disk_{start,end}_io_acct() to count I/O for bcache deviceColy Li
This patch is a fix to patch "bcache: fix bio_{start,end}_io_acct with proper device". The previous patch uses a hack to temporarily set bi_disk to bcache device, which is mistaken too. As Christoph suggests, this patch uses disk_{start,end}_io_acct() to count I/O for bcache device in the correct way. Fixes: 85750aeb748f ("bcache: use bio_{start,end}_io_acct") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: fix bio_{start,end}_io_acct with proper deviceColy Li
Commit 85750aeb748f ("bcache: use bio_{start,end}_io_acct") moves the io account code to the location after bio_set_dev(bio, dc->bdev) in cached_dev_make_request(). Then the account is performed incorrectly on backing device, indeed the I/O should be counted to bcache device like /dev/bcache0. With the mistaken I/O account, iostat does not display I/O counts for bcache device and all the numbers go to backing device. In writeback mode, the hard drive may have 340K+ IOPS which is impossible and wrong for spinning disk. This patch introduces bch_bio_start_io_acct() and bch_bio_end_io_acct(), which switches bio->bi_disk to bcache device before calling bio_start_io_acct() or bio_end_io_acct(). Now the I/Os are counted to bcache device, and bcache device, cache device and backing device have their correct I/O count information back. Fixes: 85750aeb748f ("bcache: use bio_{start,end}_io_acct") Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: avoid extra memory consumption in struct bbio for large bucket sizeColy Li
Bcache uses struct bbio to do I/Os for meta data pages like uuids, disk_buckets, prio_buckets, and btree nodes. Example writing a btree node onto cache device, the process is, - Allocate a struct bbio from mempool c->bio_meta. - Inside struct bbio embedded a struct bio, initialize bi_inline_vecs for this embedded bio. - Call bch_bio_map() to map each meta data page to each bv from the inlined bi_io_vec table. - Call bch_submit_bbio() to submit the bio into underlying block layer. - When the I/O completed, only release the struct bbio, don't touch the reference counter of the meta data pages. The struct bbio is defined as, 738 struct bbio { 739 unsigned int submit_time_us; [snipped] 748 struct bio bio; 749 }; Because struct bio is embedded at the end of struct bbio, therefore the actual size of struct bbio is sizeof(struct bio) + size of the embedded bio->bi_inline_vecs. Now all the meta data bucket size are limited to meta_bucket_pages(), if the bucket size is large than meta_bucket_pages()*PAGE_SECTORS, rested space in the bucket is unused. Therefore the most used space in meta bucket is (1<<MAX_ORDER) pages, or (1<<CONFIG_FORCE_MAX_ZONEORDER) if it is configured. Therefore for large bucket size, it is unnecessary to calculate the allocation size of mempool c->bio_meta as, mempool_init_kmalloc_pool(&c->bio_meta, 2, sizeof(struct bbio) + sizeof(struct bio_vec) * bucket_pages(c)) It is too large, neither the Linux buddy allocator cannot allocate so much continuous pages, nor the extra allocated pages are wasted. This patch replace bucket_pages() to meta_bucket_pages() in two places, - In bch_cache_set_alloc(), when initialize mempool c->bio_meta, uses sizeof(struct bbio) + sizeof(struct bio_vec) * bucket_pages(c) to set the allocating object size. - In bch_bbio_alloc(), when calling bio_init() to set inline bvec talbe bi_inline_bvecs, uses meta_bucket_pages() to indicate number of the inline bio vencs number. Now the maximum size of embedded bio inside struct bbio exactly matches the limit of meta_bucket_pages(), no extra page wasted. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: avoid extra memory allocation from mempool c->fill_iterColy Li
Mempool c->fill_iter is used to allocate memory for struct btree_iter in bch_btree_node_read_done() to iterate all keys of a read-in btree node. The allocation size is defined in bch_cache_set_alloc() by, mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size)) where iter_size is defined by a calculation, (sb->bucket_size / sb->block_size + 1) * sizeof(struct btree_iter_set) For 16bit width bucket_size the calculation is OK, but now the bucket size is extended to 32bit, the bucket size can be 2GB. By the above calculation, iter_size can be 2048 pages (order 11 is still accepted by buddy allocator). But the actual size holds the bkeys in meta data bucket is limited to meta_bucket_pages() already, which is 16MB. By the above calculation, if replace sb->bucket_size by meta_bucket_pages() * PAGE_SECTORS, the result is 16 pages. This is the size large enough for the mempool allocation to struct btree_iter. Therefore in worst case every time mempool c->fill_iter allocates, at most 4080 pages are wasted and won't be used. Therefore this patch uses meta_bucket_pages() * PAGE_SECTORS to calculate the iter size in bch_cache_set_alloc(), to avoid extra memory allocation from mempool c->fill_iter. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>