summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2018-08-11bcache: add the missing comments for smp_mb()/smp_wmb()Coly Li
Checkpatch.pl warns there are 2 locations of smp_mb() and smp_wmb() without code comment. This patch adds the missing code comments for these memory barrier calls. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: remove unnecessary space before ioctl function pointer argumentsColy Li
This is warned by checkpatch.pl, this patch removes the extra space. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: add missing SPDX headerColy Li
The SPDX header is missing fro closure.c, super.c and util.c, this patch adds SPDX header for GPL-2.0 into these files. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: move open brace at end of function definitions to next lineColy Li
This is not a preferred style to place open brace '{' at the end of function definition, checkpatch.pl reports error for such coding style. This patch moves them into the start of the next new line. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: add static const prefix to char * array declarationsColy Li
This patch declares char * array with const prefix in sysfs.c, which is suggested by checkpatch.pl. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: fix code comments styleColy Li
This patch fixes 3 style issues warned by checkpatch.pl, - Comment lines are not aligned - Comments use "/*" on subsequent lines - Comment lines use a trailing "*/" Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: do not check NULL pointer before calling kmem_cache_destroyColy Li
kmem_cache_destroy() is safe for NULL pointer as input, the NULL pointer checking is unncessary. This patch just removes the NULL pointer checking to make code simpler. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: prefer 'help' in KconfigColy Li
Current bcache Kconfig uses '---help---' as header of help information, for now 'help' is prefered. This patch fixes this style by replacing '---help---' by 'help' in bcache Kconfig file. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: fix typo 'succesfully' to 'successfully'Coly Li
This patch fixes typo 'succesfully' to correct 'successfully', which is suggested by checkpatch.pl. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: replace '%pF' by '%pS' in seq_printf()Coly Li
'%pF' and '%pf' are deprecated vsprintf pointer extensions, this patch replace them by '%pS', which is suggested by checkpatch.pl. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: fix indent by replacing blank by tabsColy Li
bch_btree_insert_check_key() has unaligned indent, or indent by blank characters. This patch makes the indent aligned and replace blank by tabs. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: replace printk() by pr_*() routinesColy Li
There are still many places in bcache use printk to display kernel message, which are suggested to be preplaced by pr_*() routines like pr_err(), pr_info(), or pr_notice(). This patch replaces all printk() with a proper pr_*() routine for bcache code. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: replace Symbolic permissions by octal permission numbersColy Li
Symbolic permission names are used in bcache, for now octal permission numbers are encouraged to use for readability. This patch replaces all symbolic permissions by octal permission numbers. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: style fixes for lines over 80 charactersColy Li
This patch fixes the lines over 80 characters into more lines, to minimize warnings by checkpatch.pl. There are still some lines exceed 80 characters, but it is better to be a single line and I don't change them. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: add identifier names to arguments of function definitionsColy Li
There are many function definitions do not have identifier argument names, scripts/checkpatch.pl complains warnings like this, WARNING: function definition argument 'struct bcache_device *' should also have an identifier name #16735: FILE: writeback.h:120: +void bch_sectors_dirty_init(struct bcache_device *); This patch adds identifier argument names to all bcache function definitions to fix such warnings. Signed-off-by: Coly Li <colyli@suse.de> Reviewed: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: style fix to add a blank line after declarationsColy Li
Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-11bcache: style fix to replace 'unsigned' by 'unsigned int'Coly Li
This patch fixes warning reported by checkpatch.pl by replacing 'unsigned' with 'unsigned int'. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-10bcache: fix error setting writeback_rate through sysfs interfaceColy Li
Commit ea8c5356d390 ("bcache: set max writeback rate when I/O request is idle") changes struct bch_ratelimit member rate from uint32_t to atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c to set new writeback rate, after the input is converted from memory buf to long int by sysfs_strtoul_clamp(). The above change has a problem because there is an implicit return inside sysfs_strtoul_clamp() so the following atomic_long_set() won't be called. This error is detected by 0day system with following snipped smatch warnings: drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized symbol 'v'. 270 sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @271 atomic_long_set(&dc->writeback_rate.rate, v); This patch fixes the above error by using strtoul_safe_clamp() to convert the input buffer into a long int type result. Fixes: ea8c5356d390 ("bcache: set max writeback rate when I/O request is idle") Cc: Kai Krakow <kai@kaishome.de> Cc: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: trivial - remove tailing backslash in macro BTREE_FLAGShenghui Wang
Remove the tailing backslash in macro BTREE_FLAG in btree.h Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: make the pr_err statement used for ENOENT only in sysfs_attatch sectionShenghui Wang
The pr_err statement in the code for sysfs_attatch section would run for various error codes, which maybe confusing. E.g, Run the command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach [the backing dev got attached on the first run] echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg, after the command run twice, we can get: bcache: bch_cached_dev_attach() Can't attach sda6: already attached bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be891 : cache set not found The first statement in the message was right, but the second was confusing. bch_cached_dev_attach has various pr_ statements for various error codes, except ENOENT. After the change, rerun above command twice: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \ /sys/block/bcache0/bcache/attach In dmesg we only got: bcache: bch_cached_dev_attach() Can't attach sda6: already attached No confusing "cache set not found" message anymore. And for some not exist SET-UUID: echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \ /sys/block/bcache0/bcache/attach In dmesg we can get: bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\ a8df5e8be898 : cache set not found Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: set max writeback rate when I/O request is idleColy Li
Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") allows the writeback rate to be faster if there is no I/O request on a bcache device. It works well if there is only one bcache device attached to the cache set. If there are many bcache devices attached to a cache set, it may introduce performance regression because multiple faster writeback threads of the idle bcache devices will compete the btree level locks with the bcache device who have I/O requests coming. This patch fixes the above issue by only permitting fast writebac when all bcache devices attached on the cache set are idle. And if one of the bcache devices has new I/O request coming, minimized all writeback throughput immediately and let PI controller __update_writeback_rate() to decide the upcoming writeback rate for each bcache device. Also when all bcache devices are idle, limited wrieback rate to a small number is wast of thoughput, especially when backing devices are slower non-rotation devices (e.g. SATA SSD). This patch sets a max writeback rate for each backing device if the whole cache set is idle. A faster writeback rate in idle time means new I/Os may have more available space for dirty data, and people may observe a better write performance then. Please note bcache may change its cache mode in run time, and this patch still works if the cache mode is switched from writeback mode and there is still dirty data on cache. Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") Cc: stable@vger.kernel.org #4.16+ Signed-off-by: Coly Li <colyli@suse.de> Tested-by: Kai Krakow <kai@kaishome.de> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Cc: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: add code comments for bset.cColy Li
This patch tries to add code comments in bset.c, to make some tricky code and designment to be more comprehensible. Most information of this patch comes from the discussion between Kent and I, he offers very informative details. If there is any mistake of the idea behind the code, no doubt that's from me misrepresentation. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: fix mistaken comments in request.cColy Li
This patch updates code comment in bch_keylist_realloc() by fixing incorrected function names, to make the code to be more comprehennsible. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: fix mistaken code comments in bcache.hColy Li
This patch updates the code comment in struct cache with correct array names, to make the code to be more comprehensible. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: add a comment in super.cColy Li
This patch adds a line of code comment in super.c:register_bdev(), to make code to be more comprehensible. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: avoid unncessary cache prefetch bch_btree_node_get()Coly Li
In bch_btree_node_get() the read-in btree node will be partially prefetched into L1 cache for following bset iteration (if there is). But if the btree node read is failed, the perfetch operations will waste L1 cache space. This patch checkes whether read operation and only does cache prefetch when read I/O succeeded. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: display rate debug parameters to 0 when writeback is not runningColy Li
When writeback is not running, writeback rate should be 0, other value is misleading. And the following dyanmic writeback rate debug parameters should be 0 too, rate, proportional, integral, change otherwise they are misleading when writeback is not running. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-09bcache: do not check return value of debugfs_create_dir()Coly Li
Greg KH suggests that normal code should not care about debugfs. Therefore no matter successful or failed of debugfs_create_dir() execution, it is unncessary to check its return value. There are two functions called debugfs_create_dir() and check the return value, which are bch_debug_init() and closure_debug_init(). This patch changes these two functions from int to void type, and ignore return values of debugfs_create_dir(). This patch does not fix exact bug, just makes things work as they should. Signed-off-by: Coly Li <colyli@suse.de> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: stable@vger.kernel.org Cc: Kai Krakow <kai@kaishome.de> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-05Merge tag 'v4.18-rc6' into for-4.19/block2Jens Axboe
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: stop using the deprecated get_seconds()Arnd Bergmann
The get_seconds function is deprecated now since it returns a 32-bit value that will eventually overflow, and we are replacing it throughout the kernel with ktime_get_seconds() or ktime_get_real_seconds() that return a time64_t. bcache uses get_seconds() to read the current system time and store it in the superblock as well as in uuid_entry structures that are user visible. Unfortunately, the two structures in are still limited to 32 bits, so this won't fix any real problems but will still overflow in year 2106. Let's at least document that properly, in case we get an updated format in the future it can be fixed. We still have a long time before the overflow and checking the tools at https://github.com/koverstreet/bcache-tools reveals no access to any of them. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: do not assign in if condition in bcache_device_init()Florian Schmaus
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <flo@geekplace.eu> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: do not assign in if condition in bcache_init()Florian Schmaus
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <flo@geekplace.eu> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: free heap cache_set->flush_btree in bch_journal_freeShenghui Wang
Free the cache_set->flush_bree heap memory on journal free. Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: do not assign in if condition register_bcache()Florian Schmaus
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <flo@geekplace.eu> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: fix I/O significant decline while backend devices registeringTang Junhui
I attached several backend devices in the same cache set, and produced lots of dirty data by running small rand I/O writes in a long time, then I continue run I/O in the others cached devices, and stopped a cached device, after a mean while, I register the stopped device again, I see the running I/O in the others cached devices dropped significantly, sometimes even jumps to zero. In currently code, bcache would traverse each keys and btree node to count the dirty data under read locker, and the writes threads can not get the btree write locker, and when there is a lot of keys and btree node in the registering device, it would last several seconds, so the write I/Os in others cached device are blocked and declined significantly. In this patch, when a device registering to a ache set, which exist others cached devices with running I/Os, we get the amount of dirty data of the device in an incremental way, and do not block other cached devices all the time. Patch v2: Rename some variables and macros name as Coly suggested. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: calculate the number of incremental GC nodes according to the total ↵Tang Junhui
of btree nodes This patch base on "[PATCH] bcache: finish incremental GC". Since incremental GC would stop 100ms when front side I/O comes, so when there are many btree nodes, if GC only processes constant (100) nodes each time, GC would last a long time, and the front I/Os would run out of the buckets (since no new bucket can be allocated during GC), and I/Os be blocked again. So GC should not process constant nodes, but varied nodes according to the number of btree nodes. In this patch, GC is divided into constant (100) times, so when there are many btree nodes, GC can process more nodes each time, otherwise GC will process less nodes each time (but no less than MIN_GC_NODES). Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: finish incremental GCTang Junhui
In GC thread, we record the latest GC key in gc_done, which is expected to be used for incremental GC, but in currently code, we didn't realize it. When GC runs, front side IO would be blocked until the GC over, it would be a long time if there is a lot of btree nodes. This patch realizes incremental GC, the main ideal is that, when there are front side I/Os, after GC some nodes (100), we stop GC, release locker of the btree node, and go to process the front side I/Os for some times (100 ms), then go back to GC again. By this patch, when we doing GC, I/Os are not blocked all the time, and there is no obvious I/Os zero jump problem any more. Patch v2: Rename some variables and macros name as Coly suggested. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: simplify the calculation of the total amount of flash dirty dataTang Junhui
Currently we calculate the total amount of flash only devices dirty data by adding the dirty data of each flash only device under registering locker. It is very inefficient. In this patch, we add a member flash_dev_dirty_sectors in struct cache_set to record the total amount of flash only devices dirty data in real time, so we didn't need to calculate the total amount of dirty data any more. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24md: remove a bogus commentChristoph Hellwig
The function name mentioned doesn't exist, and the code next to it doesn't match the description either. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24bcache: don't clone bio in bch_data_verifyChristoph Hellwig
We immediately overwrite the biovec array, so instead just allocate a new bio and copy over the disk, setor and size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-20Merge tag 'for-4.18/dm-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mike Snitzer: "Fix DM writecache target to allow an optional offset to the start of the data and metadata area. This allows userspace tools (e.g. LVM2) to place a header and metadata at the front of the writecache device for its use" * tag 'for-4.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm writecache: support optional offset for start of device
2018-07-18block: Add and use op_stat_group() for indexing disk_stat fields.Michael Callahan
Add and use a new op_stat_group() function for indexing partition stat fields rather than indexing them by rq_data_dir() or bio_data_dir(). This function works similarly to op_is_sync() in that it takes the request::cmd_flags or bio::bi_opf flags and determines which stats should et updated. In addition, the second parameter to generic_start_io_acct() and generic_end_io_acct() is now a REQ_OP rather than simply a read or write bit and it uses op_stat_group() on the parameter to determine the stat group. Note that the partition in_flight counts are not part of the per-cpu statistics and as such are not indexed via this function. It's now indexed by op_is_write(). tj: Refreshed on top of v4.17. Updated to pass around REQ_OP. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Matias Bjorling <mb@lightnvm.io> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18block: Add part_stat_read_accum to read across field entries.Michael Callahan
Add a part_stat_read_accum macro to genhd.h to read and sum across field entries. For example to sum up the number read and write sectors completed. In addition to being ar reasonable cleanup by itself this will make it easier to add new stat fields in the future. tj: Refreshed on top of v4.17. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-02dm writecache: support optional offset for start of deviceMikulas Patocka
Add an optional parameter "start_sector" to allow the start of the device to be offset by the specified number of 512-byte sectors. The sectors below this offset are not used by the writecache device and are left to be used for disk labels and/or userspace metadata (e.g. lvm). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-07-02Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: "Two small fixes for MD: - an error handling fix from me - a recover bug fix for raid10 from BingJing" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/raid10: fix that replacement cannot complete recovery after reassemble MD: cleanup resources in failure
2018-06-28dm: prevent DAX mounts if not supportedRoss Zwisler
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX flag is set on the device's request queue to decide whether or not the device supports filesystem DAX. Really we should be using bdev_dax_supported() like filesystems do at mount time. This performs other tests like checking to make sure the dax_direct_access() path works. We also explicitly clear QUEUE_FLAG_DAX on the DM device's request queue if any of the underlying devices do not support DAX. This makes the handling of QUEUE_FLAG_DAX consistent with the setting/clearing of most other flags in dm_table_set_restrictions(). Now that bdev_dax_supported() explicitly checks for QUEUE_FLAG_DAX, this will ensure that filesystems built upon DM devices will only be able to mount with DAX if all underlying devices also support DAX. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support") Cc: stable@vger.kernel.org Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-28md/raid10: fix that replacement cannot complete recovery after reassembleBingJing Chang
During assemble, the spare marked for replacement is not checked. conf->fullsync cannot be updated to be 1. As a result, recovery will treat it as a clean array. All recovering sectors are skipped. Original device is replaced with the not-recovered spare. mdadm -C /dev/md0 -l10 -n4 -pn2 /dev/loop[0123] mdadm /dev/md0 -a /dev/loop4 mdadm /dev/md0 --replace /dev/loop0 mdadm -S /dev/md0 # stop array during recovery mdadm -A /dev/md0 /dev/loop[01234] After reassemble, you can see recovery go on, but it completes immediately. In fact, recovery is not actually processed. To solve this problem, we just add the missing logics for replacment spares. (In raid1.c or raid5.c, they have already been checked.) Reported-by: Alex Chen <alexchen@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-06-27dm thin: handle running out of data space vs concurrent discardMike Snitzer
Discards issued to a DM thin device can complete to userspace (via fstrim) _before_ the metadata changes associated with the discards is reflected in the thinp superblock (e.g. free blocks). As such, if a user constructs a test that loops repeatedly over these steps, block allocation can fail due to discards not having completed yet: 1) fill thin device via filesystem file 2) remove file 3) fstrim From initial report, here: https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html "The root cause of this issue is that dm-thin will first remove mapping and increase corresponding blocks' reference count to prevent them from being reused before DISCARD bios get processed by the underlying layers. However. increasing blocks' reference count could also increase the nr_allocated_this_transaction in struct sm_disk which makes smd->old_ll.nr_allocated + smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks. In this case, alloc_data_block() will never commit metadata to reset the begin pointer of struct sm_disk, because sm_disk_get_nr_free() always return an underflow value." While there is room for improvement to the space-map accounting that thinp is making use of: the reality is this test is inherently racey and will result in the previous iteration's fstrim's discard(s) completing vs concurrent block allocation, via dd, in the next iteration of the loop. No amount of space map accounting improvements will be able to allow user's to use a block before a discard of that block has completed. So the best we can really do is allow DM thinp to gracefully handle such aggressive use of all the pool's data by degrading the pool into out-of-data-space (OODS) mode. We _should_ get that behaviour already (if space map accounting didn't falsely cause alloc_data_block() to believe free space was available).. but short of that we handle the current reality that dm_pool_alloc_data_block() can return -ENOSPC. Reported-by: Dennis Yang <dennisyang@qnap.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22dm raid: don't use 'const' in function returnArnd Bergmann
A newly introduced function has 'const int' as the return type, but as "make W=1" reports, that has no meaning: drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers] This changes the return type to plain 'int'. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 33e53f06850f ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping") Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: 552aa679f2657431 ("dm raid: use rs_is_raid*()") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-06-22dm zoned: avoid triggering reclaim from inside dmz_map()Bart Van Assche
This patch avoids that lockdep reports the following: ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc1 #62 Not tainted ------------------------------------------------------ kswapd0/84 is trying to acquire lock: 00000000c313516d (&xfs_nondir_ilock_class){++++}, at: xfs_free_eofblocks+0xa2/0x1e0 but task is already holding lock: 00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}: kmem_cache_alloc+0x2c/0x2b0 radix_tree_node_alloc.constprop.19+0x3d/0xc0 __radix_tree_create+0x161/0x1c0 __radix_tree_insert+0x45/0x210 dmz_map+0x245/0x2d0 [dm_zoned] __map_bio+0x40/0x260 __split_and_process_non_flush+0x116/0x220 __split_and_process_bio+0x81/0x180 __dm_make_request.isra.32+0x5a/0x100 generic_make_request+0x36e/0x690 submit_bio+0x6c/0x140 mpage_readpages+0x19e/0x1f0 read_pages+0x6d/0x1b0 __do_page_cache_readahead+0x21b/0x2d0 force_page_cache_readahead+0xc4/0x100 generic_file_read_iter+0x7c6/0xd20 __vfs_read+0x102/0x180 vfs_read+0x9b/0x140 ksys_read+0x55/0xc0 do_syscall_64+0x5a/0x1f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #1 (&dmz->chunk_lock){+.+.}: dmz_map+0x133/0x2d0 [dm_zoned] __map_bio+0x40/0x260 __split_and_process_non_flush+0x116/0x220 __split_and_process_bio+0x81/0x180 __dm_make_request.isra.32+0x5a/0x100 generic_make_request+0x36e/0x690 submit_bio+0x6c/0x140 _xfs_buf_ioapply+0x31c/0x590 xfs_buf_submit_wait+0x73/0x520 xfs_buf_read_map+0x134/0x2f0 xfs_trans_read_buf_map+0xc3/0x580 xfs_read_agf+0xa5/0x1e0 xfs_alloc_read_agf+0x59/0x2b0 xfs_alloc_pagf_init+0x27/0x60 xfs_bmap_longest_free_extent+0x43/0xb0 xfs_bmap_btalloc_nullfb+0x7f/0xf0 xfs_bmap_btalloc+0x428/0x7c0 xfs_bmapi_write+0x598/0xcc0 xfs_iomap_write_allocate+0x15a/0x330 xfs_map_blocks+0x1cf/0x3f0 xfs_do_writepage+0x15f/0x7b0 write_cache_pages+0x1ca/0x540 xfs_vm_writepages+0x65/0xa0 do_writepages+0x48/0xf0 __writeback_single_inode+0x58/0x730 writeback_sb_inodes+0x249/0x5c0 wb_writeback+0x11e/0x550 wb_workfn+0xa3/0x670 process_one_work+0x228/0x670 worker_thread+0x3c/0x390 kthread+0x11c/0x140 ret_from_fork+0x3a/0x50 -> #0 (&xfs_nondir_ilock_class){++++}: down_read_nested+0x43/0x70 xfs_free_eofblocks+0xa2/0x1e0 xfs_fs_destroy_inode+0xac/0x270 dispose_list+0x51/0x80 prune_icache_sb+0x52/0x70 super_cache_scan+0x127/0x1a0 shrink_slab.part.47+0x1bd/0x590 shrink_node+0x3b5/0x470 balance_pgdat+0x158/0x3b0 kswapd+0x1ba/0x600 kthread+0x11c/0x140 ret_from_fork+0x3a/0x50 other info that might help us debug this: Chain exists of: &xfs_nondir_ilock_class --> &dmz->chunk_lock --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&dmz->chunk_lock); lock(fs_reclaim); lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/84: #0: 00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 #1: 000000000f8208f5 (shrinker_rwsem){++++}, at: shrink_slab.part.47+0x3f/0x590 #2: 00000000cacefa54 (&type->s_umount_key#43){.+.+}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 7 PID: 84 Comm: kswapd0 Not tainted 4.18.0-rc1 #62 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015 Call Trace: dump_stack+0x85/0xcb print_circular_bug.isra.36+0x1ce/0x1db __lock_acquire+0x124e/0x1310 lock_acquire+0x9f/0x1f0 down_read_nested+0x43/0x70 xfs_free_eofblocks+0xa2/0x1e0 xfs_fs_destroy_inode+0xac/0x270 dispose_list+0x51/0x80 prune_icache_sb+0x52/0x70 super_cache_scan+0x127/0x1a0 shrink_slab.part.47+0x1bd/0x590 shrink_node+0x3b5/0x470 balance_pgdat+0x158/0x3b0 kswapd+0x1ba/0x600 kthread+0x11c/0x140 ret_from_fork+0x3a/0x50 Reported-by: Masato Suzuki <masato.suzuki@wdc.com> Fixes: 4218a9554653 ("dm zoned: use GFP_NOIO in I/O path") Cc: <stable@vger.kernel.org> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>