summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-08-02md/raid5: Check all disks in a stripe_head for reshape progressLogan Gunthorpe
When testing if a previous stripe has had reshape expand past it, use the earliest or latest logical sector in all the disks for that stripe head. This will allow adding multiple disks at a time in a subesquent patch. To do this cleaner, refactor the check into a helper function called stripe_ahead_of_reshape(). Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Refactor add_stripe_bio()Logan Gunthorpe
Factor out two helper functions from add_stripe_bio(): one to check for overlap (stripe_bio_overlaps()), and one to actually add the bio to the stripe (__add_stripe_bio()). The latter function will always succeed. This will be useful in the next patch so that overlap can be checked for multiple disks before adding any Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Keep a reference to last stripe_head for batchLogan Gunthorpe
When batching, every stripe head has to find the previous stripe head to add to the batch list. This involves taking the hash lock which is highly contended during IO. Instead of finding the previous stripe_head each time, store a reference to the previous stripe_head in a pointer so that it doesn't require taking the contended lock another time. The reference to the previous stripe must be released before scheduling and waiting for work to get done. Otherwise, it can hold up raid5_activate_delayed() and deadlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Refactor for loop in raid5_make_request() into while loopLogan Gunthorpe
The for loop with retry label can be more cleanly expressed as a while loop by moving the logical_sector increment into the success path. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Move read_seqcount_begin() into make_stripe_request()Logan Gunthorpe
Now that prepare_to_wait() isn't in the way, move read_sequcount_begin() into make_stripe_request(). No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Drop the do_prepare flag in raid5_make_request()Logan Gunthorpe
prepare_to_wait() can be reasonably called after schedule instead of setting a flag and preparing in the next loop iteration. This means that prepare_to_wait() will be called before read_seqcount_begin(), but there shouldn't be any reason that the order matters here. On the first iteration of the loop prepare_to_wait() is already called first. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Factor out helper from raid5_make_request() loopLogan Gunthorpe
Factor out the inner loop of raid5_make_request() into it's own helper called make_stripe_request(). The helper returns a number of statuses: SUCCESS, RETRY, SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to understand and allows the SCHEDULE_AND_RETRY path to be made common. A context structure is added to contain do_flush. It will be used more in subsequent patches for state that needs to be kept outside the loop. No functional changes intended. This will be cleaned up further in subsequent patches to untangle the gen_lock and do_prepare logic further. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Move common stripe get code into new find_get_stripe() helperLogan Gunthorpe
Both uses of find_stripe() require a fairly complicated dance to increment the reference count. Move this into a common find_get_stripe() helper. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()Logan Gunthorpe
stripe_add_to_batch_list() is better done in the loop in make_request instead of inside add_stripe_bio(). This is clearer and allows for storing the batch_head state outside the loop in a subsequent patch. The call to add_stripe_bio() in retry_aligned_read() is for read and batching only applies to write. So it's impossible for batching to happen at that call site. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Refactor raid5_make_request loopLogan Gunthorpe
Break immediately if raid5_get_active_stripe() returns NULL and deindent the rest of the loop. Annotate this check with an unlikely(). This makes the code easier to read and reduces the indentation level. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Factor out ahead_of_reshape() functionLogan Gunthorpe
There are a few uses of an ugly ternary operator in raid5_make_request() to check if a sector is a head of a reshape sector. Factor this out into a simple helper called ahead_of_reshape(). No functional changes intended. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: Make logic blocking check consistent with logic that blocksLogan Gunthorpe
The check in raid5_make_request differs very slightly from the logic that causes it to block lower down. This likely does not cause a bug as the check is fuzzy anyway (as reshape may move on between the first check and the subsequent check). However, make it consistent so it can be cleaned up in a subsequent patch. The condition which causes the schedule is: !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress : logical_sector >= conf->reshape_progress) && (mddev->reshape_backwards ? logical_sector < conf->reshape_safe : logical_sector >= conf->reshape_safe) The condition that causes the early bailout is made to match this. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: unlock mddev before reap sync_thread in action_storeGuoqing Jiang
Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") fixed is related with action_store path, other callers which reap sync_thread didn't need to be changed. Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous bug with belows. 1. unlock mddev before md_reap_sync_thread in action_store. 2. save reshape_position before unlock, then restore it to ensure position not changed accidentally by others. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Explicitly create command-line configured devicesChris Webb
Boot-time assembly of arrays with md= command-line arguments breaks when CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c calls blkdev_get_by_dev(), assuming this implicitly creates the block device. Fix this by attempting to md_alloc() the array first. As in the probe path, ignore any error as failure is caught by blkdev_get_by_dev() anyway. Signed-off-by: Chris Webb <chris@arachsys.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Notify sysfs sync_completed in md_reap_sync_thread()Logan Gunthorpe
The mdadm test 07layouts randomly produces a kernel hung task deadlock. The deadlock is caused by the suspend_lo/suspend_hi files being set by the mdadm background process during reshape and not being cleared because the process hangs. (Leaving aside the issue of the fragility of freezing kernel tasks by buggy userspace processes...) When the background mdadm process hangs it, is waiting (without a timeout) on a change to the sync_completed file signalling that the reshape has completed. The process is woken up a couple times when the reshape finishes but it is woken up before MD_RECOVERY_RUNNING is cleared so sync_completed_show() reports 0 instead of "none". To fix this, notify the sysfs file in md_reap_sync_thread() after MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes it to continue and write to suspend_lo/suspend_hi to allow IO to continue. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Ensure resync is reported after it startsLogan Gunthorpe
The 07layouts test in mdadm fails on some systems. The failure presents itself as the backup file not being removed before the next layout is grown into: mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup: File exists This is because the background mdadm process, which is responsible for cleaning up this backup file gets into an infinite loop waiting for the reshape to start. mdadm checks the mdstat file if a reshape is going and, if it is not, it waits for an event on the file or times out in 5 seconds. On faster machines, the reshape may complete before the 5 seconds times out, and thus the background mdadm process loops waiting for a reshape to start that has already occurred. mdadm reads the mdstat file to start, but mdstat does not report that the reshape has begun, even though it has indeed begun. So the mdstat_wait() call (in mdadm) which polls on the mdstat file won't ever return until timing out. The reason mdstat reports the reshape has started is due to an issue in status_resync(). recovery_active is subtracted from curr_resync which will result in a value of zero for the first chunk of reshaped data, and the resulting read will report no reshape in progress. To fix this, if "resync - recovery_active" is an overloaded value, force the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md: Use enum for overloaded magic numbers used by mddev->curr_resyncLogan Gunthorpe
Comments in the code document special values used for mddev->curr_resync. Make this clearer by using an enum to label these values. The only functional change is a couple places use the wrong comparison operator that implied 3 is another special value. They are all fixed to imply that 3 or greater is an active resync. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Annotate pslot with __rcu notationLogan Gunthorpe
radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the slot returned and looked up to be marked with __rcu. Otherwise sparse warnings are generated: drivers/md/raid5-cache.c:2939:23: warning: incorrect type in assignment (different address spaces) drivers/md/raid5-cache.c:2939:23: expected void **pslot drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu ** Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Clear conf->log after finishing workLogan Gunthorpe
A NULL pointer dereferlence on conf->log is seen randomly with the mdadm test 21raid5cache. Kasan reporst: BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140 Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086 Call Trace: dump_stack_lvl+0x5a/0x74 kasan_report.cold+0x5f/0x1a9 __asan_load8+0x69/0x90 r5l_reclaimable_space+0xf5/0x140 r5l_do_reclaim+0xf4/0x5e0 r5l_reclaim_thread+0x69/0x3b0 md_thread+0x1a2/0x2c0 kthread+0x177/0x1b0 ret_from_fork+0x22/0x30 This is caused by conf->log being cleared in r5l_exit_log() before stopping the reclaim thread. To fix this, clear conf->log after the reclaim_thread is unregistered and after flushing disable_writeback_work. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Drop RCU usage of conf->logLogan Gunthorpe
The only place that uses RCU to access conf->log is in r5l_log_disk_error(). This function is mostly used in the IO path and once with mddev_lock() held in raid5_change_consistency_policy(). It is known that the IO will be suspended before the log is freed and r5l_log_exit() is called with the mddev_lock() held. This should mean that conf->log can not be freed while the function is being called, so the RCU protection is not necessary. Drop the rcu_read_lock() as well as the synchronize_rcu() and rcu_assign_pointer() usage. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-cache: Take mddev_lock in r5c_journal_mode_show()Logan Gunthorpe
The mddev->lock spinlock doesn't protect against the removal of conf->log in r5l_exit_log() so conf->log may be freed before it is used. To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5: suspend the array for calls to log_exit()Logan Gunthorpe
The raid5-cache code relies on there being no IO in flight when log_exit() is called. There are two places where this is not guaranteed so add mddev_suspend() and mddev_resume() calls to these sites. The site in raid5_change_consistency_policy() is in the error path, and another similar call site already has suspend/resume calls just below it; so it should be equally safe to make that change here. There is one remaining site in raid5_remove_disk() that we call log_exit() without suspending the array. Unfortunately, as the comment stated, we cannot call mddev_suspend from raid5d. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()Logan Gunthorpe
ppl_handle_flush_request() takes an struct r5log argument but doesn't use it. It has no buisiness taking this argument as it is only used by raid5-cache and has no way to derference it anyway. Remove the argument. No functional changes intended. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02md/raid5-log: Drop extern decorators for function prototypesLogan Gunthorpe
extern is not necessary and recommended against when defining prototype functions in headers. checkpatch.pl complains about these. So remove them. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02MAINTAINERS: add patchwork link to linux-raid projectSong Liu
Add link to patchwork: https://patchwork.kernel.org/project/linux-raid/list/ Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-02drbd: bm_page_async_io: fix spurious bitmap "IO error" on large volumesLars Ellenberg
We usually do all our bitmap IO in units of PAGE_SIZE. With very small or oddly sized external meta data, or with PAGE_SIZE != 4k, it can happen that our last on-disk bitmap page is not fully PAGE_SIZE aligned, so we may need to adjust the size of the IO. We used to do that with min_t(unsigned int, PAGE_SIZE, last_allowed_sector - current_offset); And for just the right diff, (unsigned int)(diff) will result in 0. A bio of length 0 will correctly be rejected with an IO error (and some scary WARN_ON_ONCE()) by the scsi layer. Do the calculation properly. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20220622204932.196830-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-03ata: sata_mv: Fixes expected number of resources now IRQs are goneAndrew Lunn
The commit a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT core") stopped IRQ resources being available as platform resources. This broke the sanity check for the expected number of resources in the Marvell SATA driver which expected two resources, the IO memory and the interrupt. Change the sanity check to only expect the IO memory. Cc: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Fixes: a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT core") Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-08-03libceph: fix ceph_pagelist_reserve() comment typoJason Wang
The double `without' is duplicated in the comment, remove one. Signed-off-by: Jason Wang <wangborong@cdjrlc.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: remove useless check for the folioXiubo Li
The netfs_write_begin() won't set the folio if the return value is non-zero. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: don't truncate file in atomic_openHu Weiwen
Clear O_TRUNC from the flags sent in the MDS create request. `atomic_open' is called before permission check. We should not do any modification to the file here. The caller will do the truncation afterward. Fixes: 124e68e74099 ("ceph: file operations") Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: make f_bsize always equal to f_frsizeXiubo Li
The f_frsize maybe changed in the quota size is less than the defualt 4MB. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: flush the dirty caps immediatelly when quota is approachingXiubo Li
When the quota is approaching we need to notify it to the MDS as soon as possible, or the client could write to the directory more than expected. This will flush the dirty caps without delaying after each write, though this couldn't prevent the real size of a directory exceed the quota but could prevent it as soon as possible. Link: https://tracker.ceph.com/issues/56180 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03libceph: print fsid and epoch with osd idDaichi Mukai
Print fsid and epoch in libceph log messages to distinct from which each message come. [ idryomov: don't bother with gid for now, print epoch instead ] Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> Signed-off-by: Daichi Mukai <daichi-mukai@cybozu.co.jp> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03libceph: check pointer before assigned to "c->rules[]"Li Qiong
It should be better to check pointer firstly, then assign it to c->rules[]. Refine code a little bit. Signed-off-by: Li Qiong <liqiong@nfschina.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: don't get the inline data for new creating filesXiubo Li
If the 'i_inline_version' is 1, that means the file is just new created and there shouldn't have any inline data in it, we should skip retrieving the inline data from MDS. This also could help reduce possiblity of dead lock issue introduce by the inline data and Fcr caps. Gradually we will remove the inline feature from kclient after ceph's scrub too have support to unline the inline data, currently this could help reduce the teuthology test failures. This is possiblly could also fix a bug that for some old clients if they couldn't explictly uninline the inline data when writing, the inline version will keep as 1 always. We may always reading non-exist data from inline data. Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: update the auth cap when the async create req is forwardedXiubo Li
For async create we will always try to choose the auth MDS of frag the dentry belonged to of the parent directory to send the request and ususally this works fine, but if the MDS migrated the directory to another MDS before it could be handled the request will be forwarded. And then the auth cap will be changed. We need to update the auth cap in this case before the request is forwarded. Link: https://tracker.ceph.com/issues/55857 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: make change_auth_cap_ses a global symbolXiubo Li
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: fix incorrect old_size length in ceph_mds_request_argsXiubo Li
The 'old_size' is a __le64 type since birth, not sure why the kclient incorrectly switched it to __le32. This change is okay won't break anything because union will always allocate more memory than the 'open' member needed. Rename 'file_replication' to 'pool' as ceph did. Though this 'open' struct may never be used in kclient in future, it's confusing when going through the ceph code. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: switch back to testing for NULL folio->private in ceph_dirty_folioJeff Layton
Willy requested that we change this back to warning on folio->private being non-NULl. He's trying to kill off the PG_private flag, and so we'd like to catch where it's non-NULL. Add a VM_WARN_ON_FOLIO (since it doesn't exist yet) and change over to using that instead of VM_BUG_ON_FOLIO along with testing the ->private pointer. [ xiubli: define VM_WARN_ON_FOLIO macro in case DEBUG_VM is disabled reported by kernel test robot <lkp@intel.com> ] Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: call netfs_subreq_terminated with was_async == falseJeff Layton
"was_async" is a bit misleadingly named. It's supposed to indicate whether it's safe to call blocking operations from the context you're calling it from, but it sounds like it's asking whether this was done via async operation. For ceph, this it's always called from kernel thread context so it should be safe to set this to false. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: convert to generic_file_llseekJeff Layton
There's no reason we need to lock the inode for write in order to handle an llseek. I suspect this should have been dropped in 2013 when we stopped doing vmtruncate in llseek. With that gone, ceph_llseek is functionally equivalent to generic_file_llseek, so just call that after getting the size. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: fix the incorrect comment for the ceph_mds_caps structXiubo Li
The incorrect comment is misleading. Acutally the last members in ceph_mds_caps strcut is a union for none export and export bodies. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: don't leak snap_rwsem in handle_cap_grantJeff Layton
When handle_cap_grant is called on an IMPORT op, then the snap_rwsem is held and the function is expected to release it before returning. It currently fails to do that in all cases which could lead to a deadlock. Fixes: 6f05b30ea063 ("ceph: reset i_requested_max_size if file write is not wanted") Link: https://tracker.ceph.com/issues/55857 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: prevent a client from exceeding the MDS maximum xattr sizeLuís Henriques
The MDS tries to enforce a limit on the total key/values in extended attributes. However, this limit is enforced only if doing a synchronous operation (MDS_OP_SETXATTR) -- if we're buffering the xattrs, the MDS doesn't have a chance to enforce these limits. This patch adds support for decoding the xattrs maximum size setting that is distributed in the mdsmap. Then, when setting an xattr, the kernel client will revert to do a synchronous operation if that maximum size is exceeded. While there, fix a dout() that would trigger a printk warning: [ 98.718078] ------------[ cut here ]------------ [ 98.719012] precision 65536 too large [ 98.719039] WARNING: CPU: 1 PID: 3755 at lib/vsprintf.c:2703 vsnprintf+0x5e3/0x600 ... Link: https://tracker.ceph.com/issues/55725 Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: choose auth MDS for getxattr with the Xs capsXiubo Li
And for the 'Xs' caps for getxattr we will also choose the auth MDS, because the MDS side code is buggy due to setxattr won't notify the replica MDSes when the values changed and the replica MDS will return the old values. Though we will fix it in MDS code, but this still makes sense for old ceph. Link: https://tracker.ceph.com/issues/55331 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: add session already open notify supportXiubo Li
If the connection was accidently closed due to the socket issue or something else the clients will try to open the opened sessions, the MDSes will send the session open reply one more time if the clients support the notify feature. When the clients retry to open the sessions the s_seq will be 0 as default, we need to update it anyway. Link: https://tracker.ceph.com/issues/53911 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: wait for the first reply of inflight async unlinkXiubo Li
In async unlink case the kclient won't wait for the first reply from MDS and just drop all the links and unhash the dentry and then succeeds immediately. For any new create/link/rename,etc requests followed by using the same file names we must wait for the first reply of the inflight unlink request, or the MDS possibly will fail these following requests with -EEXIST if the inflight async unlink request was delayed for some reasons. And the worst case is that for the none async openc request it will successfully open the file if the CDentry hasn't been unlinked yet, but later the previous delayed async unlink request will remove the CDenty. That means the just created file is possiblly deleted later by accident. We need to wait for the inflight async unlink requests to finish when creating new files/directories by using the same file names. Link: https://tracker.ceph.com/issues/55332 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03fs/dcache: export d_same_name() helperXiubo Li
Compare dentry name with case-exact name, return true if names are same, or false. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: remove useless CEPHFS_FEATURES_CLIENT_REQUIREDXiubo Li
This macro was added but never be used. And check the ceph code there has another CEPHFS_FEATURES_MDS_REQUIRED but always be empty. We should clean up all this related code, which make no sense but introducing confusion. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: use correct index when encoding client supported featuresLuís Henriques
Feature bits have to be encoded into the correct locations. This hasn't been an issue so far because the only hole in the feature bits was in bit 10 (CEPHFS_FEATURE_RECLAIM_CLIENT), which is located in the 2nd byte. When adding more bits that go beyond the this 2nd byte, the bug will show up. [xiubli: remove incorrect comment for CEPHFS_FEATURES_CLIENT_SUPPORTED] Fixes: 9ba1e224538a ("ceph: allocate the correct amount of extra bytes for the session features") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>