summaryrefslogtreecommitdiff
path: root/fs/ceph/mds_client.c
AgeCommit message (Collapse)Author
2020-03-30ceph: consider inode's last read/write when calculating wanted capsYan, Zheng
Add i_last_rd and i_last_wr to ceph_inode_info. These fields are used to track the last time the client acquired read/write caps for the inode. If there is no read/write on an inode for 'caps_wanted_delay_max' seconds, __ceph_caps_file_wanted() does not request caps for read/write even there are open files. Call __ceph_touch_fmode() for dir operations. __ceph_caps_file_wanted() calculates dir's wanted caps according to last dir read/modification. If there is recent dir read, dir inode wants CEPH_CAP_ANY_SHARED caps. If there is recent dir modification, also wants CEPH_CAP_FILE_EXCL. Readdir is a special case. Dir inode wants CEPH_CAP_FILE_EXCL after readdir, as with that, modifications do not need to release CEPH_CAP_FILE_SHARED or invalidate all dentry leases issued by readdir. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: always renew caps if mds_wanted is insufficientYan, Zheng
Original code only renews caps for inodes with CEPH_I_CAP_DROPPED flag, which indicates that mds has closed the session and caps were dropped. Remove this flag in preparation for not requesting caps for idle open files. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: cache layout in parent dir on first sync createJeff Layton
If a create is done, then typically we'll end up writing to the file soon afterward. We don't want to wait for the reply before doing that when doing an async create, so that means we need the layout for the new file before we've gotten the response from the MDS. All files created in a directory will initially inherit the same layout, so copy off the requisite info from the first synchronous create in the directory, and save it in a new i_cached_layout field. Zero out the layout when we lose Dc caps in the dir. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: add new MDS req field to hold delegated inode numberJeff Layton
Add new request field to hold the delegated inode number. Encode that into the message when it's set. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: decode interval_sets for delegated inosJeff Layton
Starting in Octopus, the MDS will hand out caps that allow the client to do asynchronous file creates under certain conditions. As part of that, the MDS will delegate ranges of inode numbers to the client. Add the infrastructure to decode these ranges, and stuff them into an xarray for later consumption by the async creation code. Because the xarray code currently only handles unsigned long indexes, and those are 32-bits on 32-bit arches, we only enable the decoding when running on a 64-bit arch. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: cap tracking for async directory operationsJeff Layton
Track and correctly handle directory caps for asynchronous operations. Add aliases for Frc caps that we now designate at Dcu caps (when dealing with directories). Unlike file caps, we don't reclaim these when the session goes away, and instead preemptively release them. In-flight async dirops are instead handled during reconnect phase. The client needs to re-do a synchronous operation in order to re-get directory caps. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: add infrastructure for waiting for async create to completeJeff Layton
When we issue an async create, we must ensure that any later on-the-wire requests involving it wait for the create reply. Expand i_ceph_flags to be an unsigned long, and add a new bit that MDS requests can wait on. If the bit is set in the inode when sending caps, then don't send it and just return that it has been delayed. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: add flag to designate that a request is asynchronousJeff Layton
...and ensure that such requests are never queued. The MDS has need to know that a request is asynchronous so add flags and proper infrastructure for that. Also, delegated inode numbers and directory caps are associated with the session, so ensure that async requests are always transmitted on the first attempt and are never queued to wait for session reestablishment. If it does end up looking like we'll need to queue the request, then have it return -EJUKEBOX so the caller can reattempt with a synchronous request. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: return ETIMEDOUT errno to userland when request timed outXiubo Li
req->r_timeout is only used during mounting, so this error will be more accurate. URL: https://tracker.ceph.com/issues/44215 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: move to a dedicated slabcache for mds requestsJeff Layton
On my machine (x86_64) this struct is 952 bytes, which gets rounded up to 1024 by kmalloc. Move this to a dedicated slabcache, so we can allocate them without the extra 72 bytes of overhead per. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO}Yan, Zheng
These bits will have new meaning for directory inodes. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30ceph: register MDS request with dir inode from the startJeff Layton
When the unsafe reply to a request comes in, the request is put on the r_unsafe_dir inode's list. In future patches, we're going to need to wait on requests that may not have gotten an unsafe reply yet. Change __register_request to put the entry on the dir inode's list when the pointer is set in the request, and don't check the CEPH_MDS_R_GOT_UNSAFE flag when unregistering it. The only place that uses this list today is fsync codepath, and with the coming changes, we'll want to wait on all operations whether it has gotten an unsafe reply or not. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-02-06Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: - a set of patches that fixes various corner cases in mount and umount code (Xiubo Li). This has to do with choosing an MDS, distinguishing between laggy and down MDSes and parsing the server path. - inode initialization fixes (Jeff Layton). The one included here mostly concerns things like open_by_handle() and there is another one that will come through Al. - copy_file_range() now uses the new copy-from2 op (Luis Henriques). The existing copy-from op turned out to be infeasible for generic filesystem use; we disable the copy offload if OSDs don't support copy-from2. - a patch to link "rbd" and "block" devices together in sysfs (Hannes Reinecke) ... and a smattering of cleanups from Xiubo, Jeff and Chengguang. * tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits) rbd: set the 'device' link in sysfs ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c ceph: print name of xattr in __ceph_{get,set}xattr() douts ceph: print r_direct_hash in hex in __choose_mds() dout ceph: use copy-from2 op in copy_file_range ceph: close holes in structs ceph_mds_session and ceph_mds_request rbd: work around -Wuninitialized warning ceph: allocate the correct amount of extra bytes for the session features ceph: rename get_session and switch to use ceph_get_mds_session ceph: remove the extra slashes in the server path ceph: add possible_max_rank and make the code more readable ceph: print dentry offset in hex and fix xattr_version type ceph: only touch the caps which have the subset mask requested ceph: don't clear I_NEW until inode metadata is fully populated ceph: retry the same mds later after the new session is opened ceph: check availability of mds cluster on mount after wait timeout ceph: keep the session state until it is released ceph: add __send_request helper ceph: ensure we have a new cap before continuing in fill_inode ceph: drop unused ttl_from parameter from fill_inode ...
2020-02-05Merge branch 'imm.timestamp' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs timestamp updates from Al Viro: "More 64bit timestamp work" * 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kernfs: don't bother with timestamp truncation fs: Do not overload update_time fs: Delete timespec64_trunc() fs: ubifs: Eliminate timespec64_trunc() usage fs: ceph: Delete timespec64_trunc() usage fs: cifs: Delete usage of timespec64_trunc fs: fat: Eliminate timespec64_trunc() usage utimes: Clamp the timestamps in notify_change()
2020-01-27ceph: print r_direct_hash in hex in __choose_mds() doutXiubo Li
It's hard to read, especially when it is: ceph: __choose_mds 00000000b7bc9c15 is_hash=1 (-271041095) mode 0 At the same time, switch to __func__ to get rid of the checkpatch warning. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: allocate the correct amount of extra bytes for the session featuresXiubo Li
The total bytes may potentially be larger than 8. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: rename get_session and switch to use ceph_get_mds_sessionXiubo Li
Just in case the session's refcount reach 0 and is releasing, and if we get the session without checking it, we may encounter kernel crash. Rename get_session to ceph_get_mds_session and make it global. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: add possible_max_rank and make the code more readableXiubo Li
The m_num_mds here is actually the number for MDSs which are in up:active status, and it will be duplicated to m_num_active_mds, so remove it. Add possible_max_rank to the mdsmap struct and this will be the correctly possible largest rank boundary. Remove the special case for one mds in __mdsmap_get_random_mds(), because the validate mds rank may not always be 0. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: retry the same mds later after the new session is openedXiubo Li
If max_mds > 1 and a request is submitted that chooses a random mds rank, and the relating session is not opened yet, the request will wait until the session has been opened and resend again. Every time the request goes through __do_request, it will release the req->session first and choose a random one again, which may be a completely different rank than the one it just waited on. In the worst case, it will open all the mds sessions one by one just before the request can be successfully sent out. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: check availability of mds cluster on mount after wait timeoutXiubo Li
If all the MDS daemons are down for some reason, then the first mount attempt will fail with EIO after the mount request times out. A mount attempt will also fail with EIO if all of the MDS's are laggy. This patch changes the code to return -EHOSTUNREACH in these situations and adds a pr_info error message to help the admin determine the cause. URL: https://tracker.ceph.com/issues/4386 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: keep the session state until it is releasedXiubo Li
When reconnecting the session but if it is denied by the MDS due to client was in blacklist or something else, kclient will receive a session close reply, and we will never see the important log: "ceph: mds%d reconnect denied" And with the confusing log: "ceph: handle_session mds0 close 0000000085804730 state ??? seq 0" Let's keep the session state until its memories is released. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: add __send_request helperXiubo Li
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: fix possible long time wait during umountXiubo Li
During umount, if there has no any unsafe request in the mdsc and some requests still in-flight and not got reply yet, and if the rest requets are all safe ones, after that even all of them in mdsc are unregistered, the umount must wait until after mount_timeout seconds anyway. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: only choose one MDS who is in up:active state without laggyXiubo Li
Even the MDS is in up:active state, but it also maybe laggy. Here will skip the laggy MDSs. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27ceph: delete redundant douts in con_get/put()Chengguang Xu
We print session's refcount in debug message inside ceph_put_mds_session() and get_session(), so we don't have to print it in con_get()/__ceph_lookup_mds_session()/con_put(). Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-21ceph: hold extra reference to r_parent over life of requestJeff Layton
Currently, we just assume that it will stick around by virtue of the submitter's reference, but later patches will allow the syscall to return early and we can't rely on that reference at that point. While I'm not aware of any reports of it, Xiubo pointed out that this may fix a use-after-free. If the wait for a reply times out or is canceled via signal, and then the reply comes in after the syscall returns, the client can end up trying to access r_parent without a reference. Take an extra reference to the inode when setting r_parent and release it when releasing the request. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09ceph: trigger the reclaim work once there has enough pending capsXiubo Li
The nr in ceph_reclaim_caps_nr() is very possibly larger than 1, so we may miss it and the reclaim work couldn't triggered as expected. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09ceph: show tasks waiting on caps in debugfs caps fileJeff Layton
Add some visibility of tasks that are waiting for caps to the "caps" debugfs file. Display the tgid of the waiting task, inode number, and the caps the task needs and wants. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-09ceph: convert int fields in ceph_mount_options to unsigned intJeff Layton
Most of these values should never be negative, so convert them to unsigned values. Add some sanity checking to the parsed values, and clean up some unneeded casts. Note that while caps_max should never be negative, this patch leaves it signed, since this value ends up later being compared to a signed counter. Just ensure that userland never passes in a negative value for caps_max. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-12-08fs: ceph: Delete timespec64_trunc() usageDeepa Dinamani
Since ceph always uses ns granularity, skip the truncation which is a no-op. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: jlayton@kernel.org Cc: ceph-devel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-25ceph: don't leave ino field in ceph_mds_request_head uninitializedJeff Layton
We currently just pass junk in this field unless we're retransmitting a create, but in later patches, we'll need a mechanism to pass a delegated inode number on an initial create request. Prepare for this by ensuring this field is zeroed out. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-25ceph: tone down loglevel on ceph_mdsc_build_path warningJeff Layton
When this occurs, it usually means that we raced with a rename, and there is no need to warn in that case. Only printk if we pass the rename sequence check but still ended up with pos < 0. Either way, this doesn't warrant a KERN_ERR message. Change it to KERN_WARNING. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-15ceph: just skip unrecognized info in ceph_reply_info_extraJeff Layton
In the future, we're going to want to extend the ceph_reply_info_extra for create replies. Currently though, the kernel code doesn't accept an extra blob that is larger than the expected data. Change the code to skip over any unrecognized fields at the end of the extra blob, rather than returning -EIO. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: reconnect connection if session hang in opening stateErqi Chen
If client mds session is evicted in CEPH_MDS_SESSION_OPENING state, mds won't send session msg to client, and delayed_work skip CEPH_MDS_SESSION_OPENING state session, the session hang forever. Allow ceph_con_keepalive to reconnect a session in OPENING to avoid session hang. Also, ensure that we skip sessions in RESTARTING and REJECTED states since those states can't be resurrected by issuing a keepalive. Link: https://tracker.ceph.com/issues/41551 Signed-off-by: Erqi Chen chenerqi@gmail.com Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: eliminate session->s_trim_capsJeff Layton
It's only used to keep count of caps being trimmed, but that requires that we hold the session->s_mutex to prevent multiple trimming operations from running concurrently. We can achieve the same effect using an integer on the stack, which allows us to (eventually) not need the s_mutex. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: auto reconnect after blacklistedYan, Zheng
Make client use osd reply and session message to infer if itself is blacklisted. Client reconnect to cluster using new entity addr if it is blacklisted. Auto reconnect is limited to once every 30 minutes. Auto reconnect is disabled by default. It can be enabled/disabled by recover_session=<no|clean> mount option. In 'clean' mode, client drops any dirty data/metadata, invalidates page caches and invalidates all writable file handles. After reconnect, file locks become stale because MDS loses track of them. If an inode contains any stale file locks, read/write on the indoe are not allowed until applications release all stale file locks. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: add helper function that forcibly reconnects to ceph cluster.Yan, Zheng
It closes mds sessions, drop all caps and invalidates page caches, then use new entity address to reconnect to the cluster. After reconnect, all dirty data/metadata are dropped, file locks get lost sliently. Open files continue to work because client will try renewing caps on later read/write. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-09-16ceph: track and report error of async metadata operationYan, Zheng
Use errseq_t to track and report errors of async metadata operations, similar to how kernel handles errors during writeback. If any dirty caps or any unsafe request gets dropped during session eviction, record -EIO in corresponding inode's i_meta_err. The error will be reported by subsequent fsync, Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: add change_attr field to ceph_inode_infoJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: add btime field to ceph_inode_infoJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: remove request from waiting list before unregisterYan, Zheng
Link: https://tracker.ceph.com/issues/40339 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: don't blindly unregister session that is in opening stateYan, Zheng
handle_cap_export() may add placeholder caps to session that is in opening state. These caps' session pointer become wild after session get unregistered. The fix is not to unregister session in opening state during mds failovers, just let client to reconnect later when mds is recovered. Link: https://tracker.ceph.com/issues/40190 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg()Yan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: use READ_ONCE to access d_parent in RCU critical sectionYan, Zheng
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: carry snapshot creation time with inodesDavid Disseldorp
MDS InodeStat v3 wire structures include a trailing snapshot creation time member. Unmarshall this and retain it for a future vxattr. Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-06-27ceph: fix ceph_mdsc_build_path to not stop on first componentJeff Layton
When ceph_mdsc_build_path is handed a positive dentry, it will return a zero-length path string with the base set to that dentry. This is not what we want. Always include at least one path component in the string. ceph_mdsc_build_path has behaved this way for a long time but it didn't matter until recent d_name handling rework. Fixes: 964fff7491e4 ("ceph: use ceph_mdsc_build_path instead of clone_dentry_name") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-06-05ceph: avoid iput_final() while holding mutex or in dispatch threadYan, Zheng
iput_final() may wait for reahahead pages. The wait can cause deadlock. For example: Workqueue: ceph-msgr ceph_con_workfn [libceph] Call Trace: schedule+0x36/0x80 io_schedule+0x16/0x40 __lock_page+0x101/0x140 truncate_inode_pages_range+0x556/0x9f0 truncate_inode_pages_final+0x4d/0x60 evict+0x182/0x1a0 iput+0x1d2/0x220 iterate_session_caps+0x82/0x230 [ceph] dispatch+0x678/0xa80 [ceph] ceph_con_workfn+0x95b/0x1560 [libceph] process_one_work+0x14d/0x410 worker_thread+0x4b/0x460 kthread+0x105/0x140 ret_from_fork+0x22/0x40 Workqueue: ceph-msgr ceph_con_workfn [libceph] Call Trace: __schedule+0x3d6/0x8b0 schedule+0x36/0x80 schedule_preempt_disabled+0xe/0x10 mutex_lock+0x2f/0x40 ceph_check_caps+0x505/0xa80 [ceph] ceph_put_wrbuffer_cap_refs+0x1e5/0x2c0 [ceph] writepages_finish+0x2d3/0x410 [ceph] __complete_request+0x26/0x60 [libceph] handle_reply+0x6c8/0xa10 [libceph] dispatch+0x29a/0xbb0 [libceph] ceph_con_workfn+0x95b/0x1560 [libceph] process_one_work+0x14d/0x410 worker_thread+0x4b/0x460 kthread+0x105/0x140 ret_from_fork+0x22/0x40 In above example, truncate_inode_pages_range() waits for readahead pages while holding s_mutex. ceph_check_caps() waits for s_mutex and blocks OSD dispatch thread. Later OSD replies (for readahead) can't be handled. ceph_check_caps() also may lock snap_rwsem for read. So similar deadlock can happen if iput_final() is called while holding snap_rwsem. In general, it's not good to call iput_final() inside MDS/OSD dispatch threads or while holding any mutex. The fix is introducing ceph_async_iput(), which calls iput_final() in workqueue. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07ceph: fix unaligned access in ceph_send_cap_releasesJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07ceph: just call get_session in __ceph_lookup_mds_sessionJeff Layton
I originally thought there was a potential race here, but the fact that this is called with the mdsc->mutex held, ensures that the last reference to the session can't be put here. Still, it's clearer to just return the value from get_session here, and may prevent a bug later if we ever rework this code to be less reliant on mutexes. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-05-07ceph: move wait for mds request into helper functionJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>