summaryrefslogtreecommitdiff
path: root/drivers/block/rbd.c
AgeCommit message (Collapse)Author
2017-05-29rbd: implement REQ_OP_WRITE_ZEROESIlya Dryomov
Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag") explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES. rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will release either some or all blocks depending on whether the zeroing request is rbd_obj_bytes() aligned. This is how we currently implement discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for now. Caveats: - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other current implementations - nvme and loop - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split() is hence less helpful than blk_bio_discard_split(), but this can (and should) be fixed on the rbd side In the future we will split these into two code paths to respect REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be released on discard. Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-10Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "The two main items are support for disabling automatic rbd exclusive lock transfers from myself and the long awaited -ENOSPC handling series from Jeff. The former will allow rbd users to take advantage of exclusive lock's built-in blacklist/break-lock functionality while staying in control of who owns the lock. With the latter in place, we will abort filesystem writes on -ENOSPC instead of having them block indefinitely. Beyond that we've got the usual pile of filesystem fixes from Zheng, some refcount_t conversion patches from Elena and a patch for an ancient open() flags handling bug from Alexander" * tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits) ceph: fix memory leak in __ceph_setxattr() ceph: fix file open flags on ppc64 ceph: choose readdir frag based on previous readdir reply rbd: exclusive map option rbd: return ResponseMessage result from rbd_handle_request_lock() rbd: kill rbd_is_lock_supported() rbd: support updating the lock cookie without releasing the lock rbd: store lock cookie rbd: ignore unlock errors rbd: fix error handling around rbd_init_disk() rbd: move rbd_unregister_watch() call into rbd_dev_image_release() rbd: move rbd_dev_destroy() call out of rbd_dev_image_release() ceph: when seeing write errors on an inode, switch to sync writes Revert "ceph: SetPageError() for writeback pages if writepages fails" ceph: handle epoch barriers in cap messages libceph: add an epoch_barrier field to struct ceph_osd_client libceph: abort already submitted but abortable requests when map or pool goes full libceph: allow requests to return immediately on full conditions if caller wishes libceph: remove req->r_replay_version ceph: make seeky readdir more efficient ...
2017-05-08fs: ceph: CURRENT_TIME with ktime_get_real_ts()Deepa Dinamani
CURRENT_TIME is not y2038 safe. The macro will be deleted and all the references to it will be replaced by ktime_get_* apis. struct timespec is also not y2038 safe. Retain timespec for timestamp representation here as ceph uses it internally everywhere. These references will be changed to use struct timespec64 in a separate patch. The current_fs_time() api is being changed to use vfs struct inode* as an argument instead of struct super_block*. Set the new mds client request r_stamp field using ktime_get_real_ts() instead of using current_fs_time(). Also, since r_stamp is used as mtime on the server, use timespec_trunc() to truncate the timestamp, using the right granularity from the superblock. This api will be transitioned to be y2038 safe along with vfs. Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> M: Ilya Dryomov <idryomov@gmail.com> M: "Yan, Zheng" <zyan@redhat.com> M: Sage Weil <sage@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-04rbd: exclusive map optionIlya Dryomov
Support disabling automatic exclusive lock transfers to allow users to be in charge of which node should own the lock while being able to reuse exclusive lock's built-in blacklist/break-lock functionality. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: return ResponseMessage result from rbd_handle_request_lock()Ilya Dryomov
Right now it's just 0, but "no automatic exclusive lock transfers" mode code will need -EROFS. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: kill rbd_is_lock_supported()Ilya Dryomov
Currently the exclusive lock is acquired only if the mapping is writable, i.e. an image HEAD mapped in rw mode. This means that we don't acquire the lock for executing a read from a snapshot or an image HEAD mapped in ro mode, even if lock_on_read is set. This is somewhat weird and inconsistent with "no automatic exclusive lock transfers" mode, where the lock is acquired unconditionally. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: support updating the lock cookie without releasing the lockIlya Dryomov
As we no longer release the lock before potentially raising BLACKLISTED in rbd_reregister_watch(), the "either locked or blacklisted" assert in rbd_queue_workfn() needs to go: we can be both locked and blacklisted at that point now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: store lock cookieIlya Dryomov
In preparation for supporting set_cookie method (or rather set_cookie fallback for older OSDs), store the lock cookie on lock and use it on unlock instead of recalculating from rbd_dev->watch_cookie. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: ignore unlock errorsIlya Dryomov
Currently the lock_state is set to UNLOCKED (preventing further I/O), but RELEASED_LOCK notification isn't sent. Be consistent with userspace and treat ceph_cls_unlock() errors as the image is unlocked. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: fix error handling around rbd_init_disk()Ilya Dryomov
add_disk() takes an extra reference on disk->queue, which is put in put_disk() -> disk_release(). Avoiding blk_cleanup_queue() (which also puts the queue) until add_disk() sets GENHD_FL_UP works for the queue itself, but leaks various queue internals. Conditioning tag_set freeing on GENHD_FL_UP is wrong too: all error paths after rbd_init_disk() leak the tag_set. Move the final "announce" steps out of rbd_dev_device_setup() so that it can be unwound like any other function. Leave "announce" steps to do_rbd_add/remove(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: move rbd_unregister_watch() call into rbd_dev_image_release()Ilya Dryomov
rbd_dev->disk tear down vs rbd_watch_cb() race shouldn't be a problem anymore thanks to EXISTS and REMOVING checks in rbd_dev_update_size(). A similar race could occur on "rbd map", see commit 811c66887746 ("rbd: fix rbd map vs notify races"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()Ilya Dryomov
... to simplify error handling in do_rbd_add(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04libceph, ceph: always advertise all supported featuresIlya Dryomov
No reason to hide CephFS-specific features in the rbd case. Recent feature bits mix RADOS and CephFS-specific stuff together anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-02blk-mq: update ->init_request and ->exit_request prototypesChristoph Hellwig
Remove the request_idx parameter, which can't be used safely now that we support I/O schedulers with blk-mq. Except for a superflous check in mtip32xx it was unused anyway. Also pass the tag_set instead of just the driver data - this allows drivers to avoid some code duplication in a follow on cleanup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08rbd: remove the discard_zeroes_data flagChristoph Hellwig
rbd only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-31blk-mq: constify struct blk_mq_opsEric Biggers
Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-07rbd: supported_features bus attributeIlya Dryomov
... so that userspace can generate meaningful error messages and spell out unsupported features that need to be disabled. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-28Merge tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "This time around we have: - support for rbd data-pool feature, which enables rbd images on erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to allow erasure-coded profiles with k+m up to 32. - a patch for ceph_d_revalidate() performance regression introduced in 4.9, along with some cleanups in the area (Jeff Layton) - a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff Layton) - buffered reads are now processed in rsize windows instead of rasize windows (Andreas Gerstmayr). The new default for rsize mount option is 64M. - ack vs commit distinction is gone, greatly simplifying ->fsync() and MOSDOpReply handling code (myself) ... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH computations are still serialized though) and several minor fixes and cleanups all over" * tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits) libceph, rbd, ceph: WRITE | ONDISK -> WRITE libceph: get rid of ack vs commit ceph: remove special ack vs commit behavior ceph: tidy some white space in get_nonsnap_parent() crush: fix dprintk compilation crush: do is_out test only if we do not collide ceph: remove req from unsafe list when unregistering it rbd: constify device_type structure rbd: kill obj_request->object_name and rbd_segment_name_cache rbd: store and use obj_request->object_no rbd: RBD_V{1,2}_DATA_FORMAT macros rbd: factor out __rbd_osd_req_create() rbd: set offset and length outside of rbd_obj_request_create() rbd: support for data-pool feature rbd: introduce rbd_init_layout() rbd: use rbd_obj_bytes() more rbd: remove now unused rbd_obj_request_wait() and helpers rbd: switch rbd_obj_method_sync() to ceph_osdc_call() libceph: pass reply buffer length through ceph_osdc_call() rbd: do away with obj_request in rbd_obj_read_sync() ...
2017-02-24libceph, rbd, ceph: WRITE | ONDISK -> WRITEIlya Dryomov
CEPH_OSD_FLAG_ONDISK is set in account_request(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2017-02-21Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: - blk-mq scheduling framework from me and Omar, with a port of the deadline scheduler for this framework. A port of BFQ from Paolo is in the works, and should be ready for 4.12. - Various fixups and improvements to the above scheduling framework from Omar, Paolo, Bart, me, others. - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar. This allows us to export more information that helps debug hangs or performance issues, without cluttering or abusing the sysfs API. - Fixes for the sbitmap code, the scalable bitmap code that was migrated from blk-mq, from Omar. - Removal of the BLOCK_PC support in struct request, and refactoring of carrying SCSI payloads in the block layer. This cleans up the code nicely, and enables us to kill the SCSI specific parts of struct request, shrinking it down nicely. From Christoph mainly, with help from Hannes. - Support for ranged discard requests and discard merging, also from Christoph. - Support for OPAL in the block layer, and for NVMe as well. Mainly from Scott Bauer, with fixes/updates from various others folks. - Error code fixup for gdrom from Christophe. - cciss pci irq allocation cleanup from Christoph. - Making the cdrom device operations read only, from Kees Cook. - Fixes for duplicate bdi registrations and bdi/queue life time problems from Jan and Dan. - Set of fixes and updates for lightnvm, from Matias and Javier. - A few fixes for nbd from Josef, using idr to name devices and a workqueue deadlock fix on receive. Also marks Josef as the current maintainer of nbd. - Fix from Josef, overwriting queue settings when the number of hardware queues is updated for a blk-mq device. - NVMe fix from Keith, ensuring that we don't repeatedly mark and IO aborted, if we didn't end up aborting it. - SG gap merging fix from Ming Lei for block. - Loop fix also from Ming, fixing a race and crash between setting loop status and IO. - Two block race fixes from Tahsin, fixing request list iteration and fixing a race between device registration and udev device add notifiations. - Double free fix from cgroup writeback, from Tejun. - Another double free fix in blkcg, from Hou Tao. - Partition overflow fix for EFI from Alden Tondettar. * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits) nvme: Check for Security send/recv support before issuing commands. block/sed-opal: allocate struct opal_dev dynamically block/sed-opal: tone down not supported warnings block: don't defer flushes on blk-mq + scheduling blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers blk-mq: don't special case flush inserts for blk-mq-sched blk-mq-sched: don't add flushes to the head of requeue queue blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not block: do not allow updates through sysfs until registration completes lightnvm: set default lun range when no luns are specified lightnvm: fix off-by-one error on target initialization Maintainers: Modify SED list from nvme to block Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN uapi: sed-opal fix IOW for activate lsp to use correct struct cdrom: Make device operations read-only elevator: fix loading wrong elevator type for blk-mq devices cciss: switch to pci_irq_alloc_vectors block/loop: fix race between I/O and set_status blk-mq-sched: don't hold queue_lock when calling exit_icq block: set make_request_fn manually in blk_mq_update_nr_hw_queues ...
2017-02-20rbd: constify device_type structureBhumika Goyal
Declare device_type structure as const as it is only stored in the type field of a device structure. This field is of type const, so add const to the declaration of device_type structure. File size before: text data bss dec hex filename 61546 11610 208 73364 11e94 drivers/block/rbd.o File size after: text data bss dec hex filename 61610 11578 208 73396 11eb4 drivers/block/rbd.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-02-20rbd: kill obj_request->object_name and rbd_segment_name_cacheIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: store and use obj_request->object_noIlya Dryomov
object_no can be trivially formatted into an object name. We already store object names in OSD requests with special care to avoid dynamic allocations for short names. Storing a name in obj_request, obtained as below (!), is a waste and will be removed in the next commit. name = kmem_cache_alloc(rbd_segment_name_cache, ...); snprintf(name, ...); obj_request->object_name = kstrdup(name); kmem_cache_free(rbd_segment_name_cache, name); ... ceph_oid_aprintf(..., "%s", obj_request->object_name); Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: RBD_V{1,2}_DATA_FORMAT macrosIlya Dryomov
... and also fix up the comment -- format 1 data objects have always been 12 hex digits long. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: factor out __rbd_osd_req_create()Ilya Dryomov
Factor OSD request allocation and initialization code out into __rbd_osd_req_create(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: set offset and length outside of rbd_obj_request_create()Ilya Dryomov
The allocation doesn't depend on offset and length. Both offset and length can be changed after obj_request is allocated, too. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: support for data-pool featureIlya Dryomov
Add support for RBD_FEATURE_DATA_POOL feature. rbd_dev->layout.pool_id now stores the data pool id. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: introduce rbd_init_layout()Ilya Dryomov
Rather than initializing layout fields with some made up values in __rbd_dev_create(), move the initialization into rbd_init_layout() and call it after the header is actually populated. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: use rbd_obj_bytes() moreIlya Dryomov
Returning u64 doesn't make sense: max header->obj_order is 25 and ceph_file_layout::object_size is u32. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: remove now unused rbd_obj_request_wait() and helpersIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: switch rbd_obj_method_sync() to ceph_osdc_call()Ilya Dryomov
As explained in the previous commit, rbd_obj_request machinery (and rbd_osd_req_create() in particular) shouldn't be used for working with metadata objects. Switch to the recently added ceph_osdc_call(). It assumes single pages for outbound and inbound buffers, but that's OK - none of the callers need more than that. These pages need to be allocated (messenger is in dire need of proper iterator interface!), but we are swapping for pages[] and pagelist allocations in the existing code. Kill class_name argument - all rbd methods are under "rbd". Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: do away with obj_request in rbd_obj_read_sync()Ilya Dryomov
rbd_obj_request machinery is completely unnecessary here; all that's being done is fetching a metadata object - no striping, cloning, etc. More importantly, rbd_osd_req_create() grabs pool id from layout and that is becoming a data pool id. Kill offset argument - all metadata objects are small and read in full. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: initialize rbd_dev->header_oloc earlyIlya Dryomov
No reason to delay it until image_id is known. This will be required by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is changed to take oloc. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: kill rbd_image_header::{crypt_type,comp_type}Ilya Dryomov
Image format 1 is deprecated and format 2 doesn't have these. Also, __rbd_dev_create() takes care of zeroing (or otherwise initializing) format 2 specific fields. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-20rbd: use kstrndup() in rbd_header_from_disk()Ilya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-02-02block: Use pointer to backing_dev_info from request_queueJan Kara
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31block: fold cmd_type into the REQ_OP_ spaceChristoph Hellwig
Instead of keeping two levels of indirection for requests types, fold it all into the operations. The little caveat here is that previously cmd_type only applied to struct request, while the request and bio op fields were set to plain REQ_OP_READ/WRITE even for passthrough operations. Instead this patch adds new REQ_OP_* for SCSI passthrough and driver private requests, althought it has to add two for each so that we can communicate the data in/out nature of the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-14locking/atomic, kref: Add kref_read()Peter Zijlstra
Since we need to change the implementation, stop exposing internals. Provide kref_read() to read the current reference count; typically used for debug messages. Kills two anti-patterns: atomic_read(&kref->refcount) kref->refcount.counter Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-12rbd: silence bogus -Wmaybe-uninitialized warningIlya Dryomov
drivers/block/rbd.c: In function ‘rbd_watch_cb’: drivers/block/rbd.c:3690:5: error: ‘struct_v’ may be used uninitialized in this function [-Werror=maybe-uninitialized] drivers/block/rbd.c:3759:5: note: ‘struct_v’ was declared here Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-15rbd: don't retry watch reregistration if header object is goneIlya Dryomov
If the header object gets deleted (perhaps along with the entire pool), there is no point in attempting to reregister the watch. Treat this the same as blacklisting: fail all pending and new I/Os requiring the lock. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-15rbd: don't wait for the lock forever if blacklistedIlya Dryomov
-EBLACKLISTED from __rbd_register_watch() means that our ceph_client got blacklisted - we won't be able to restore the watch and reacquire the lock. Wake up and fail all outstanding requests waiting for the lock and arrange for all new requests that require the lock to fail immediately. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Mike Christie <mchristi@redhat.com>
2016-10-10Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull Ceph updates from Ilya Dryomov: "The big ticket item here is support for rbd exclusive-lock feature, with maintenance operations offloaded to userspace (Douglas Fuller, Mike Christie and myself). Another block device bullet is a series fixing up layering error paths (myself). On the filesystem side, we've got patches that improve our handling of buffered vs dio write races (Neil Brown) and a few assorted fixes from Zheng. Also included a couple of random cleanups and a minor CRUSH update" * tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits) crush: remove redundant local variable crush: don't normalize input of crush_ln iteratively libceph: ceph_build_auth() doesn't need ceph_auth_build_hello() libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello() ceph: fix description for rsize and rasize mount options rbd: use kmalloc_array() in rbd_header_from_disk() ceph: use list_move instead of list_del/list_add ceph: handle CEPH_SESSION_REJECT message ceph: avoid accessing / when mounting a subpath ceph: fix mandatory flock check ceph: remove warning when ceph_releasepage() is called on dirty page ceph: ignore error from invalidate_inode_pages2_range() in direct write ceph: fix error handling of start_read() rbd: add rbd_obj_request_error() helper rbd: img_data requests don't own their page array rbd: don't call rbd_osd_req_format_read() for !img_data requests rbd: rework rbd_img_obj_exists_submit() error paths rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback() rbd: move bumping img_request refcount into rbd_obj_request_submit() rbd: mark the original request as done if stat request fails ...
2016-10-03rbd: use kmalloc_array() in rbd_header_from_disk()Markus Elfring
* A multiplication for the size determination of a memory allocation indicated that an array data structure should be processed. Thus use the corresponding function "kmalloc_array". This issue was detected by using the Coccinelle software. * Delete the local variable "size" which became unnecessary with this refactoring. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03rbd: add rbd_obj_request_error() helperIlya Dryomov
Pull setting an error and marking a request done code into a new helper. obj_request_img_data_test() check isn't strictly needed right now, but makes it applicable to !img_data requests and a bit safer. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-10-03rbd: img_data requests don't own their page arrayIlya Dryomov
Move the check into rbd_obj_request_destroy() to avoid use-after-free on errors in rbd_img_request_fill(..., OBJ_REQUEST_PAGES, ...), where pages, owned by the caller, gets freed in rbd_img_request_fill(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
2016-10-03rbd: don't call rbd_osd_req_format_read() for !img_data requestsIlya Dryomov
Accessing obj_request->img_request union field is only valid for object requests associated with an image (i.e. if obj_request_img_data_test() returns true). rbd_osd_req_format_read() used to do more, but now it just sets osd_req->snap_id. Standalone and stat object requests always go to the HEAD revision and are fine with CEPH_NOSNAP set by libceph, so get around the invalid union field use by simply not calling rbd_osd_req_format_read() in those places. Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
2016-10-03rbd: rework rbd_img_obj_exists_submit() error pathsIlya Dryomov
- don't put obj_request before rbd_obj_request_get() if rbd_obj_request_create() fails - don't leak pages if rbd_obj_request_create() fails - don't leak stat_request if rbd_osd_req_create() fails Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
2016-10-03rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()Ilya Dryomov
- fix parent_length == img_request->xferred assert to not fire on copyup read failures - don't leak pages if copyup read fails or we can't allocate a new osd request Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>
2016-10-03rbd: move bumping img_request refcount into rbd_obj_request_submit()Ilya Dryomov
Commit 0f2d5be792b0 ("rbd: use reference counts for image requests") added rbd_img_request_get(), which rbd_img_request_fill() calls for each obj_request added to img_request. It was an urgent band-aid for the uglyness that is rbd_img_obj_callback() and none of the error paths were updated. Given that this img_request reference is meant to represent an obj_request that hasn't passed through rbd_img_obj_callback() yet, proper cleanup in appropriate destructors is a challenge. However, noting that if we don't get a chance to call rbd_obj_request_complete(), there is not going to be a call to rbd_img_obj_callback(), we can move rbd_img_request_get() into rbd_obj_request_submit() and fixup the two places that call rbd_obj_request_complete() directly and not through rbd_obj_request_submit() to temporarily bump img_request, so that rbd_img_obj_callback() can put as usual. This takes care of img_request leaks on errors on the submit side. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-10-03rbd: mark the original request as done if stat request failsIlya Dryomov
If stat request fails with something other than -ENOENT (which just means that we need to copyup), the original object request is never marked as done and therefore never completed. Fix this by moving the mark done + complete snippet from rbd_img_obj_parent_read_full() into rbd_img_obj_exists_callback(). The former remains covered, as the latter is its only caller (through rbd_img_obj_request_submit()). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: David Disseldorp <ddiss@suse.de>