2017-05-04 dm cache policy smq: allow demotions to happen even during continuous IO (Joe Thornber)
dm-cache's smq policy tries hard to do its work during the idle periods when there is no IO. But if there are no idle periods (e.g., a long fio run) we still need to allow some demotions and promotions to occur. To achieve this, pass @idle=true to queue_promotion()'s free_target_met() call so that free_target_met() doesn't short-circuit the possibility of demotion simply because it isn't an idle period. Fixes: b29d4986d0 ("dm cache: significant rework to leverage dm-bio-prison-v2") Reported-by: John Harrigan <jharriga@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
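A minimal userspace sketch of the idea described above, not the dm-cache code itself. The struct, its fields, and the `toy_` function names are hypothetical; only the `idle` parameter and the short-circuit behaviour come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the policy state. */
struct toy_policy {
	unsigned nr_free;      /* cache blocks currently free */
	unsigned nr_queued;    /* demotions already queued */
	unsigned free_target;  /* free blocks we would like to have */
};

/*
 * Returns true when no further demotions are needed.
 * When the caller says the device is not idle, the target is reported
 * as met, so no demotion is considered.
 */
static bool toy_free_target_met(const struct toy_policy *p, bool idle)
{
	if (!idle)
		return true;
	return p->nr_free + p->nr_queued >= p->free_target;
}

/*
 * Promotion path: passing idle=true forces the real free-space check,
 * so a demotion can still be requested while IO is continuous.
 */
static bool toy_promotion_needs_demotion(const struct toy_policy *p)
{
	return !toy_free_target_met(p, true);
}
```

With a policy that has no free blocks, `toy_free_target_met(&p, false)` reports the target as met, while the promotion path still sees that a demotion is needed.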
2017-05-04 mq-deadline: add debugfs attributes (Omar Sandoval)
Expose the fifo lists, cached next requests, batching state, and dispatch list. It'd also be possible to add the sorted lists, but there aren't already seq_file helpers for rbtrees. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 kyber: add debugfs attributes (Omar Sandoval)
Expose the domain token pools, asynchronous sbitmap depth, domain request lists, and batching state. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq-debugfs: allow schedulers to register debugfs attributes (Omar Sandoval)
This provides the infrastructure for schedulers to expose their internal state through debugfs. We add a list of queue attributes and a list of hctx attributes to struct elevator_type and wire them up when switching schedulers. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Add missing seq_file.h header in blk-mq-debugfs.h Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq: untangle debugfs and sysfs (Omar Sandoval)
Originally, I tied debugfs registration/unregistration together with sysfs. There's no reason to do this, and it's getting in the way of letting schedulers define their own debugfs attributes. Instead, tie the debugfs registration to the lifetime of the structures themselves. The saner lifetimes mean we can also get rid of the extra mq directory and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now just nvme0n1/hctx0/tags. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq: move debugfs declarations to a separate header file (Omar Sandoval)
Preparation for adding more declarations. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq: Do not invoke queue operations on a dead queue (Bart Van Assche)
In commit e869b5462f83 ("blk-mq: Unregister debugfs attributes earlier"), we shuffled the debugfs cleanup around so that the "state" attribute was removed before we freed the blk-mq data structures. However, later changes are going to undo that, so we need to explicitly disallow running a dead queue. [Omar: rebased and updated commit message] Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq-debugfs: get rid of a bunch of boilerplate (Omar Sandoval)
A large part of blk-mq-debugfs.c is file_operations and seq_file boilerplate. This sucks as is but will suck even more when schedulers can define their own debugfs entries. Factor it all out into a single blk_mq_debugfs_fops which multiplexes as needed. We store the request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory dentry, which is kind of hacky, but it works. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq-debugfs: rename hw queue directories from <n> to hctx<n> (Omar Sandoval)
It's not clear what these numbered directories represent unless you consult the code. We're about to get rid of the intermediate "mq" directory, so these would be even more confusing without that context. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq-debugfs: don't open code strstrip() (Omar Sandoval)
Slightly more readable, plus we also strip leading spaces. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq-debugfs: error on long write to queue "state" file (Omar Sandoval)
blk_queue_flags_store() currently truncates and returns a short write if the operation being written is too long. This can give us weird results, like here:

    $ echo "run bar"
    echo: write error: invalid argument
    $ dmesg
    [ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start'

Instead, return an error if the user does this. While we're here, make the argument names consistent with everywhere else in this file.
Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
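The behaviour described above can be modelled in userspace roughly as follows. This is a sketch under assumptions, not the kernel function: `toy_state_store`, the 16-byte buffer, and the manual whitespace stripping (standing in for the kernel's strstrip()) are all illustrative.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical userspace model of a debugfs "state" write handler. */
static ssize_t toy_state_store(const char *buf, size_t count)
{
	char opbuf[16], *op, *end;

	/* Reject over-long input instead of truncating it into a short write. */
	if (count >= sizeof(opbuf))
		return -EINVAL;

	memcpy(opbuf, buf, count);
	opbuf[count] = '\0';

	/* Strip leading and trailing whitespace. */
	op = opbuf;
	while (isspace((unsigned char)*op))
		op++;
	end = op + strlen(op);
	while (end > op && isspace((unsigned char)end[-1]))
		*--end = '\0';

	/* Only the two operations named in the error message are accepted. */
	if (strcmp(op, "run") != 0 && strcmp(op, "start") != 0)
		return -EINVAL;

	return (ssize_t)count;  /* the whole write was consumed */
}
```

Returning the full `count` on success and an error otherwise means the caller never sees a partial write that it would retry with the leftover bytes.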
2017-05-04 blk-mq-debugfs: clean up flag definitions (Omar Sandoval)
Make sure the spelled out flag names match the definition. This also adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing cmd_flag, __REQ_NOUNMAP. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 blk-mq-debugfs: separate flags with | (Omar Sandoval)
This reads more naturally than spaces. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
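As an illustration of the output format, here is a small sketch that prints the names of set bits separated by `|`. The name table and `toy_flags_show` are hypothetical and do not reproduce the kernel's flag tables or its seq_file code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag names, indexed by bit number. */
static const char *const toy_flag_name[] = {
	"STOPPED", "TAG_ACTIVE", "SCHED_RESTART",
};

/*
 * Write the names of all set bits into out, separated by '|'.
 * outlen must be at least 1. Output is truncated if out is too small.
 */
static void toy_flags_show(unsigned long flags, char *out, size_t outlen)
{
	size_t n = sizeof(toy_flag_name) / sizeof(toy_flag_name[0]);
	size_t used = 0;
	int first = 1;

	out[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		if (!(flags & (1UL << i)))
			continue;
		int w = snprintf(out + used, outlen - used, "%s%s",
				 first ? "" : "|", toy_flag_name[i]);
		if (w < 0 || (size_t)w >= outlen - used)
			return;  /* no room left: stop here */
		used += (size_t)w;
		first = 0;
	}
}
```

The separator is emitted before every name except the first, so there is no trailing `|` to trim.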
2017-05-04 nfs: Fix bdi handling for cloned superblocks (Jan Kara)
In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have wrongly cloned bdi reference in nfs_clone_super(). Further inspection has shown that originally the code was actually allocating a new bdi (in ->clone_server callback) which was later registered in nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb(). This could later result in bdi for the original superblock not getting unregistered when that superblock got shutdown (as the cloned sb still held bdi reference) and later when a new superblock was created under the same anonymous device number, a clash in sysfs has happened on bdi registration: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74 sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32' Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14 Hardware name: Allwinner sun7i (A20) Family [<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14) [<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c) [<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100) [<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48) [<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74) [<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94) [<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec) [<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98) [<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0) [<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc) [<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28) [<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc) [<c02075c8>] (bdi_register_va) from [<c023d378>] 
(super_setup_bdi_name+0x48/0xa4) [<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204) [<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8) [<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58) [<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4) [<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128) [<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0) [<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c) [<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4) [<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4) [<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128) [<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c) [<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4) [<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c) Fix the problem by always creating new bdi for a superblock as we used to do. Reported-and-tested-by: Corentin Labbe <clabbe.montjoie@gmail.com> Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 block/mq: Cure cpu hotplug lock inversion (Peter Zijlstra)
By poking at /debug/sched_features I triggered the following splat: [] ====================================================== [] WARNING: possible circular locking dependency detected [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted [] ------------------------------------------------------ [] bash/2109 is trying to acquire lock: [] (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50 [] [] but task is already holding lock: [] (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170 [] [] which lock already depends on the new lock. [] [] [] the existing dependency chain (in reverse order) is: [] [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}: [] lock_acquire+0x100/0x210 [] down_write+0x28/0x60 [] start_creating+0x5e/0xf0 [] debugfs_create_dir+0x13/0x110 [] blk_mq_debugfs_register+0x21/0x70 [] blk_mq_register_dev+0x64/0xd0 [] blk_register_queue+0x6a/0x170 [] device_add_disk+0x22d/0x440 [] loop_add+0x1f3/0x280 [] loop_init+0x104/0x142 [] do_one_initcall+0x43/0x180 [] kernel_init_freeable+0x1de/0x266 [] kernel_init+0xe/0x100 [] ret_from_fork+0x31/0x40 [] [] -> #1 (all_q_mutex){+.+.+.}: [] lock_acquire+0x100/0x210 [] __mutex_lock+0x6c/0x960 [] mutex_lock_nested+0x1b/0x20 [] blk_mq_init_allocated_queue+0x37c/0x4e0 [] blk_mq_init_queue+0x3a/0x60 [] loop_add+0xe5/0x280 [] loop_init+0x104/0x142 [] do_one_initcall+0x43/0x180 [] kernel_init_freeable+0x1de/0x266 [] kernel_init+0xe/0x100 [] ret_from_fork+0x31/0x40 [] *** DEADLOCK *** [] [] 3 locks held by bash/2109: [] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0 [] #1: (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0 [] #2: (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170 [] [] stack backtrace: [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694 [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 
12/23/2013 [] Call Trace: [] lock_acquire+0x100/0x210 [] get_online_cpus+0x2a/0x90 [] static_key_slow_dec+0x1b/0x50 [] static_key_disable+0x20/0x30 [] sched_feat_write+0x131/0x170 [] full_proxy_write+0x97/0xd0 [] __vfs_write+0x28/0x120 [] vfs_write+0xb5/0x1a0 [] SyS_write+0x49/0xa0 [] entry_SYSCALL_64_fastpath+0x23/0xc2 This is because of the cpu hotplug lock rework. Break the chain at #1 by reversing the lock acquisition order. This way i_mutex_key#4 no longer depends on cpu_hotplug_lock and things are good. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 lightnvm: fix bad back free on error path (Javier González)
Free memory correctly when an allocation fails in a loop and we walk backwards to free the previously successful allocations. Signed-off-by: Javier González <javier@cnexlabs.com> Reviewed-by: Matias Bjørling <matias@cnexlabs.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
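The general pattern (allocate in a loop, unwind backwards on failure) can be sketched in plain C as below. This is not the lightnvm code; `toy_alloc_all` and its `fail_at` test hook are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate n buffers of size sz into bufs[].
 * fail_at is a hypothetical test hook: the allocation at that index is
 * made to fail (pass -1 for no forced failure).
 * Returns 0 on success, -1 on failure with nothing left allocated.
 */
static int toy_alloc_all(void **bufs, int n, size_t sz, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		bufs[i] = (i == fail_at) ? NULL : malloc(sz);
		if (!bufs[i])
			goto unwind;
	}
	return 0;

unwind:
	/* Index i was never allocated, so free only i-1 down to 0. */
	while (--i >= 0) {
		free(bufs[i]);
		bufs[i] = NULL;
	}
	return -1;
}
```

The easy mistake in this kind of loop is an off-by-one in the unwind: freeing the failed slot itself, or missing index 0.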
2017-05-04 lightnvm: create cmd before allocating request (Javier González)
Create nvme command before allocating a request using nvme_alloc_request, which uses the command direction. Up until now, the command has been zeroized, so all commands have been allocated as a read operation. Signed-off-by: Javier González <javier@cnexlabs.com> Reviewed-by: Matias Bjørling <matias@cnexlabs.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 tools build: Fixup sched_getcpu feature test (Arnaldo Carvalho de Melo)
We have tools/build/feature/test-all.c to speed up feature testing, doing all tests at once, but then all tests in this file should normally pass. That is not the case with the sched-getcpu one, which wasn't passing when included from test-all.c because it needs to have _GNU_SOURCE defined before including sched.h, but _GNU_SOURCE is defined by a header included from another feature test included earlier in test-all.c, test-libpython.c, resulting in:

    $ cat /tmp/build/perf/feature/test-all.make.output
    In file included from test-all.c:121:0:
    test-sched_getcpu.c:1:0: error: "_GNU_SOURCE" redefined [-Werror]
     #define _GNU_SOURCE
    In file included from /usr/include/python2.7/pyconfig.h:6:0,
                     from /usr/include/python2.7/Python.h:8,
                     from test-libpython.c:1,
                     from test-all.c:13:
    /usr/include/python2.7/pyconfig-64.h:1177:0: note: this is the location of the previous definition
     #define _GNU_SOURCE 1
    cc1: all warnings being treated as errors

Which would trigger testing the tests individually, when that _GNU_SOURCE redefinition would not take place, and the whole process would continue, just slower... Fix it.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Fixes: 120010cb1eea ("tools build: Add test for sched_getcpu()") Link: http://lkml.kernel.org/n/tip-3qp1it69xsc4w8gnuu1e9ayh@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-05-04 KVM: put back #ifndef CONFIG_S390 around kvm_vcpu_kick (Paolo Bonzini)
The #ifndef was removed in 75aaafb79f73516b69d5639ad30a72d72e75c8b4, but it was also protecting smp_send_reschedule() in kvm_vcpu_kick(). Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-05-04 perf tests kmod-path: Don't fail if compressed modules aren't supported (Kim Phillips)
__kmod_path__parse() uses is_supported_compression() to determine and parse out compressed module file extensions. On systems without zlib, this test fails and __kmod_path__parse() continues to strcmp "ko" with "gz". Don't do this on those systems. Signed-off-by: Kim Phillips <kim.phillips@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 3c8a67f50a1e ("perf tools: Add kmod_path__parse function") Link: http://lkml.kernel.org/r/20170503131402.c66e314460026c80cd787b34@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-05-04 perf annotate: Fix AArch64 comment char (Kim Phillips)
The commit 0fcb1da4aba "perf annotate: AArch64 support" blindly copied the comment character from the original: https://lkml.org/lkml/2016/5/19/461 whereas that same commit shows objdump output utilizing the C++ style "//" as the comment delimiter. Since '/' doesn't occur elsewhere in objdump output, we retain the single character check, but fix it to be '/'. Signed-off-by: Kim Phillips <kim.phillips@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Chris Riyder <chris.ryder@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 0fcb1da4aba6 ("perf annotate: AArch64 support") Link: http://lkml.kernel.org/r/20170503131356.be88f977094fb3fa0f49b99d@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
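A single-character check of this kind might look like the sketch below. `toy_find_comment` is a hypothetical helper, not perf's implementation, and it relies on the assumption stated in the message that '/' appears nowhere else in the disassembly line.

```c
#include <assert.h>
#include <string.h>

/*
 * Return a pointer to the start of the comment in a disassembly line,
 * or NULL if the line has no comment. Searching for a single '/' is
 * enough only while '/' cannot occur elsewhere in the line.
 */
static const char *toy_find_comment(const char *line)
{
	return strchr(line, '/');
}
```

The sample lines used to exercise this helper are made up for illustration and are not taken from real objdump output.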
2017-05-04 perf tools: Fix spelling mistakes (Kim Phillips)
Mostly in the documentation. Signed-off-by: Kim Phillips <kim.phillips@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170503131350.cebeecd8bd0f2968417626ab@arm.com [ Fix spelling of "parameter" in one of the spell-checked lines ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-05-04 rtc: ds1374: wdt: Fix stop/start ioctl always returning -EINVAL (Moritz Fischer)
The WDIOC_SETOPTIONS case in the watchdog ioctl would always fall through to the -EINVAL case. This is wrong since the watchdog does actually get stopped or started correctly. Fixes: 920f91e50c5b ("drivers/rtc/rtc-ds1374.c: add watchdog support") Signed-off-by: Moritz Fischer <mdf@kernel.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
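The shape of the bug and its fix can be sketched in userspace as follows. All names and option values here are hypothetical stand-ins, and the driver's real ioctl handler does more than flip a flag.

```c
#include <assert.h>
#include <errno.h>

#define TOY_DISABLE 0x1  /* hypothetical option bits */
#define TOY_ENABLE  0x2

enum toy_cmd { TOY_SETOPTIONS, TOY_OTHER };

static int toy_wdt_running;

static int toy_wdt_ioctl(enum toy_cmd cmd, int options)
{
	switch (cmd) {
	case TOY_SETOPTIONS:
		if (options & TOY_DISABLE) {
			toy_wdt_running = 0;
			return 0;  /* without this return, control reaches -EINVAL */
		}
		if (options & TOY_ENABLE) {
			toy_wdt_running = 1;
			return 0;
		}
		return -EINVAL;  /* reached only when no known option bit is set */
	default:
		return -ENOTTY;
	}
}
```

In the buggy shape the state change happens but the caller is still told the request was invalid, which matches the symptom in the commit message.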
2017-05-04 rtc: ds1374: wdt: Fix issue with timeout scaling from secs to wdt ticks (Moritz Fischer)
The issue is that the internal counter that triggers the watchdog reset is actually running at 4096 Hz instead of 1 Hz, therefore the value given by userland (in seconds) needs to be multiplied by 4096 to get the correct behavior. Fixes: 920f91e50c5b ("drivers/rtc/rtc-ds1374.c: add watchdog support") Signed-off-by: Moritz Fischer <mdf@kernel.org> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
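The scaling itself is a single multiplication. The sketch below uses the 4096 Hz rate from the commit message; the function and macro names are hypothetical, and range checking against the hardware counter width is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WDT_HZ 4096u  /* counter rate stated in the commit message */

/*
 * Convert a timeout in seconds to watchdog counter ticks.
 * The caller must keep secs small enough that the product fits.
 */
static uint32_t toy_secs_to_ticks(uint32_t secs)
{
	return secs * TOY_WDT_HZ;
}
```

For example, a 32 second timeout corresponds to 32 * 4096 = 131072 ticks; loading the raw value 32 instead would expire after about 7.8 ms.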
2017-05-04 KVM: arm/arm64: Move shared files to virt/kvm/arm (Christoffer Dall)
For some time now we have been having a lot of shared functionality between the arm and arm64 KVM support in arch/arm, which not only required a horrible inter-arch reference from the Makefile in arch/arm64/kvm, but also created confusion for newcomers to the code base, as was recently seen on the mailing list. Further, it causes confusion for things like cscope, which needs special attention to index specific shared files for arm64 from the arm tree. Move the shared files into virt/kvm/arm and move the trace points along with it. When moving the tracepoints we have to modify the way the vgic creates definitions of the trace points, so we take the chance to include the VGIC tracepoints in its very own special vgic trace.h file. Signed-off-by: Christoffer Dall <cdall@linaro.org>
2017-05-04 Merge branch 'topic/pl330' into for-linus (Vinod Koul)
2017-05-04 Merge branch 'topic/xilinx' into for-linus (Vinod Koul)
2017-05-04 Merge branch 'topic/qcom' into for-linus (Vinod Koul)
2017-05-04 Merge branch 'topic/pl08x' into for-linus (Vinod Koul)
2017-05-04 dmaengine: pl08x: remove lock documentation (Vinod Koul)
The lock variable in pl08x_dma_chan_state no longer exists, so remove it. Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2017-05-04 dmaengine: pl08x: fix pl08x_dma_chan_state documentation (Vinod Koul)
Documentation for pl08x_dma_chan_state mentions it as a struct whereas it is an enum, so fix that. Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2017-05-04 dmaengine: pl08x: Use the BIT() macro consistently (Linus Walleij)
This makes the driver shift bits with BIT(), which is used in other places in the driver. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
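For readers unfamiliar with the helper, here is a userspace equivalent of BIT() with two made-up register bits. The `TOY_CTRL_*` names and bit positions are hypothetical and are not taken from the pl08x driver.

```c
#include <assert.h>

/* Userspace equivalent of the kernel's BIT() helper. */
#define BIT(nr) (1UL << (nr))

/* Hypothetical register bits, for illustration only. */
#define TOY_CTRL_ENABLE   BIT(0)
#define TOY_CTRL_IRQ_MASK BIT(15)

/* Compose a register value from named bits instead of open-coded shifts. */
static unsigned long toy_ctrl_value(void)
{
	return TOY_CTRL_ENABLE | TOY_CTRL_IRQ_MASK;
}
```

Writing `BIT(15)` rather than `(1 << 15)` keeps the definitions uniform and avoids signed-shift surprises at the top bit of an int.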
2017-05-04 dmaengine: pl080: Fix some missing kerneldoc (Linus Walleij)
Two elements of the physical channel description were missing. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2017-05-04 dmaengine: pl080: Cut some unused defines (Linus Walleij)
There is no in-kernel code using these indexed register defines, and their offsets are clearly defined right below. Cut them. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
2017-05-04 Merge branch 'topic/cppi' into for-linus (Vinod Koul)
2017-05-04 ceph: fix memory leak in __ceph_setxattr() (Luis Henriques)
The ceph_inode_xattr needs to be released when removing an xattr. Easily reproducible running the 'generic/020' test from xfstests or simply by doing: attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test While there, also fix the error path. Here's the kmemleak splat: unreferenced object 0xffff88001f86fbc0 (size 64): comm "attr", pid 244, jiffies 4294904246 (age 98.464s) hex dump (first 32 bytes): 40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff @........28..... 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ backtrace: [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0 [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0 [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820 [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40 [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60 [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0 [<ffffffff8111fdd1>] removexattr+0x41/0x60 [<ffffffff8111fe65>] path_removexattr+0x75/0xa0 [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10 [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94 [<ffffffffffffffff>] 0xffffffffffffffff Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 ceph: fix file open flags on ppc64 (Alexander Graf)
The file open flags (O_foo) are platform specific and should never go out to an interface that is not local to the system. Unfortunately these flags have leaked out onto the wire in the cephfs implementation. That led to bogus flags getting transmitted on ppc64. This patch converts the kernel view of flags to the ceph view of file open flags. Fixes: 124e68e74 ("ceph: file operations") Signed-off-by: Alexander Graf <agraf@suse.de> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
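A sketch of what such a conversion looks like in general: each local O_* flag is tested by name and mapped to a fixed wire value, so the platform's numeric encoding never leaves the machine. The `WIRE_O_*` constants and their values are hypothetical and are not the Ceph protocol's definitions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical fixed wire encodings; values are illustrative only. */
#define WIRE_O_RDONLY 0x000u
#define WIRE_O_WRONLY 0x001u
#define WIRE_O_RDWR   0x002u
#define WIRE_O_CREAT  0x040u
#define WIRE_O_EXCL   0x080u
#define WIRE_O_TRUNC  0x200u

/* Translate local open(2) flags into the platform-independent encoding. */
static uint32_t toy_flags_to_wire(int flags)
{
	uint32_t wire;

	switch (flags & O_ACCMODE) {
	case O_WRONLY:
		wire = WIRE_O_WRONLY;
		break;
	case O_RDWR:
		wire = WIRE_O_RDWR;
		break;
	default:
		wire = WIRE_O_RDONLY;
		break;
	}
	if (flags & O_CREAT)
		wire |= WIRE_O_CREAT;
	if (flags & O_EXCL)
		wire |= WIRE_O_EXCL;
	if (flags & O_TRUNC)
		wire |= WIRE_O_TRUNC;
	return wire;
}
```

Because the mapping goes through the symbolic names, it produces the same wire value on every architecture, whatever numbers the local headers assign to O_CREAT and friends.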
2017-05-04 ceph: choose readdir frag based on previous readdir reply (Yan, Zheng)
The dirfragtree is lazily updated, so it's not always accurate. An infinite loop happens in the following circumstance:

- client sends a request to read frag A
- frag A has been fragmented into frags B and C, so the mds fills the reply with the contents of frag B
- client wants to read the next frag, C, but ceph_choose_frag(frag value of C) returns frag A

The fix is to use the previous readdir reply to calculate the next readdir frag when possible.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 rbd: exclusive map option (Ilya Dryomov)
Support disabling automatic exclusive lock transfers to allow users to be in charge of which node should own the lock while being able to reuse exclusive lock's built-in blacklist/break-lock functionality. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: return ResponseMessage result from rbd_handle_request_lock() (Ilya Dryomov)
Right now it's just 0, but "no automatic exclusive lock transfers" mode code will need -EROFS. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: kill rbd_is_lock_supported() (Ilya Dryomov)
Currently the exclusive lock is acquired only if the mapping is writable, i.e. an image HEAD mapped in rw mode. This means that we don't acquire the lock for executing a read from a snapshot or an image HEAD mapped in ro mode, even if lock_on_read is set. This is somewhat weird and inconsistent with "no automatic exclusive lock transfers" mode, where the lock is acquired unconditionally. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: support updating the lock cookie without releasing the lock (Ilya Dryomov)
As we no longer release the lock before potentially raising BLACKLISTED in rbd_reregister_watch(), the "either locked or blacklisted" assert in rbd_queue_workfn() needs to go: we can be both locked and blacklisted at that point now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: store lock cookie (Ilya Dryomov)
In preparation for supporting set_cookie method (or rather set_cookie fallback for older OSDs), store the lock cookie on lock and use it on unlock instead of recalculating from rbd_dev->watch_cookie. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: ignore unlock errors (Ilya Dryomov)
Currently the lock_state is set to UNLOCKED (preventing further I/O), but the RELEASED_LOCK notification isn't sent. Be consistent with userspace and treat ceph_cls_unlock() errors as if the image were unlocked. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: fix error handling around rbd_init_disk() (Ilya Dryomov)
add_disk() takes an extra reference on disk->queue, which is put in put_disk() -> disk_release(). Avoiding blk_cleanup_queue() (which also puts the queue) until add_disk() sets GENHD_FL_UP works for the queue itself, but leaks various queue internals. Conditioning tag_set freeing on GENHD_FL_UP is wrong too: all error paths after rbd_init_disk() leak the tag_set. Move the final "announce" steps out of rbd_dev_device_setup() so that it can be unwound like any other function. Leave "announce" steps to do_rbd_add/remove(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: move rbd_unregister_watch() call into rbd_dev_image_release() (Ilya Dryomov)
rbd_dev->disk tear down vs rbd_watch_cb() race shouldn't be a problem anymore thanks to EXISTS and REMOVING checks in rbd_dev_update_size(). A similar race could occur on "rbd map", see commit 811c66887746 ("rbd: fix rbd map vs notify races"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 rbd: move rbd_dev_destroy() call out of rbd_dev_image_release() (Ilya Dryomov)
... to simplify error handling in do_rbd_add(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2017-05-04 ceph: when seeing write errors on an inode, switch to sync writes (Jeff Layton)
Currently, we don't have a real feedback mechanism in place for when we start seeing buffered writeback errors. If writeback is failing, there is nothing that prevents an application from continuing to dirty pages that aren't being cleaned. In the event that we're seeing write errors of any sort occur on an inode, have the callback set a flag to force further writes to be synchronous. When the next write succeeds, clear the flag to allow buffered writeback to continue. Since this is just a hint to the write submission mechanism, we only take the i_ceph_lock when a lockless check shows that the flag needs to be changed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
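The "check without the lock, lock only to change" pattern described above can be sketched in userspace like this. It is a model under assumptions, not the ceph code: `struct toy_inode`, the pthread mutex standing in for i_ceph_lock, and the `lock_taken` counter (present only to make the behaviour observable) are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_inode {
	pthread_mutex_t lock;     /* stands in for i_ceph_lock */
	atomic_bool error_write;  /* hint: force synchronous writes */
	int lock_taken;           /* counts lock acquisitions, for illustration */
};

static void toy_inode_init(struct toy_inode *ci)
{
	pthread_mutex_init(&ci->lock, NULL);
	atomic_init(&ci->error_write, false);
	ci->lock_taken = 0;
}

/* Called from the write completion path with the outcome of the write. */
static void toy_note_write_result(struct toy_inode *ci, bool failed)
{
	/* Lockless check first: skip the lock when the hint is already right. */
	if (atomic_load_explicit(&ci->error_write, memory_order_relaxed) == failed)
		return;

	pthread_mutex_lock(&ci->lock);
	atomic_store_explicit(&ci->error_write, failed, memory_order_relaxed);
	ci->lock_taken++;
	pthread_mutex_unlock(&ci->lock);
}

/* Write submission path: should the next write be synchronous? */
static bool toy_must_write_sync(struct toy_inode *ci)
{
	return atomic_load_explicit(&ci->error_write, memory_order_relaxed);
}
```

In the common case, where writes keep succeeding and the flag is already clear, the completion path never touches the lock. A stale read is tolerable because the flag is only a hint.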
2017-05-04 Revert "ceph: SetPageError() for writeback pages if writepages fails" (Jeff Layton)
This reverts commit b109eec6f4332bd517e2f41e207037c4b9065094.

If I'm filling up a filesystem with this sort of command:

    $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync

...then I'll eventually get back EIO on a write. Further calls will give us ENOSPC.

I'm not sure what prompted this change, but I don't think it's what we want to do. If writepages failed, we will have already set the mapping error appropriately, and that's what gets reported by fsync() or close().

__filemap_fdatawait_range however, does this:

    wait_on_page_writeback(page);
    if (TestClearPageError(page))
        ret = -EIO;

...and that -EIO ends up trumping the mapping's error if one exists. When writepages fails, we only want to set the error in the mapping, and not flag the individual pages.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 ceph: handle epoch barriers in cap messages (Jeff Layton)
Have the client store and update the osdc epoch_barrier when a cap message comes in with one. When sending cap messages, send the epoch barrier as well. This allows clients to inform servers that their released caps may not be used until a particular OSD map epoch. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>