summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-11-05Merge tag 'dlm-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm update from David Teigland: "This includes one simple fix to make posix locks interruptible by signals in cases where a signal handler is used" * tag 'dlm-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: make posix locks interruptible
2015-11-05Merge tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linuxLinus Torvalds
Pull file locking updates from Jeff Layton: "The largest series of changes is from Ben who offered up a set to add a new helper function for setting locks based on the type set in fl_flags. Dmitry also send in a fix for a potential race that he found with KTSAN" * tag 'locks-v4.4-1' of git://git.samba.org/jlayton/linux: locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait Move locks API users to locks_lock_inode_wait() locks: introduce locks_lock_inode_wait() locks: Use more file_inode and fix a comment fs: fix data races on inode->i_flctx locks: change tracepoint for generic_add_lease
2015-11-05Btrfs: fix sleeping inside atomic context in qgroup rescan workerFilipe Manana
We are holding a btree path with spinning locks and then we attempt to clone an extent buffer, which calls kmem_cache_alloc() and this function can sleep, causing the following trace to be reported on a debug kernel: [107118.218536] BUG: sleeping function called from invalid context at mm/slab.c:2871 [107118.224110] in_atomic(): 1, irqs_disabled(): 0, pid: 19148, name: kworker/u32:3 [107118.226120] INFO: lockdep is turned off. [107118.226843] Preemption disabled at:[<ffffffffa05ffa22>] btrfs_clear_lock_blocking_rw+0x96/0xea [btrfs] [107118.229175] CPU: 3 PID: 19148 Comm: kworker/u32:3 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [107118.231326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [107118.233687] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs] [107118.236835] 0000000000000000 ffff880424bf3b78 ffffffff812566f4 0000000000000000 [107118.238369] ffff880424bf3ba0 ffffffff81070664 ffffffff817f1cd5 0000000000000b37 [107118.239769] 0000000000000000 ffff880424bf3bc8 ffffffff8107070a 0000000000008850 [107118.241244] Call Trace: [107118.241729] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [107118.242602] [<ffffffff81070664>] ___might_sleep+0x23a/0x241 [107118.243586] [<ffffffff8107070a>] __might_sleep+0x9f/0xa6 [107118.244532] [<ffffffff8115af70>] cache_alloc_debugcheck_before+0x25/0x36 [107118.245939] [<ffffffff8115d52b>] kmem_cache_alloc+0x50/0x215 [107118.246930] [<ffffffffa05e627e>] __alloc_extent_buffer+0x2a/0x11f [btrfs] [107118.248121] [<ffffffffa05ecb1a>] btrfs_clone_extent_buffer+0x3d/0xdd [btrfs] [107118.249451] [<ffffffffa06239ea>] btrfs_qgroup_rescan_worker+0x16d/0x434 [btrfs] [107118.250755] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [107118.251754] [<ffffffffa05f7952>] normal_work_helper+0x14c/0x32a [btrfs] [107118.252899] [<ffffffffa05f7952>] ? normal_work_helper+0x14c/0x32a [btrfs] [107118.254195] [<ffffffffa05f7c82>] btrfs_qgroup_rescan_helper+0x12/0x14 [btrfs] [107118.255436] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [107118.263690] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [107118.264888] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [107118.267413] [<ffffffff8106904d>] kthread+0xef/0xf7 [107118.268417] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [107118.269505] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [107118.270491] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 So just use blocking locks for our path to solve this. This fixes the patch titled: "btrfs: qgroup: Don't copy extent buffer to do qgroup rescan" Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05Btrfs: fix race waiting for qgroup rescan workerFilipe Manana
We were initializing the completion (fs_info->qgroup_rescan_completion) object after releasing the qgroup rescan lock, which gives a small time window for a rescan waiter to not actually wait for the rescan worker to finish. Example: CPU 1 CPU 2 fs_info->qgroup_rescan_completion->done is 0 btrfs_qgroup_rescan_worker() complete_all(&fs_info->qgroup_rescan_completion) sets fs_info->qgroup_rescan_completion->done to UINT_MAX / 2 ... do some other stuff .... qgroup_rescan_init() mutex_lock(&fs_info->qgroup_rescan_lock) set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN in fs_info->qgroup_flags mutex_unlock(&fs_info->qgroup_rescan_lock) btrfs_qgroup_wait_for_completion() mutex_lock(&fs_info->qgroup_rescan_lock) sees flag BTRFS_QGROUP_STATUS_FLAG_RESCAN in fs_info->qgroup_flags mutex_unlock(&fs_info->qgroup_rescan_lock) wait_for_completion_interruptible( &fs_info->qgroup_rescan_completion) fs_info->qgroup_rescan_completion->done is > 0 so it returns immediately init_completion(&fs_info->qgroup_rescan_completion) sets fs_info->qgroup_rescan_completion->done to 0 So fix this by initializing the completion object while holding the mutex fs_info->qgroup_rescan_lock. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-05btrfs: qgroup: exit the rescan worker during umountJustin Maggard
I was hitting a consistent NULL pointer dereference during shutdown that showed the trace running through end_workqueue_bio(). I traced it back to the endio_meta_workers workqueue being poked after it had already been destroyed. Eventually I found that the root cause was a qgroup rescan that was still in progress while we were stopping all the btrfs workers. Currently we explicitly pause balance and scrub operations in close_ctree(), but we do nothing to stop the qgroup rescan. We should probably be doing the same for qgroup rescan, but that's a much larger change. This small change is good enough to allow me to unmount without crashing. Signed-off-by: Justin Maggard <jmaggard@netgear.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-11-05Btrfs: fix extent accounting for partial direct IO writesFilipe Manana
When doing a write using direct IO we can end up not doing the whole write operation using the direct IO path, in that case we fallback to a buffered write to do the remaining IO. This happens for example if the range we are writing to contains a compressed extent. When we do a partial write and fallback to buffered IO, due to the existence of a compressed extent for example, we end up not adjusting the outstanding extents counter of our inode which ends up getting decremented twice, once by the DIO ordered extent for the partial write and once again by btrfs_direct_IO(), resulting in an arithmetic underflow at extent-tree.c:drop_outstanding_extent(). For example if we have: extents [ prealloc extent ] [ compressed extent ] offsets A B C D E and at the moment our inode's outstanding extents counter is 0, if we do a direct IO write against the range [B, D[ (which has a length smaller than 128Mb), we end up bumping our inode's outstanding extents counter to 1, we create a DIO ordered extent for the range [B, C[ and then fallback to a buffered write for the range [C, D[. The direct IO handler (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by 1, leaving it with a value of 0, through a call to btrfs_delalloc_release_space() and then shortly after the DIO ordered extent finishes and calls btrfs_delalloc_release_metadata() which ends up to attempt to decrement the inode's outstanding extents counter by 1, resulting in an assertion failure at drop_outstanding_extent() because the operation would result in an arithmetic underflow (0 - 1). This produces the following trace: [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526 [125471.338844] ------------[ cut here ]------------ [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173! [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs] [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000 [125471.340745] RIP: 0010:[<ffffffffa0550da1>] [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP: 0018:ffff88040a11bc78 EFLAGS: 00010296 [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000 [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000 [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000 [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88 [125471.340745] FS: 0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000 [125471.340745] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0 [125471.340745] Stack: [125471.340745] ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1 [125471.340745] ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000 [125471.340745] 0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa [125471.340745] Call Trace: [125471.340745] [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs] [125471.340745] [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs] [125471.340745] [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs] [125471.340745] [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs] [125471.340745] [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs] [125471.340745] [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [125471.340745] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [125471.340745] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [125471.340745] [<ffffffff8106904d>] kthread+0xef/0xf7 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [125471.340745] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41 [125471.340745] RIP [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs] [125471.340745] RSP <ffff88040a11bc78> [125471.539620] ---[ end trace 144259f7838b4aa4 ]--- So fix this by ensuring we adjust the outstanding extents counter when we do the fallback just like we do for the case where the whole write can be done through the direct IO path. We were also adjusting the outstanding extents counter by a constant value of 1, which is incorrect because we were ignorning that we account extents in BTRFS_MAX_EXTENT_SIZE units, o fix that as well. The following test case for fstests reproduces this issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_xfs_io_command "falloc" rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create a compressed extent covering the range [700K, 800K[. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Create prealloc extent covering the range [600K, 700K[. $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo # Write 80K of data to the range [640K, 720K[ using direct IO. This # range covers both the prealloc extent and the compressed extent. # Because there's a compressed extent in the range we are writing to, # the DIO write code path ends up only writing the first 60k of data, # which goes to the prealloc extent, and then falls back to buffered IO # for writing the remaining 20K of data - because that remaining data # maps to a file range containing a compressed extent. # When falling back to buffered IO, we used to trigger an assertion when # releasing reserved space due to bad accounting of the inode's # outstanding extents counter, which was set to 1 but we ended up # decrementing it by 1 twice, once through the ordered extent for the # 60K of data we wrote using direct IO, and once through the main direct # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write # wrote less than 80K of data (60K). $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now similar test as above but for very large write operations. This # triggers special cases for an inode's outstanding extents accounting, # as internally btrfs logically splits extents into 128Mb units. $XFS_IO_PROG -f -s \ -c "pwrite -S 0xaa -b 128M 258M 128M" \ -c "falloc 0 258M" \ $SCRATCH_MNT/bar | _filter_xfs_io $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \ | _filter_xfs_io # Now verify the file contents are correct and that they are the same # even after unmounting and mounting the fs again (or evicting the page # cache). # # For file foo, all bytes in the range [0, 640K[ must have a value of # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb # and all bytes in the range [720K, 800K[ must have a value of 0xaa. # # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00, # all bytes in the range [3M, 259M[ must have a value of 0xbb and all # bytes in the range [259M, 386M[ must have a value of 0xaa. # echo "File digests before remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch _scratch_remount echo "File digests after remounting the file system:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch status=0 exit Fixes: e1cbbfa5f5aa ("Btrfs: fix outstanding_extents accounting in DIO") Fixes: 3e05bde8c3c2 ("Btrfs: only adjust outstanding_extents when we do a short write") Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-11-04Merge tag 'driver-core-4.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time" * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Add debugfs_create_ulong() of: to support binding numa node to specified device in devicetree debugfs: Add read-only/write-only bool file ops debugfs: Add read-only/write-only size_t file ops debugfs: Add read-only/write-only x64 file ops debugfs: Consolidate file mode checks in debugfs_create_*() Revert "mm: Check if section present during memory block (un)registering" driver-core: platform: Provide helpers for multi-driver modules mm: Check if section present during memory block (un)registering devres: fix a for loop bounds check CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit base/platform: assert that dev_pm_domain callbacks are called unconditionally sysfs: correctly handle short reads on PREALLOC attrs. base: soc: siplify ida usage kobject: move EXPORT_SYMBOL() macros next to corresponding definitions kobject: explain what kobject's sd field is debugfs: document that debugfs_remove*() accepts NULL and error values debugfs: Pass bool pointer to debugfs_create_bool() ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'
2015-11-04Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-04Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This is the core block pull request for 4.4. I've got a few more topic branches this time around, some of them will layer on top of the core+drivers changes and will come in a separate round. So not a huge chunk of changes in this round. This pull request contains: - Enable blk-mq page allocation tracking with kmemleak, from Catalin. - Unused prototype removal in blk-mq from Christoph. - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two xchg()'s, from Davidlohr. - A plug flush fix from Jeff. - Also from Jeff, a fix that means we don't have to update shared tag sets at init time unless we do a state change. This cuts down boot times on thousands of devices a lot with scsi/blk-mq. - blk-mq waitqueue barrier fix from Kosuke. - Various fixes from Ming: - Fixes for segment merging and splitting, and checks, for the old core and blk-mq. - Potential blk-mq speedup by marking ctx pending at the end of a plug insertion batch in blk-mq. - direct-io no page dirty on kernel direct reads. - A WRITE_SYNC fix for mpage from Roman" * 'for-4.4/core' of git://git.kernel.dk/linux-block: blk-mq: avoid excessive boot delays with large lun counts blktrace: re-write setting q->blk_trace blk-mq: mark ctx as pending at batch in flush plug path blk-mq: fix for trace_block_plug() block: check bio_mergeable() early before merging blk-mq: check bio_mergeable() early before merging block: avoid to merge splitted bio block: setup bi_phys_segments after splitting block: fix plug list flushing for nomerge queues blk-mq: remove unused blk_mq_clone_flush_request prototype blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write block: kmemleak: Track the page allocations for struct request
2015-11-04tracefs: Fix refcount imbalance in start_creating()Daniel Borkmann
In tracefs' start_creating(), we pin the file system to safely access its root. When we failed to create a file, we unpin the file system via failed_creating() to release the mount count and eventually the reference of the singleton vfsmount. However, when we run into an error during lookup_one_len() when still in start_creating(), we only release the parent's mutex but not so the reference on the mount. F.e., in securityfs_create_file(), after doing simple_pin_fs() when lookup_one_len() fails there, we infact do simple_release_fs(). This seems necessary here as well. Same issue seen in debugfs due to 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off"), which seemed to got carried over into tracefs, too. Noticed during code review. Link: http://lkml.kernel.org/r/68efa86101b778cf7517ed7c6ad573bd69f60ec6.1446672850.git.daniel@iogearbox.net Fixes: 4282d60689d4 ("tracefs: Add new tracefs file system") Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "There is only one new feature in this pull for the 4.4 merge window, most of it is small enhancements, cleanup and bug fixes: - Add the s390 backend for the software dirty bit tracking. This adds two new pgtable functions pte_clear_soft_dirty and pmd_clear_soft_dirty which is why there is a hit to arch/x86/include/asm/pgtable.h in this pull request. - A series of cleanup patches for the AP bus, this includes the removal of the support for two outdated crypto cards (PCICC and PCICA). - The irq handling / signaling on buffer full in the runtime instrumentation code is dropped. - Some micro optimizations: remove unnecessary memory barriers for a couple of functions: [smb_]rmb, [smb_]wmb, atomics, bitops, and for spin_unlock. Use the builtin bswap if available and make test_and_set_bit_lock more cache friendly. - Statistics and a tracepoint for the diagnose calls to the hypervisor. - The CPU measurement facility support to sample KVM guests is improved. - The vector instructions are now always enabled for user space processes if the hardware has the vector facility. This simplifies the FPU handling code. The fpu-internal.h header is split into fpu internals, api and types just like x86. - Cleanup and improvements for the common I/O layer. - Rework udelay to solve a problem with kprobe. udelay has busy loop semantics but still uses an idle processor state for the wait" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits) s390: remove runtime instrumentation interrupts s390/cio: de-duplicate subchannel validation s390/css: unneeded initialization in for_each_subchannel s390/Kconfig: use builtin bswap s390/dasd: fix disconnected device with valid path mask s390/dasd: fix invalid PAV assignment after suspend/resume s390/dasd: fix double free in dasd_eckd_read_conf s390/kernel: fix ptrace peek/poke for floating point registers s390/cio: move ccw_device_stlck functions s390/cio: move ccw_device_call_handler s390/topology: reduce per_cpu() invocations s390/nmi: reduce size of percpu variable s390/nmi: fix terminology s390/nmi: remove casts s390/nmi: remove pointless error strings s390: don't store registers on disabled wait anymore s390: get rid of __set_psw_mask() s390/fpu: split fpu-internal.h into fpu internals, api, and type headers s390/dasd: fix list_del corruption after lcu changes s390/spinlock: remove unneeded serializations at unlock ...
2015-11-04GFS2: Protect freeing directory hash table with i_lock spin_lockBob Peterson
This patch changes function gfs2_dir_hash_inval so it uses the i_lock spin_lock to protect the in-core hash table, i_hash_cache. This will prevent double-frees due to a race between gfs2_evict_inode and inode invalidation. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-03Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU changes from Ingo Molnar: "The main changes in this cycle were: - Improvements to expedited grace periods (Paul E McKenney) - Performance improvements to and locktorture tests for percpu-rwsem (Oleg Nesterov, Paul E McKenney) - Torture-test changes (Paul E McKenney, Davidlohr Bueso) - Documentation updates (Paul E McKenney) - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() rcu: Better hotplug handling for synchronize_sched_expedited() rcu: Enable stall warnings for synchronize_rcu_expedited() rcu: Add tasks to expedited stall-warning messages rcu: Add online/offline info to expedited stall warning message rcu: Consolidate expedited CPU selection rcu: Prepare for consolidating expedited CPU selection cpu: Remove try_get_online_cpus() rcu: Stop excluding CPU hotplug in synchronize_sched_expedited() rcu: Stop silencing lockdep false positive for expedited grace periods rcu: Switch synchronize_sched_expedited() to IPI locktorture: Fix module unwind when bad torture_type specified torture: Forgive non-plural arguments rcutorture: Fix unused-function warning for torturing_tasks() rcutorture: Fix module unwind when bad torture_type specified rcu_sync: Cleanup the CONFIG_PROVE_RCU checks locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read() locking/percpu-rwsem: Fix the comments outdated by rcu_sync locking/percpu-rwsem: Make use of the rcu_sync infrastructure locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe ...
2015-11-03Merge branch 'core-debug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull wchan kernel address hiding from Ingo Molnar: "This fixes a wchan related information leak in /proc/PID/stat. There's a bit of an ABI twist to it: instead of setting the wchan field to 0 (which is our usual technique) we set it conditionally to a 0/1 flag to keep ABI compatibility with older procps versions that only fetches /proc/PID/wchan (symbolic names) if the absolute wchan address is nonzero" * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/proc, core/debug: Don't expose absolute kernel addresses via wchan
2015-11-03nfs: Fix GETATTR bitmap verificationAndreas Gruenbacher
When decoding GETATTR replies, the client checks the attribute bitmap for which attributes the server has sent. It misses bits at the word boundaries, though; fix that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03nfs: Remove unused xdr page offsets in getacl/setacl argumentsAndreas Gruenbacher
The arguments passed around for getacl and setacl xdr encoding, struct nfs_setaclargs and struct nfs_getaclargs, both contain an array of pages, an offset into the first page, and the length of the page data. The offset is unused as it is always zero; remove it. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03fs/nfs: remove unnecessary new_valid_dev checkYaowei Bai
As new_valid_dev always returns 1, so !new_valid_dev check is not needed, remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-03dlm: make posix locks interruptibleEric Ren
Replace wait_event_killable with wait_event_interruptible so that a program waiting for a posix lock can be interrupted by a signal. With the killable version, a program was not interruptible by a signal if it had a signal handler set for it, overriding the default action of terminating the process. Signed-off-by: Eric Ren <zren@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>
2015-11-03Add resilienthandles mount parmSteve French
Since many servers (Windows clients, and non-clustered servers) do not support persistent handles but do support resilient handles, allow the user to specify a mount option "resilienthandles" in order to get more reliable connections and less chance of data loss (at least when SMB2.1 or later). Default resilient handle timeout (120 seconds to recent Windows server) is used. Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <steve.french@primarydata.com>
2015-11-03Btrfs: fix hole punching when using the no-holes featureFilipe Manana
When we are using the no-holes feature, if we punch a hole into a file range that already contains a hole which overlaps the range we are passing to fallocate(), we end up removing the extent map that represents the existing hole without adding a new one. This happens because with the no-holes feature we do not have explicit extent items to represent holes and therefore the call to __btrfs_drop_extents(), made from btrfs_punch_hole(), returns an end offset to the variable drop_end that is smaller than the end of the range passed to fallocate(), while it drops all existing extent maps in that range. Normally having a missing extent map is not a problem, for example for a readpages() operation we just end up building the extent map by looking at the fs/subvol tree for a matching extent item (or a lack of one for implicit holes). However for an fsync that uses the fast path, which needs to look at the list of modified extent maps, this means the fsync will not record information about the complete hole we had before the fallocate() call into the log tree, resulting in a file with content/layout that does not match what we had neither before nor after the hole punch operation. The following test case for fstests reproduces the issue. It fails without this change because we get a file with a different digest after the fsync log replay and also with a different extent/hole layout. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/punch . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_xfs_io_command "fpunch" _require_xfs_io_command "fiemap" _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV # This test was motivated by an issue found in btrfs when the btrfs # no-holes feature is enabled (introduced in kernel 3.14). So enable # the feature if the fs being tested is btrfs. if [ $FSTYP == "btrfs" ]; then _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes" fi rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create out test file with some data and then fsync it. # We do the fsync only to make sure the last fsync we do in this test # triggers the fast code path of btrfs' fsync implementation, a # condition necessary to trigger the bug btrfs had. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \ -c "fsync" \ $SCRATCH_MNT/foobar | _filter_xfs_io # Now punch a hole against the range [96K, 128K[. $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar # Punch another hole against a range that overlaps the previous range # and ends beyond eof. $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar # Punch another hole against a range that overlaps the first range # ([96K, 128K[) and ends at eof. $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar # Fsync our file. We want to verify that, after a power failure and # mounting the filesystem again, the file content reflects all the hole # punch operations. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar echo "File digest before power failure:" md5sum $SCRATCH_MNT/foobar | _filter_scratch echo "Fiemap before power failure:" $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap # Silently drop all writes and umount to simulate a crash/power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger log replay and validate file # contents. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey echo "File digest after log replay:" # Must match the same digest we got before the power failure. md5sum $SCRATCH_MNT/foobar | _filter_scratch echo "Fiemap after log replay:" # Must match the same extent listing we got before the power failure. $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap _unmount_flakey status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT statechandan
When executing generic/001 in a loop on a ppc64 machine (with both sectorsize and nodesize set to 64k), the following call trace is observed, WARNING: at /root/repos/linux/fs/btrfs/locking.c:253 Modules linked in: CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54 task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000 NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000 REGS: c0000000f600a820 TRAP: 0700 Not tainted (4.3.0-rc5-13676-ga5e681d) MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 24444884 XER: 00000000 CFAR: c0000000004a3b30 SOFTE: 1 GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0 GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005 GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030 GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800 GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049 GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000 GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000 GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0 NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0 LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80 Call Trace: [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable) [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80 [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00 [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120 [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620 [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0 [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420 [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030 [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250 [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590 [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780 [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250 [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00 [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0 [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570 [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0 [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30 [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0 [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30 [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0 [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0 [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0 [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100 [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80 [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68 Instruction dump: fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040 812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c The above call trace is seen even on x86_64; albeit very rarely and that too with nodesize set to 64k and with nospace_cache mount option being used. The reason for the above call trace is, btrfs_remove_chunk check_system_chunk Allocate chunk if required For each physical stripe on underlying device, btrfs_free_dev_extent ... Take lock on Device tree's root node btrfs_cow_block("dev tree's root node"); btrfs_reserve_extent find_free_extent index = BTRFS_RAID_DUP; have_caching_bg = false; When in LOOP_CACHING_NOWAIT state, Assume we find a block group which is being cached; Hence have_caching_bg is set to true When repeating the search for the next RAID index, we set have_caching_bg to false. Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we allocate a chunk and try to add entries corresponding to the chunk's physical stripe into the device tree. When doing so the task deadlocks itself waiting for the blocking lock on the root node of the device tree. This commit fixes the issue by introducing a new local variable to help indicate as to whether a block group of any RAID type is being cached. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03btrfs: Fix a data space underflow warningQu Wenruo
Even with quota disabled, generic/127 will trigger a kernel warning by underflow data space info. The bug is caused by buffered write, which in case of short copy, the start parameter for btrfs_delalloc_release_space() is wrong, and round_up/down() in btrfs_delalloc_release() extents the range to page aligned, decreasing one more page than expected. This patch will fix it by passing correct start. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-03[SMB3] Send durable handle v2 contexts when use of persistent handles requiredSteve French
Version 2 of the patch. Thanks to Dan Carpenter and the smatch tool for finding a problem in the first version of this patch. CC: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <steve.french@primarydata.com>
2015-11-03[SMB3] Display persistenthandles in /proc/mounts for SMB3 shares if enabledSteve French
Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2015-11-03[SMB3] Enable checking for continuous availability and persistent handle supportSteve French
Validate "persistenthandles" and "nopersistenthandles" mount options against the support the server claims in negotiate and tree connect SMB3 responses. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2015-11-03[SMB3] Add parsing for new mount option controlling persistent handlesSteve French
"nopersistenthandles" and "persistenthandles" mount options added. The former will not request persistent handles on open even when SMB3 negotiated and Continuous Availability share. The latter will request persistent handles (as long as server notes the capability in protocol negotiation) even if share is not Continuous Availability share. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2015-11-03Merge branch 'xfs-dax-updates' into for-nextDave Chinner
2015-11-03Merge branch 'xfs-misc-fixes-for-4.4-2' into for-nextDave Chinner
2015-11-03xfs: optimise away log forces on timestamp updates for fdatasyncDave Chinner
xfs: timestamp updates cause excessive fdatasync log traffic Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this: INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32 And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk. There are two parts to this problem. The first is that XFS relogs dirty regions in subsequent transactions, so it carries around the fields that have been dirtied since the last time the inode was written back to disk, not since the last time the inode was forced into the log. The second part is that on v5 filesystems, the inode change count update during inode dirtying also sets the XFS_ILOG_CORE flag, so on v5 filesystems this makes a timestamp update dirty the entire inode. As a result when fdatasync is run, it looks at the dirty fields in the inode, and sees more than just the timestamp flag, even though the only metadata change since the last fdatasync was just the timestamps. Hence we force the log on every subsequent fdatasync even though it is not needed. To fix this, add a new field to the inode log item that tracks changes since the last time fsync/fdatasync forced the log to flush the changes to the journal. This flag is updated when we dirty the inode, but we do it before updating the change count so it does not carry the "core dirty" flag from timestamp updates. The fields are zeroed when the inode is marked clean (due to writeback/freeing) or when an fsync/datasync forces the log. Hence if we only dirty the timestamps on the inode between fsync/fdatasync calls, the fdatasync will not trigger another log force. Over 100 runs of the test program: Ext4 baseline: runtime: 1.63s +/- 0.24s avg lat: 1.59ms +/- 0.24ms iops: ~2000 XFS, vanilla kernel: runtime: 2.45s +/- 0.18s avg lat: 2.39ms +/- 0.18ms log forces: ~400/s iops: ~1000 XFS, patched kernel: runtime: 1.49s +/- 0.26s avg lat: 1.46ms +/- 0.25ms log forces: ~30/s iops: ~1500 Reported-by: Sage Weil <sage@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: don't leak uuid table on rmmodDarrick J. Wong
Don't leak the UUID table when the module is unloaded. (Found with kmemleak.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: invalidate cached acl if set via ioctlAndreas Gruenbacher
Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs did until commit 6caa1056. Similar to that commit, invalidate cached acls when setting/removing them via the ioctl as well. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Plug memory leak in xfs_attrmulti_attr_setAndreas Gruenbacher
When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space buffer is copied into a new kernel-space buffer via memdup_user; that buffer then isn't freed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Validate the length of on-disk ACLsAndreas Gruenbacher
In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct, make sure that the length of the attributes is correct as well. Also, turn the aclp parameter into a const pointer. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: invalidate cached acl if set directly via xattrBrian Foster
ACLs are stored as extended attributes of the inode to which they apply. XFS converts the standard "system.posix_acl_[access|default]" attribute names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored on-disk. These xattrs are directly exposed in on-disk format via getxattr/setxattr, without any ACL aware code in the path to perform validation, etc. This is partly historical and supports backup/restore applications such as xfsdump to back up and restore the binary blob that represents ACLs as-is. Andreas reports that the ACLs observed via the getfacl interface is not consistent when ACLs are set directly via the setxattr path. This occurs because the ACLs are cached in-core against the inode and the xattr path has no knowledge that the operation relates to ACLs. Update the xattr set codepath to trap writes of the special XFS ACL attributes and invalidate the associated cached ACL when this occurs. This ensures that the correct ACLs are used on a subsequent operation through the actual ACL interface. Note that this does not update or add support for setting the ACL xattrs directly beyond the restore use case that requires a correctly formatted binary blob and to restore a consistent i_mode at the same time. It is still possible for a root user to set an invalid or inconsistent (with i_mode) ACL blob on-disk and potentially cause corruption. [ With fixes from Andreas Gruenbacher. ] Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: xfs_filemap_pmd_fault treats read faults as write faultsDave Chinner
The code initially committed didn't have the same checks for write faults as the dax_pmd_fault code and hence treats all faults as write faults. We can get read faults through this path because they is no pmd_mkwrite path for write faults similar to the normal page fault path. Hence we need to ensure that we only do c/mtime updates on write faults, and freeze protection is unnecessary for read faults. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: add ->pfn_mkwrite support for DAXDave Chinner
->pfn_mkwrite support is needed so that when a page with allocated backing store takes a write fault we can check that the fault has not raced with a truncate and is pointing to a region beyond the current end of file. This also allows us to update the timestamp on the inode, too, which fixes a generic/080 failure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: DAX does not use IO completion callbacksDave Chinner
For DAX, we are now doing block zeroing during allocation. This means we no longer need a special DAX fault IO completion callback to do unwritten extent conversion. Because mmap never extends the file size (it SEGVs the process) we don't need a callback to update the file size, either. Hence we can remove the completion callbacks from the __dax_fault and __dax_mkwrite calls. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: Don't use unwritten extents for DAXDave Chinner
DAX has a page fault serialisation problem with block allocation. Because it allows concurrent page faults and does not have a page lock to serialise faults to the same page, it can get two concurrent faults to the page that race. When two read faults race, this isn't a huge problem as the data underlying the page is not changing and so "detect and drop" works just fine. The issues are to do with write faults. When two write faults occur, we serialise block allocation in get_blocks() so only one faul will allocate the extent. It will, however, be marked as an unwritten extent, and that is where the problem lies - the DAX fault code cannot differentiate between a block that was just allocated and a block that was preallocated and needs zeroing. The result is that both write faults end up zeroing the block and attempting to convert it back to written. The problem is that the first fault can zero and convert before the second fault starts zeroing, resulting in the zeroing for the second fault overwriting the data that the first fault wrote with zeros. The second fault then attempts to convert the unwritten extent, which is then a no-op because it's already written. Data loss occurs as a result of this race. Because there is no sane locking construct in the page fault code that we can use for serialisation across the page faults, we need to ensure block allocation and zeroing occurs atomically in the filesystem. This means we can still take concurrent page faults and the only time they will serialise is in the filesystem mapping/allocation callback. The page fault code will always see written, initialised extents, so we will be able to remove the unwritten extent handling from the DAX code when all filesystems are converted. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: introduce BMAPI_ZERO for allocating zeroed extentsDave Chinner
To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. requested range spans allocated and hole regions, allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write() as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to the xfs_alloc_vextent() and issue the block zeroing at this point. While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can per-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices so doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() to this operation, so it may be useful in future and hence the "allocate zeroed blocks" API needs to be implementation neutral. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: fix inode size update overflow in xfs_map_direct()Dave Chinner
Both direct IO and DAX pass an offset and count into get_blocks that will overflow a s64 variable when an IO goes into the last supported block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen from the tracing: xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096 0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that overflows the s64 offset and we fail to detect the need for a filesize update and an ioend is not allocated. This is *mostly* avoided for direct IO because such extending IOs occur with full block allocation, and so the "IS_UNWRITTEN()" check still evaluates as true and we get an ioend that way. However, doing single sector extending IOs to this last block will expose the fact that file size updates will not occur after the first allocating direct IO as the overflow will then be exposed. There is one further complexity: the DAX page fault path also exposes the same issue in block allocation. However, page faults cannot extend the file size, so in this case we want to allocate the block but do not want to allocate an ioend to enable file size update at IO completion. Hence we now need to distinguish between the direct IO patch allocation and dax fault path allocation to avoid leaking ioend structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-02libceph: msg signing callouts don't need con argumentIlya Dryomov
We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02ceph: make fsync() wait unsafe requests that created/modified inodeYan, Zheng
If we get a unsafe reply for request that created/modified inode, add the unsafe request to a list in the newly created/modified inode. So we can make fsync() wait these unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02ceph: add request to i_unsafe_dirops when getting unsafe replyYan, Zheng
Previously we add request to i_unsafe_dirops when registering request. So ceph_fsync() also waits for imcomplete requests. This is unnecessary, ceph_fsync() only needs to wait unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02ceph: don't invalidate page cache when inode is no longer usedYan, Zheng
ceph_check_caps() invalidate page cache when inode is not used by any open file. This behaviour is not friendly for workload that repeatly read files. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02ceph: combine as many iovec as possile into one OSD requestZhu, Caifeng
Both ceph_sync_direct_write and ceph_sync_read iterate iovec elements one by one, send one OSD request for each iovec. This is sub-optimal, We can combine serveral iovec into one page vector, and send an OSD request for the whole page vector. Signed-off-by: Zhu, Caifeng <zhucaifeng@unissoft-nj.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02ceph: fix message length computationArnd Bergmann
create_request_message() computes the maximum length of a message, but uses the wrong type for the time stamp: sizeof(struct timespec) may be 8 or 16 depending on the architecture, while sizeof(struct ceph_timespec) is always 8, and that is what gets put into the message. Found while auditing the uses of timespec for y2038 problems. Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02ceph: fix a comment typoGeliang Tang
Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-11-02Merge tag 'nfs-rdma-4.4-2' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust
NFS: NFSoRDMA Client Side Changes In addition to a variety of bugfixes, these patches are mostly geared at enabling both swap and backchannel support to the NFS over RDMA client. Signed-off-by: Anna Schumake <Anna.Schumaker@Netapp.com>
2015-11-02pstore: fix code comment to match codeGeliang Tang
Fix code comment about kmsg_dump register so it matches the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-11-02NFS: Enable client side NFSv4.1 backchannel to use other transportsChuck Lever
Forechannel transports get their own "bc_up" method to create an endpoint for the backchannel service. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna Schumaker: Add forward declaration of struct net to xprt.h] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>