summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-12-01NFSv4: Fix missing operation accounting in NFS4_dec_delegreturn_szTrond Myklebust
We need to account for the reply to the PUTFH operation in the DELEGRETURN compound. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Don't mark layout segments invalid on layoutreturn in pnfs_rocTrond Myklebust
The layoutreturn call will take care of invalidating the layout segments once the call is successful. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Get rid of unnecessary layout parameter in encode_layoutreturn callbackTrond Myklebust
The parameter is already present in the "args" structure. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Skip checking for return-on-close if the layout is invalidTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Remove spurious wake up in pnfs_layout_remove_lseg()Trond Myklebust
There is no change to the value of NFS_LAYOUT_RETURN, so we should not be waking up the RPC call. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFSv4: Ignore LAYOUTRETURN result if the layout doesn't match or is invalidTrond Myklebust
Fix a potential race with CB_LAYOUTRECALL in which the server recalls the remaining layout segments while our LAYOUTRETURN is still in transit. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Do not free layout segments that are marked for returnTrond Myklebust
We may want to process and transmit layout stat information for the layout segments that are being returned, so we should defer freeing them until after the layoutreturn has completed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Delay getting the layout header in CB_LAYOUTRECALL handlersTrond Myklebust
Instead of grabbing the layout, we want to get the inode so that we can reduce races between layoutget and layoutrecall when the server does not support call referring. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: consolidate the different range intersection testsTrond Myklebust
Both pnfs.c and the flexfiles code have their own versions of the range intersection testing, and the "end_offset" helper. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Fix race in pnfs_wait_on_layoutreturnTrond Myklebust
We must put the task to sleep while holding the inode->i_lock in order to ensure atomicity with the test for NFS_LAYOUT_RETURN. Fixes: 500d701f336b ("NFS41: make close wait for layoutreturn") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: On error, do not send LAYOUTGET until the LAYOUTRETURN has completedTrond Myklebust
If there is an I/O error, we should not call LAYOUTGET until the LAYOUTRETURN that reports the error is complete. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.8+
2016-12-01pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cacheTrond Myklebust
If the server sends us a completely new stateid, and the client thinks it already holds a layout, then force a retry of the LAYOUTGET after invalidating the existing layout in order to avoid corruption due to races. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01pNFS: Clear NFS_LAYOUT_RETURN_REQUESTED when invalidating the layout stateidTrond Myklebust
We must ensure that we don't schedule a layoutreturn if the layout stateid has been marked as invalid. Fixes: 2a59a0411671e ("pNFS: Fix pnfs_set_layout_stateid() to clear...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.8+
2016-12-01pNFS: Don't clear the layout stateid if a layout return is outstandingTrond Myklebust
If we no longer hold any layout segments, we're normally expected to consider the layout stateid to be invalid. However we cannot assume this if we're about to, or in the process of sending a layoutreturn. Fixes: 334a8f37115b ("pNFS: Don't forget the layout stateid if...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.8+
2016-12-01pNFS: Fix a deadlock between read resends and layoutreturnTrond Myklebust
We must not call nfs_pageio_init_read() on a new nfs_pageio_descriptor while holding a reference to a layout segment, as that can deadlock pnfs_update_layout(). Fixes: d67ae825a59d6 ("pnfs/flexfiles: Add the FlexFile Layout Driver") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.0+
2016-12-01NFSv4.1: Fix regression in callback retry handlingFred Isaman
When initializing a freshly created slot for the calllback channel, the seq_nr needs to be 0, not 1. Otherwise validate_seqid and nfs4_slot_wait_on_seqid get confused and believe that the mpty slot corresponds to a previously sent reply. Signed-off-by: Fred Isaman <fred.isaman@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFSv4: Optimise away forced revalidation when we know the attributes are OKTrond Myklebust
The NFS_INO_REVAL_FORCED flag needs to be set if we just got a delegation, and we see that there might still be some ambiguity as to whether or not our attribute or data cache are valid. In practice, this means that a call to nfs_check_inode_attributes() will have noticed a discrepancy between cached attributes and measured ones, so let's move the setting of NFS_INO_REVAL_FORCED to there. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFSv4: Don't request close-to-open attribute when holding a delegationTrond Myklebust
If holding a delegation, we do not need to ask the server to return close-to-open cache consistency attributes as part of the CLOSE compound. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFSv4: Don't request a GETATTR on open_downgrade.Trond Myklebust
If we're not closing the file completely, there is no need to request close-to-open attributes. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFSv4: Don't ask for the change attribute when reclaiming stateTrond Myklebust
We don't need to ask for the change attribute when returning a delegation or recovering from a server reboot, and it could actually cause us to obtain an incorrect value if we're using a pNFS flavour that requires LAYOUTCOMMIT. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01NFSv4: Don't check file access when reclaiming stateTrond Myklebust
If we're reclaiming state after a reboot, or as part of returning a delegation, we don't need to check access modes again. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01ext4: validate s_first_meta_bg at mount timeEryu Guan
Ralf Spenneberg reported that he hit a kernel crash when mounting a modified ext4 image. And it turns out that kernel crashed when calculating fs overhead (ext4_calculate_overhead()), this is because the image has very large s_first_meta_bg (debug code shows it's 842150400), and ext4 overruns the memory in count_overhead() when setting bitmap buffer, which is PAGE_SIZE. ext4_calculate_overhead(): buf = get_zeroed_page(GFP_NOFS); <=== PAGE_SIZE buffer blks = count_overhead(sb, i, buf); count_overhead(): for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400 ext4_set_bit(EXT4_B2C(sbi, s++), buf); <=== buffer overrun count++; } This can be reproduced easily for me by this script: #!/bin/bash rm -f fs.img mkdir -p /mnt/ext4 fallocate -l 16M fs.img mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img debugfs -w -R "ssv first_meta_bg 842150400" fs.img mount -o loop fs.img /mnt/ext4 Fix it by validating s_first_meta_bg first at mount time, and refusing to mount if its value exceeds the largest possible meta_bg number. Reported-by: Ralf Spenneberg <ralf@os-t.de> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01ext4: correctly detect when an xattr value has an invalid sizeEric Biggers
It was possible for an xattr value to have a very large size, which would then pass validation on 32-bit architectures due to a pointer wraparound. Fix this by validating the size in a way which avoids pointer wraparound. It was also possible that a value's size would fit in the available space but its padded size would not. This would cause an out-of-bounds memory write in ext4_xattr_set_entry when replacing the xattr value. For example, if an xattr value of unpadded size 253 bytes went until the very end of the inode or block, then using setxattr(2) to replace this xattr's value with 256 bytes would cause a write to the 3 bytes past the end of the inode or buffer, and the new xattr value would be incorrectly truncated. Fix this by requiring that the padded size fit in the available space rather than the unpadded size. This patch shouldn't have any noticeable effect on non-corrupted/non-malicious filesystems. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01ext4: don't read out of bounds when checking for in-inode xattrsEric Biggers
With i_extra_isize equal to or close to the available space, it was possible for us to read past the end of the inode when trying to detect or validate in-inode xattrs. Fix this by checking for the needed extra space first. This patch shouldn't have any noticeable effect on non-corrupted/non-malicious filesystems. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01ext4: forbid i_extra_isize not divisible by 4Eric Biggers
i_extra_isize not divisible by 4 is problematic for several reasons: - It causes the in-inode xattr space to be misaligned, but the xattr header and entries are not declared __packed to express this possibility. This may cause poor performance or incorrect code generation on some platforms. - When validating the xattr entries we can read past the end of the inode if the size available for xattrs is not a multiple of 4. - It allows the nonsensical i_extra_isize=1, which doesn't even leave enough room for i_extra_isize itself. Therefore, update ext4_iget() to consider i_extra_isize not divisible by 4 to be an error, like the case where i_extra_isize is too large. This also matches the rule recently added to e2fsck for determining whether an inode has valid i_extra_isize. This patch shouldn't have any noticeable effect on non-corrupted/non-malicious filesystems, since the size of ext4_inode has always been a multiple of 4. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fix from Miklos Szeredi: "This fixes a regression introduced in 4.8" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix d_real() for stacked fs
2016-12-01ext4: disable pwsalt ioctl when encryption disabled by configEric Biggers
On a CONFIG_EXT4_FS_ENCRYPTION=n kernel, the ioctls to get and set encryption policies were disabled but EXT4_IOC_GET_ENCRYPTION_PWSALT was not. But there's no good reason to expose the pwsalt ioctl if the kernel doesn't support encryption. The pwsalt ioctl was also disabled pre-4.8 (via ext4_sb_has_crypto() previously returning 0 when encryption was disabled by config) and seems to have been enabled by mistake when ext4 encryption was refactored to use fs/crypto/. So let's disable it again. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01ext4: get rid of ext4_sb_has_crypto()Eric Biggers
ext4_sb_has_crypto() just called through to ext4_has_feature_encrypt(), and all callers except one were already using the latter. So remove it and switch its one caller to ext4_has_feature_encrypt(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01ext4: fix inode checksum calculation problem if i_extra_size is smallDaeho Jeong
We've fixed the race condition problem in calculating ext4 checksum value in commit b47820edd163 ("ext4: avoid modifying checksum fields directly during checksum veficationon"). However, by this change, when calculating the checksum value of inode whose i_extra_size is less than 4, we couldn't calculate the checksum value in a proper way. This problem was found and reported by Nix, Thank you. Reported-by: Nix <nix@esperi.org.uk> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01ext4: warn when page is dirtied without buffersJan Kara
Warn when a page is dirtied without buffers (as that will likely lead to a crash in ext4_writepages()) or when it gets newly dirtied without the page being locked (as there is nothing that prevents buffers to get stripped just before calling set_page_dirty() under memory pressure). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01block: protect iterate_bdevs() against concurrent closeRabin Vincent
If a block device is closed while iterate_bdevs() is handling it, the following NULL pointer dereference occurs because bdev->b_disk is NULL in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in turn called by the mapping_cap_writeback_dirty() call in __filemap_fdatawrite_range()): BUG: unable to handle kernel NULL pointer dereference at 0000000000000508 IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20 PGD 9e62067 PUD 9ee8067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000 RIP: 0010:[<ffffffff81314790>] [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20 RSP: 0018:ffff880009f5fe68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940 RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0 RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860 R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38 FS: 00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0 Stack: ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001 ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f Call Trace: [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90 [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30 [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20 [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130 [<ffffffff811b2763>] sys_sync+0x63/0x90 [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d RIP [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20 RSP <ffff880009f5fe68> CR2: 0000000000000508 ---[ end trace 2487336ceb3de62d ]--- The crash is easily reproducible by running the following command, if an msleep(100) is inserted before the call to func() in iterate_devs(): while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done Fix it by holding the bd_mutex across the func() call and only calling func() if the bdev is opened. Cc: stable@vger.kernel.org Fixes: 5c0d6b60a0ba ("vfs: Create function for iterating over block devices") Reported-and-tested-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01SMB3: parsing for new snapshot timestamp mount parmSteve French
New mount option "snapshot=<time>" to allow mounting an earlier version of the remote volume (if such a snapshot exists on the server). Note that eventually specifying a snapshot time of 1 will allow the user to mount the oldest snapshot. A subsequent patch add the processing for that and another for actually specifying the "time warp" create context on SMB2/SMB3 open. Check to make sure SMB2 negotiated, and ensure that we use a different tcon if mount same share twice but with different snaphshot times Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-11-30isofs: add KERN_CONT to printing of ER recordsMike Rapoport
The ER records are printed without explicit log level presuming line continuation until "\n". After the commit 4bcc595ccd8 (printk: reinstate KERN_CONT for printing continuation lines), the ER records are printed a character per line. Adding KERN_CONT to appropriate printk statements restores the printout behavior. Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-30Btrfs: fix tree search logic when replaying directory entry deletesRobbie Ko
If a log tree has a layout like the following: leaf N: ... item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8 dir log end 1275809046 leaf N + 1: item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8 dir log end 18446744073709551615 ... When we pass the value 1275809046 + 1 as the parameter start_ret to the function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we end up with path->slots[0] having the value 239 (points to the last item of leaf N, item 240). Because the dir log item in that position has an offset value smaller than *start_ret (1275809046 + 1) we need to move on to the next leaf, however the logic for that is wrong since it compares the current slot to the number of items in the leaf, which is smaller and therefore we don't lookup for the next leaf but instead we set the slot to point to an item that does not exist, at slot 240, and we later operate on that slot which has unexpected content or in the worst case can result in an invalid memory access (accessing beyond the last page of leaf N's extent buffer). So fix the logic that checks when we need to lookup at the next leaf by first incrementing the slot and only after to check if that slot is beyond the last item of the current leaf. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: e02119d5a7b4 (Btrfs: Add a write ahead tree log to optimize synchronous operations) Cc: stable@vger.kernel.org # 2.6.29+ Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness]
2016-11-30Btrfs: fix deadlock caused by fsync when logging directory entriesRobbie Ko
While logging new directory entries, at tree-log.c:log_new_dir_dentries(), after we call btrfs_search_forward() we get a leaf with a read lock on it, and without unlocking that leaf we can end up calling btrfs_iget() to get an inode pointer. The later (btrfs_iget()) can end up doing a read-only search on the same tree again, if the inode is not in memory already, which ends up causing a deadlock if some other task in the meanwhile started a write search on the tree and is attempting to write lock the same leaf that btrfs_search_forward() locked while holding write locks on upper levels of the tree blocking the read search from btrfs_iget(). In this scenario we get a deadlock. So fix this by releasing the search path before calling btrfs_iget() at tree-log.c:log_new_dir_dentries(). Example trace of such deadlock: [ 4077.478852] kworker/u24:10 D ffff88107fc90640 0 14431 2 0x00000000 [ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [ 4077.494346] ffff880ffa56bad0 0000000000000046 0000000000009000 ffff880ffa56bfd8 [ 4077.502629] ffff880ffa56bfd8 ffff881016ce21c0 ffffffffa06ecb26 ffff88101a5d6138 [ 4077.510915] ffff880ebb5173b0 ffff880ffa56baf8 ffff880ebb517410 ffff881016ce21c0 [ 4077.519202] Call Trace: [ 4077.528752] [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs] [ 4077.536049] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4077.542574] [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs] [ 4077.550171] [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs] [ 4077.558252] [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs] [ 4077.566140] [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs] [ 4077.573928] [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs] [ 4077.582399] [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs] [ 4077.589896] [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs] [ 4077.599632] [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs] [ 4077.607134] [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs] [ 4077.615329] [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0 [ 4077.622043] [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0 [ 4077.628459] [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270 [ 4077.635759] [<ffffffff81052b0f>] ? kthread+0xaf/0xc0 [ 4077.641404] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4077.648696] [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90 [ 4077.654926] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4078.358087] kworker/u24:15 D ffff88107fcd0640 0 14436 2 0x00000000 [ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [ 4078.373574] ffff880ffa57fad0 0000000000000046 0000000000009000 ffff880ffa57ffd8 [ 4078.381864] ffff880ffa57ffd8 ffff88103004d0a0 ffffffffa06ecb26 ffff88101a5d6138 [ 4078.390163] ffff880fbeffc298 ffff880ffa57faf8 ffff880fbeffc2f8 ffff88103004d0a0 [ 4078.398466] Call Trace: [ 4078.408019] [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs] [ 4078.415322] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4078.421844] [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs] [ 4078.429438] [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs] [ 4078.437518] [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs] [ 4078.445404] [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs] [ 4078.453194] [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs] [ 4078.461663] [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs] [ 4078.469161] [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs] [ 4078.478893] [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs] [ 4078.486388] [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs] [ 4078.494561] [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0 [ 4078.501278] [<ffffffff8104a507>] ? pwq_activate_delayed_work+0x27/0x40 [ 4078.508673] [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0 [ 4078.515098] [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270 [ 4078.522396] [<ffffffff81052b0f>] ? kthread+0xaf/0xc0 [ 4078.528032] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4078.535325] [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90 [ 4078.541552] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4079.355824] user-space-program D ffff88107fd30640 0 32020 1 0x00000000 [ 4079.363716] ffff880eae8eba10 0000000000000086 0000000000009000 ffff880eae8ebfd8 [ 4079.372003] ffff880eae8ebfd8 ffff881016c162c0 ffffffffa06ecb26 ffff88101a5d6138 [ 4079.380294] ffff880fbed4b4c8 ffff880eae8eba38 ffff880fbed4b528 ffff881016c162c0 [ 4079.388586] Call Trace: [ 4079.398134] [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs] [ 4079.405431] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4079.411955] [<ffffffffa06876fb>] ? btrfs_lock_root_node+0x2b/0x40 [btrfs] [ 4079.419644] [<ffffffffa068ce83>] ? btrfs_search_slot+0xa03/0xb10 [btrfs] [ 4079.427237] [<ffffffffa06aba52>] ? btrfs_buffer_uptodate+0x52/0x70 [btrfs] [ 4079.435041] [<ffffffffa0689b60>] ? generic_bin_search.constprop.38+0x80/0x190 [btrfs] [ 4079.443897] [<ffffffffa068ea44>] ? btrfs_insert_empty_items+0x74/0xd0 [btrfs] [ 4079.451975] [<ffffffffa072c443>] ? copy_items+0x128/0x850 [btrfs] [ 4079.458890] [<ffffffffa072da10>] ? btrfs_log_inode+0x629/0xbf3 [btrfs] [ 4079.466292] [<ffffffffa06f34a1>] ? btrfs_log_inode_parent+0xc61/0xf30 [btrfs] [ 4079.474373] [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs] [ 4079.482161] [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs] [ 4079.489558] [<ffffffff8112777c>] ? do_fsync+0x4c/0x80 [ 4079.495300] [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10 [ 4079.501422] [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b [ 4079.508334] user-space-program D ffff88107fc30640 0 32021 1 0x00000004 [ 4079.516226] ffff880eae8efbf8 0000000000000086 0000000000009000 ffff880eae8effd8 [ 4079.524513] ffff880eae8effd8 ffff881030279610 ffffffffa06ecb26 ffff88101a5d6138 [ 4079.532802] ffff880ebb671d88 ffff880eae8efc20 ffff880ebb671de8 ffff881030279610 [ 4079.541092] Call Trace: [ 4079.550642] [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs] [ 4079.557941] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4079.564463] [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs] [ 4079.572058] [<ffffffffa06bb7d8>] ? btrfs_truncate_inode_items+0x168/0xb90 [btrfs] [ 4079.580526] [<ffffffffa06b04be>] ? join_transaction.isra.15+0x1e/0x3a0 [btrfs] [ 4079.588701] [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs] [ 4079.596196] [<ffffffffa0690ac6>] ? block_rsv_add_bytes+0x16/0x50 [btrfs] [ 4079.603789] [<ffffffffa06bc2e9>] ? btrfs_truncate+0xe9/0x2e0 [btrfs] [ 4079.610994] [<ffffffffa06bd00b>] ? btrfs_setattr+0x30b/0x410 [btrfs] [ 4079.618197] [<ffffffff81117c1c>] ? notify_change+0x1dc/0x680 [ 4079.624625] [<ffffffff8123c8a4>] ? aa_path_perm+0xd4/0x160 [ 4079.630854] [<ffffffff810f4fcb>] ? do_truncate+0x5b/0x90 [ 4079.636889] [<ffffffff810f59fa>] ? do_sys_ftruncate.constprop.15+0x10a/0x160 [ 4079.644869] [<ffffffff8110d87b>] ? SyS_fcntl+0x5b/0x570 [ 4079.650805] [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b [ 4080.410607] user-space-program D ffff88107fc70640 0 32028 12639 0x00000004 [ 4080.418489] ffff880eaeccbbe0 0000000000000086 0000000000009000 ffff880eaeccbfd8 [ 4080.426778] ffff880eaeccbfd8 ffff880f317ef1e0 ffffffffa06ecb26 ffff88101a5d6138 [ 4080.435067] ffff880ef7e93928 ffff880f317ef1e0 ffff880eaeccbc08 ffff880f317ef1e0 [ 4080.443353] Call Trace: [ 4080.452920] [<ffffffffa06ed15d>] ? btrfs_tree_read_lock+0xdd/0x190 [btrfs] [ 4080.460703] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4080.467225] [<ffffffffa06876bb>] ? btrfs_read_lock_root_node+0x2b/0x40 [btrfs] [ 4080.475400] [<ffffffffa068cc81>] ? btrfs_search_slot+0x801/0xb10 [btrfs] [ 4080.482994] [<ffffffffa06b2df0>] ? btrfs_clean_one_deleted_snapshot+0xe0/0xe0 [btrfs] [ 4080.491857] [<ffffffffa06a70a6>] ? btrfs_lookup_inode+0x26/0x90 [btrfs] [ 4080.499353] [<ffffffff810ec42f>] ? kmem_cache_alloc+0xaf/0xc0 [ 4080.505879] [<ffffffffa06bd905>] ? btrfs_iget+0xd5/0x5d0 [btrfs] [ 4080.512696] [<ffffffffa06caf04>] ? btrfs_get_token_64+0x104/0x120 [btrfs] [ 4080.520387] [<ffffffffa06f341f>] ? btrfs_log_inode_parent+0xbdf/0xf30 [btrfs] [ 4080.528469] [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs] [ 4080.536258] [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs] [ 4080.543657] [<ffffffff8112777c>] ? do_fsync+0x4c/0x80 [ 4080.549399] [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10 [ 4080.555534] [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: 2f2ff0ee5e43 (Btrfs: fix metadata inconsistencies after directory fsync) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness]
2016-11-30Btrfs: fix enospc in hole punchingRobbie Ko
The hole punching can result in adding new leafs (and as a consequence new nodes) to the tree because when we find file extent items that span beyond the hole range we may end up not deleting them (just adjusting them, reducing their range by reducing their length or increasing their offset field) and add new file extent items representing holes. So after splitting a leaf (therefore creating a new one) to insert a new file extent item representing a hole, a new node might be added to each level of the tree in the worst case scenario (since there's a new key and every parent node was full). For example if a file has an extent item representing the range 0 to 64Mb and we punch a hole in the range 1Mb to 20Mb, the existing extent item is duplicated and one of the copies is adjusted to represent the range 0 to 1Mb, the other copy adjusted to represent the range 20Mb to 64Mb, and a new file extent item representing a hole in the range 1Mb to 20Mb is inserted. Fix this by using btrfs_calc_trans_metadata_size() instead of btrfs_calc_trunc_metadata_size(), so that enough metadata space is reserved for the worst possible case. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness]
2016-11-30btrfs: improve delayed refs iterationsWang Xiaoguang
This issue was found when I tried to delete a heavily reflinked file, when deleting such files, other transaction operation will not have a chance to make progress, for example, start_transaction() will blocked in wait_current_trans(root) for long time, sometimes it even triggers soft lockups, and the time taken to delete such heavily reflinked file is also very large, often hundreds of seconds. Using perf top, it reports that: PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) --------------------------------------------------------------------------------------- 84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80 11.02% [kernel] [k] delay_tsc 0.79% [kernel] [k] _raw_spin_unlock_irq 0.78% [kernel] [k] _raw_spin_unlock_irqrestore 0.45% [kernel] [k] do_raw_spin_lock 0.18% [kernel] [k] __slab_alloc It seems __btrfs_run_delayed_refs() took most cpu time, after some debug work, I found it's select_delayed_ref() causing this issue, for a delayed head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but select_delayed_ref() will firstly try to iterate node list to find BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and waste much time. To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head, then in select_delayed_ref(), if this list is not empty, we can directly use nodes in this list. With this patch, it just took about 10~15 seconds to delte the same file. Now using perf top, it reports that: PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) ---------------------------------------------------------------------------------------- 20.74% [kernel] [k] _raw_spin_unlock_irqrestore 16.33% [kernel] [k] __slab_alloc 5.41% [kernel] [k] lock_acquired 4.42% [kernel] [k] lock_acquire 4.05% [kernel] [k] lock_release 3.37% [kernel] [k] _raw_spin_unlock_irq For normal files, this patch also gives help, at least we do not need to iterate whole list to found BTRFS_ADD_DELAYED_REF nodes. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: qgroup: Fix qgroup data leaking by using subtree tracingQu Wenruo
Commit 62b99540a1d91e464 (btrfs: relocation: Fix leaking qgroups numbers on data extents) only fixes the problem partly. The previous fix is to trace all new data extents at transaction commit time when balance finishes. However balance is not done in a large transaction, every path replacement can happen in its own transaction. This makes the fix useless if transaction commits during relocation. For example: relocate_block_group() |-merge_reloc_roots() | |- merge_reloc_root() | |- btrfs_start_transaction() <- Trans X | |- replace_path() <- Cause leak | |- btrfs_end_transaction_throttle() <- Trans X commits here | | Leak not fixed | | | |- btrfs_start_transaction() <- Trans Y | |- replace_path() <- Cause leak | |- btrfs_end_transaction_throttle() <- Trans Y ends | but not committed |-btrfs_join_transaction() <- Still trans Y |-qgroup_fix() <- Only fixes data leak | in trans Y |-btrfs_commit_transaction() <- Trans Y commits In that case, qgroup fixup can only fix data leak in trans Y, data leak in trans X is out of fix. So the correct fix should happen in the same transaction of replace_path(). This patch fixes it by tracing both subtrees of tree block swap, so it can fix the problem and ensure all leaking and fix are in the same transaction, so no leak again. Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: Export and move leaf/subtree qgroup helpers to qgroup.cQu Wenruo
Move account_shared_subtree() to qgroup.c and rename it to btrfs_qgroup_trace_subtree(). Do the same thing for account_leaf_items() and rename it to btrfs_qgroup_trace_leaf_items(). Since all these functions are only for qgroup, move them to qgroup.c and export them is more appropriate. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: qgroup: Rename functions to make it follow reserve,trace,account stepsQu Wenruo
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to btrfs_qgroup_trace_extent(_nolock), according to the new reserve/trace/account naming schema. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: qgroup: Add comments explaining how btrfs qgroup worksQu Wenruo
Add explaination how btrfs qgroups work. Qgroup is split into 3 main phrases: 1) Reserve To ensure qgroup doesn't exceed its limit 2) Trace To info qgroup to trace which extent 3) Account Calculate qgroup number change for each traced extent. This should save quite some time for new developers. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: use bio_for_each_segment_all in __btrfsic_submit_bioChristoph Hellwig
And remove the bogus check for a NULL return value from kmap, which can't happen. While we're at it: I don't think that kmapping up to 256 will work without deadlocks on highmem machines, a better idea would be to use vm_map_ram to map all of them into a single virtual address range. Incidentally that would also simplify the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: refactor __btrfs_lookup_bio_sums to use bio_for_each_segment_allChristoph Hellwig
Rework the loop a little bit to use the generic bio_for_each_segment_all helper for iterating over the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: calculate end of bio offset properlyChristoph Hellwig
Use the bvec offset and len members to prepare for multipage bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: use bi_sizeChristoph Hellwig
Instead of using bi_vcnt to calculate it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: don't access the bio directly in btrfs_csum_one_bioChristoph Hellwig
Use bio_for_each_segment_all to iterate over the segments instead. This requires a bit of reshuffling so that we only lookup up the ordered item once inside the bio_for_each_segment_all loop. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: don't access the bio directly in the direct I/O codeChristoph Hellwig
Just use bio_for_each_segment_all to iterate over all segments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: don't access the bio directly in the raid5/6 codeChristoph Hellwig
Just use bio_for_each_segment_all to iterate over all segments. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: use bio iterators for the decompression handlersChristoph Hellwig
Pass the full bio to the decompression routines and use bio iterators to iterate over the data in the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_spaceJeff Mahoney
This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount with qgroups enabled. I was able to reproduce this by setting up a small (~500kb) quota limit and writing a file one byte at a time until I hit the limit. The warnings would all hit on umount. The root cause is that we would reserve a block-sized range in both the reservation and the quota in btrfs_check_data_free_space, but if we encountered a problem (like e.g. EDQUOT), we would only release the single byte in the qgroup reservation. That caused an iotree state split, which increased the number of outstanding extents, in turn disallowing releasing the metadata reservation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>