path: root/fs
2015-06-02  btrfs: make root id query unprivileged  (David Sterba)
The INO_LOOKUP ioctl can look up the path for a given inode number and is thus restricted. As a side effect it can find the root id of the containing subvolume, and we're using this in the 'btrfs inspect rootid' command. The restriction is unnecessary in case we set the ioctl args as follows:

    args::treeid = 0
    args::objectid = 256 (BTRFS_FIRST_FREE_OBJECTID)

Then the path will be empty and the treeid is filled with the root id of the inode on which the ioctl is called. This behaviour is unchanged after the root restriction is removed.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
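A minimal userspace sketch of the query described above (error handling trimmed; BTRFS_FIRST_FREE_OBJECTID is defined locally since it may not be exported by the uapi header on all kernels):

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    #ifndef BTRFS_FIRST_FREE_OBJECTID
    #define BTRFS_FIRST_FREE_OBJECTID 256ULL
    #endif

    int main(int argc, char **argv)
    {
        struct btrfs_ioctl_ino_lookup_args args;
        int fd;

        if (argc != 2)
            return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        memset(&args, 0, sizeof(args));
        args.treeid = 0;                            /* "fill in the containing root" */
        args.objectid = BTRFS_FIRST_FREE_OBJECTID;  /* yields an empty path */

        if (ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args) < 0)
            return 1;
        printf("rootid: %llu\n", (unsigned long long)args.treeid);
        close(fd);
        return 0;
    }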
2015-06-02  Btrfs: fix block group ->space_info null pointer dereference  (Filipe Manana)
When we create a block group we add it to the rbtree of block groups before setting its ->space_info field (while it's NULL). This is problematic since other tasks can access the block group from the rbtree and attempt to use its ->space_info before it is set by btrfs_make_block_group(). This can happen for example when a concurrent fitrim ioctl operation is ongoing, which produces a trace like the following when CONFIG_DEBUG_PAGEALLOC is set.

[11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0
[11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou
[11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000
[11509.608179] RIP: 0010:[<ffffffff8107d675>]  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179] RSP: 0018:ffff8801b3edf9e8  EFLAGS: 00010002
[11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000
[11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000
[11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000
[11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0
[11509.608179] FS:  00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
[11509.608179] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0
[11509.608179] Stack:
[11509.608179]  0000000000000000 0000000000000000 0000000000000004 0000000000000000
[11509.608179]  ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000
[11509.608179]  0000000000000001 0000000000000000 ffff880100000000 00000000000006c4
[11509.608179] Call Trace:
[11509.608179]  [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02
[11509.608179]  [<ffffffff8107e806>] lock_acquire+0xa5/0x116
[11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44
[11509.608179]  [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs]
[11509.608179]  [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs]
[11509.608179]  [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs]
[11509.608179]  [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs]
[11509.608179]  [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs]
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81123bda>] ? might_fault+0x58/0xb5
[11509.608179]  [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e
[11509.608179]  [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479
[11509.608179]  [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e
[11509.608179]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[11509.608179]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[11509.608179]  [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f
[11509.608179]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00
[11509.608179] RIP  [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02
[11509.608179]  RSP <ffff8801b3edf9e8>
[11509.608179] CR2: 0000000000000018
[11509.608179] ---[ end trace 570a5c6769f0e49a ]---

Which corresponds to the following access in fs/btrfs/free-space-cache.c:

    static int do_trimming(struct btrfs_block_group_cache *block_group,
                           u64 *total_trimmed, u64 start, u64 bytes,
                           u64 reserved_start, u64 reserved_bytes,
                           struct btrfs_trim_range *trim_entry)
    {
            struct btrfs_space_info *space_info = block_group->space_info;
            (...)
            spin_lock(&space_info->lock);
                        ^^^^^ - block_group->space_info is NULL...

Fix this by ensuring the block group's ->space_info is set before adding the block group to the rbtree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
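The shape of the fix, as a hedged sketch (hypothetical helper and type names, not the actual btrfs code): a shared object must be fully initialized before it is published in a structure that concurrent readers, here the fitrim path, can search.

    /* Illustrative only -- not the btrfs API. */
    static int make_block_group(struct fs_info *fs_info, struct block_group *bg)
    {
        bg->space_info = find_space_info(fs_info, bg->flags);   /* init first */
        if (!bg->space_info)
            return -ENOENT;
        rbtree_insert(&fs_info->block_groups, bg);              /* publish last */
        return 0;
    }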
2015-06-02  Btrfs: check error before reporting missing device and add uuid  (Anand Jain)
Report the missing device only when adding it succeeds; otherwise it would exit with ENOMEM before reporting. Also add the device uuid to the report. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02  btrfs: Fix superblock csum type check  (Qu Wenruo)
The old csum type check is wrong and can't catch csum_type 1 (not supported). Fix it to avoid a division by zero triggered by a hostile image. Reported-by: Lukas Lueg <lukas.lueg@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
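A hedged sketch of the safer check (names hypothetical; the real table lives in the btrfs superblock code): validate the type against the table bounds instead of comparing against a single known value, so the later size lookup can neither index out of bounds nor yield a zero divisor.

    #include <stddef.h>

    /* Hypothetical stand-in for the kernel's csum size table. */
    static const int csum_sizes[] = { 4 /* CRC32C */ };

    static int csum_size_or_error(unsigned int csum_type)
    {
        /* Rejects every unsupported type, including csum_type 1. */
        if (csum_type >= sizeof(csum_sizes) / sizeof(csum_sizes[0]))
            return -1;
        return csum_sizes[csum_type];
    }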
2015-06-02  Btrfs: incremental send, fix clone operations for compressed extents  (Filipe Manana)
Marc reported a problem where the receiving end of an incremental send was performing clone operations that failed with -EINVAL. This happened because, unlike for uncompressed extents, we were not checking if the source clone offset and length, after summing the data offset, fall within the source file's boundaries. So make sure we do such checks when attempting to issue clone operations for compressed extents.

The problem is reproducible with the following steps:

    $ mkfs.btrfs -f /dev/sdb
    $ mount -o compress /dev/sdb /mnt
    $ mkfs.btrfs -f /dev/sdc
    $ mount -o compress /dev/sdc /mnt2

    # Create the file with a single extent of 128K. This creates a metadata file
    # extent item with a data start offset of 0 and a logical length of 128K.
    $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo

    # Now rewrite 112K of our file starting at offset 64K. This will make the
    # inode's metadata continue to point to the 128K extent we created before,
    # but now with an extent item that points to the extent with a data start
    # offset of 112K and a logical length of 16K.
    # That metadata file extent item is associated with the logical file offset
    # at 176K and covers the logical file range 176K to 192K.
    $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo

    # Now rewrite 12K starting at offset 180K. This will make the inode's
    # metadata continue to point to the 128K extent we created earlier, with a
    # single extent item that points to it with a start offset of 112K and a
    # logical length of 4K.
    # That metadata file extent item is associated with the logical file offset
    # at 176K and covers the logical file range 176K to 180K.
    $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1

    $ touch /mnt/bar
    # Calls the btrfs clone ioctl.
    $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \
          -l $((4 * 1024)) /mnt/foo /mnt/bar

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2

    $ btrfs send /mnt/snap1 | btrfs receive /mnt2
    At subvol /mnt/snap1
    At subvol snap1

    $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
    At subvol /mnt/snap2
    At snapshot snap2
    ERROR: failed to clone extents to bar
    Invalid argument

A test case for fstests follows soon.

Reported-by: Marc MERLIN <marc@merlins.org>
Tested-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.cz>
Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
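The check being added can be pictured as below (a sketch under assumed names, not the send.c code): the clone source range is only valid once the extent's data offset has been folded in and the result still lies within the source file.

    #include <stdint.h>

    /* Sketch: validate a clone of a compressed extent. 'data_offset'
     * is the extent item's data start offset, 'offset'/'len' the
     * requested clone range, 'src_size' the source file size. */
    static int clone_range_valid(uint64_t data_offset, uint64_t offset,
                                 uint64_t len, uint64_t src_size)
    {
        uint64_t start = data_offset + offset;

        if (start + len < start)        /* u64 overflow */
            return 0;
        return start + len <= src_size;
    }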
2015-06-02  btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation()  (Christian Engelmayer)
Commit 9c8b35b1ba21 ("btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.") introduced the allocation of a temporary ulist in function btrfs_add_qgroup_relation() and added the corresponding cleanup to the out path. However, the allocation was introduced before the src/dst level check that directly returns. Fix the possible leakage of the ulist by moving the allocation after the input validation. Detected by Coverity CID 1295988. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
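The pattern of the fix, reduced to a standalone sketch (illustrative names, not the qgroup code): validate inputs before allocating so an early return cannot leak.

    #include <errno.h>
    #include <stdlib.h>

    /* Illustrative fix pattern: input validation first, allocation
     * second, so the -EINVAL return cannot leak 'tmp'. */
    static int add_relation(int src_level, int dst_level)
    {
        char *tmp;

        if (src_level >= dst_level)
            return -EINVAL;         /* nothing allocated yet */

        tmp = malloc(64);
        if (!tmp)
            return -ENOMEM;
        /* ... do the work ... */
        free(tmp);
        return 0;
    }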
2015-06-02  Btrfs: fix mutex unlock without prior lock on space cache truncation  (Filipe Manana)
If the call to btrfs_truncate_inode_items() failed and we don't have a block group, we were unlocking the cache_write_mutex without having locked it (we do it only if we have a block group). Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
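The bug class, sketched (kernel-flavored fragment, not the actual btrfs code): the unlock must be guarded by the same condition as the lock.

    /* Buggy code unlocked unconditionally on the error path even
     * though the lock was taken only when 'block_group' was set. */
    if (block_group)
        mutex_lock(&cache_write_mutex);

    ret = truncate_space_cache(inode);      /* may fail */

    if (block_group)    /* fix: unlock under the same condition */
        mutex_unlock(&cache_write_mutex);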
2015-06-02  Btrfs: log when missing device is created  (Anand Jain)
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02  btrfs: fix warnings after changes in btrfs_abort_transaction  (David Sterba)
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, tree_root,
   ^
  CC [M]  fs/btrfs/ioctl.o
fs/btrfs/ioctl.c: In function ‘create_subvol’:
fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, root, PTR_ERR(new_root));

PTR_ERR returns long, but we're really using 'int' for the error codes everywhere, so just set and use the local variable.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02  btrfs: add 'cold' compiler annotations to all error handling functions  (David Sterba)
The annotated functions will be placed into the .text.unlikely section. The annotation also hints to the compiler to move the code out of the hot paths, and may implicitly mark an if-statement leading to that block as unlikely. This is a heuristic; the impact on the generated code is not significant. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
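In plain C, the annotation amounts to the following (a minimal standalone illustration of GCC's cold attribute, not the kernel's exact macro definition):

    #include <stdio.h>

    /* GCC/Clang place cold functions in .text.unlikely and bias
     * branch layout away from them at their callsites. */
    #define __cold __attribute__((cold))

    static __cold void report_error(const char *msg)
    {
        fprintf(stderr, "error: %s\n", msg);    /* rarely executed */
    }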
2015-06-02  btrfs: report exact callsite where transaction abort occurs  (David Sterba)
WARN is called from a single location and all bug reports say that's in super.c __btrfs_abort_transaction. This is slightly confusing as we'd rather want to know the exact callsite. While this information is printed in the syslog below the stacktrace, it requires a further look and we usually see only the headline from WARNING. Moving the WARN into the macro has to inline some code and increases the module size by a few kilobytes:

     text    data     bss     dec     hex filename
   835481   20305   14120  869906   d4612 btrfs.ko.before
   842883   20305   14120  877308   d62fc btrfs.ko.after

The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config. The increase is not small and could lead to worse icache use. The code is on error/exit paths that can be recognized by the compiler as cold and moved out of the way, so the impact is speculated to be low, if measurable at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
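Why a macro reports the exact callsite while a shared function cannot, in a minimal sketch (hypothetical macro, not the btrfs definition): __FILE__, __LINE__ and __func__ expand where the macro is used.

    #include <stdio.h>

    /* Each expansion captures its own file/line/function, so the
     * warning points at the real abort site, not a common helper. */
    #define abort_transaction(errno_)                                   \
        do {                                                            \
            fprintf(stderr, "transaction abort at %s:%d (%s): %d\n",    \
                    __FILE__, __LINE__, __func__, (errno_));            \
        } while (0)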
2015-06-02  btrfs: let tree defrag work in SSD mode  (David Sterba)
Long time ago (2008) the defrag was automatic for new b-tree writes but it was disabled after performance problems. There was a leftover in tree-defrag.c that effectively stops any defragmentation on b-trees. This is a bit unexpected and IMHO undesired. The SSD mode is an optimization and defrag is supposed to work if the user asks for it.

Related commits:

    6702ed490ca0bb44e17131818a5a18b773957c5a
    Btrfs: Add run time btree defrag, and an ioctl to force btree defrag

    e18e4809b10e6c9efb5fe10c1ddcb4ebb690d517
    Btrfs: Add mount -o ssd, which includes optimizations for seek free storage

    b3236e68bf86b3ae87f58984a1822369225211cb
    Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there

    9afbb0b752ef30a429c45b9de6706e28ad1a36e1
    Btrfs: Disable tree defrag in SSD mode

The last three commits switch the defrag+ssd off/on/off and the last one

    3f157a2fd2ad731e1ed9964fecdc5f459f04a4a4
    Btrfs: Online btree defragmentation fixes

misses the bits from tree-defrag.c to revert to the behaviour introduced in e18e4809b10e.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02  Btrfs: check pending chunks when shrinking fs to avoid corruption  (Filipe Manana)
When we shrink the usable size of a device (its total_bytes), we go over all the device extent items in the device tree and attempt to relocate the chunk of any device extent that goes beyond the new usable size for the device. We do that after setting the new usable size (total_bytes) in the device object, so that all new allocations (and reallocations) don't use areas of the device that go beyond the new (shorter) size.

However we were not considering that, before setting the new size in the device, pending chunks might have been created that use device extents that go beyond the new size, and those device extents are not yet in the device tree when we search it - they are still attached to the list of new block groups of some ongoing transaction handle, and they are only added to the device tree when the transaction handle is ended (via btrfs_create_pending_block_groups()).

So check for pending chunks with device extents that go beyond the new size and, if any exists, commit the current transaction and repeat the search in the device tree. Not doing this would mean we would return success to user space while still having extents that go beyond the new size, and later user space could overwrite those locations on the device while the fs still references them, causing all sorts of corruption and unexpected events.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
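The retry logic can be sketched as follows (hypothetical helpers; the real code walks the device tree and the transaction's pending block group list):

    /* Sketch only -- not the btrfs shrink code. */
    static int shrink_device(struct device *device, u64 new_size)
    {
    again:
        relocate_extents_beyond(device, new_size);
        if (pending_chunks_beyond(device, new_size)) {
            /* Pending block groups only reach the device tree
             * when the transaction ends: commit and rescan. */
            commit_current_transaction();
            goto again;
        }
        return 0;
    }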
2015-06-02  Btrfs: don't invalidate root dentry when subvolume deletion fails  (Omar Sandoval)
Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem:

    # btrfs subvol list /
    ID 257 gen 306 top level 5 path rootvol
    ID 267 gen 334 top level 5 path snap1
    # btrfs subvol get-default /
    ID 267 gen 334 top level 5 path snap1
    # btrfs inspect-internal rootid /
    267
    # mount -o subvol=/ /dev/vda1 /mnt
    # btrfs subvol del /mnt/snap1
    Delete subvolume (no-commit): '/mnt/snap1'
    ERROR: cannot delete '/mnt/snap1' - Operation not permitted
    # findmnt /
    findmnt: can't read /proc/mounts: No such file or directory
    # ls /proc
    #

Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume.

Cc: <stable@vger.kernel.org>
Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler <mschauler@gmail.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03  Btrfs: incremental send, check if orphanized dir inode needs delayed rename  (Filipe Manana)
If a directory inode is orphanized, because some inode previously processed has a new name that collides with the old name of the current inode, we need to check if it needs its rename operation delayed too, as its ancestor-descendent relationship with some other inode might have been reversed between the parent and send snapshots and therefore its rename operation needs to happen after that other inode is renamed.

For example, for the following reproducer where this is needed (provided by Robbie Ko):

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt2

    $ mkdir -p /mnt/data/n1/n2
    $ mkdir /mnt/data/n4
    $ mkdir -p /mnt/data/t6/t7
    $ mkdir /mnt/data/t5
    $ mkdir /mnt/data/t7
    $ mkdir /mnt/data/n4/t2
    $ mkdir /mnt/data/t4
    $ mkdir /mnt/data/t3
    $ mv /mnt/data/t7 /mnt/data/n4/t2
    $ mv /mnt/data/t4 /mnt/data/n4/t2/t7
    $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4
    $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5
    $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6
    $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6
    $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2
    $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1

    $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4
    $ mv /mnt/data/n4/t2 /mnt/data/n4/n1
    $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2
    $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2
    $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2
    $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6
    $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3
    $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2

    $ btrfs send /mnt/snap1 | btrfs receive /mnt2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
    ERROR: send ioctl failed with -12: Cannot allocate memory

Where the parent snapshot directory hierarchy is the following:

    .                                    (ino 256)
    |-- data/                            (ino 257)
        |-- n4/                          (ino 260)
            |-- t2/                      (ino 265)
                |-- t7/                  (ino 264)
                    |-- t4/              (ino 266)
                        |-- t5/          (ino 263)
                            |-- t6/      (ino 261)
                                |-- n1/  (ino 258)
                                |-- n2/  (ino 259)
                                    |-- t7/      (ino 262)
                                        |-- t3/  (ino 267)

And the send snapshot's directory hierarchy is the following:

    .                            (ino 256)
    |-- data/                    (ino 257)
        |-- n4/                  (ino 260)
            |-- n1/              (ino 258)
                |-- t2/          (ino 265)
                    |-- n2/      (ino 259)
                    |-- t3/      (ino 267)
                    |   |-- t7   (ino 264)
                    |-- t6/      (ino 261)
                    |   |-- t4/  (ino 266)
                    |       |-- t5/  (ino 263)
                    |-- t7/      (ino 262)

While processing inode 262 we orphanize inode 264 and later attempt to rename inode 264 to its new name/location, which resulted in building an incorrect destination path string for the rename operation with the value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must have been done only after inode 267 is processed and renamed, as the ancestor-descendent relationship between inodes 264 and 267 was reversed between both snapshots, because otherwise it results in an infinite loop when building the path string for inode 264 when we are processing an inode with a number larger than 264. That loop is the following:

    start inode 264, send progress of 265 for example
    parent of 264 -> 267
    parent of 267 -> 262
    parent of 262 -> 259
    parent of 259 -> 261
    parent of 261 -> 263
    parent of 263 -> 266
    parent of 266 -> 264
      |--> back to first iteration while current path string length
           is <= PATH_MAX, and fail with -ENOMEM otherwise

So fix this by performing the check for whether a directory rename needs to be delayed regardless of the current inode having been orphanized or not. A test case for fstests follows soon.

Thanks to Robbie Ko for providing a reproducer for this problem.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-03  Btrfs: incremental send, don't delay directory renames unnecessarily  (Filipe Manana)
Even though we delay the rename of directories when they become descendents of other directories that were also renamed in the send root, to prevent infinite path build loops, we were doing it in cases where this was not needed and was actually harmful, resulting in infinite path build loops as we ended up with a circular dependency of delayed directory renames.

Consider the following reproducer:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt
    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt2

    $ mkdir /mnt/data
    $ mkdir /mnt/data/n1
    $ mkdir /mnt/data/n1/n2
    $ mkdir /mnt/data/n4
    $ mkdir /mnt/data/n1/n2/p1
    $ mkdir /mnt/data/n1/n2/p1/p2
    $ mkdir /mnt/data/t6
    $ mkdir /mnt/data/t7
    $ mkdir -p /mnt/data/t5/t7
    $ mkdir /mnt/data/t2
    $ mkdir /mnt/data/t4
    $ mkdir -p /mnt/data/t1/t3
    $ mkdir /mnt/data/p1
    $ mv /mnt/data/t1 /mnt/data/p1
    $ mkdir -p /mnt/data/p1/p2
    $ mv /mnt/data/t4 /mnt/data/p1/p2/t1
    $ mv /mnt/data/t5 /mnt/data/n4/t5
    $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2
    $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7
    $ mv /mnt/data/t2 /mnt/data/n4/t1
    $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1
    $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2
    $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1
    $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7
    $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3
    $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1
    $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5
    $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1
    $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1

    $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1
    $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/
    $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2
    $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1
    $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1
    $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3
    $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2
    $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7
    $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1
    $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5
    $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5
    $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2
    $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2

    $ btrfs send /mnt/snap1 | btrfs receive /mnt2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2
    ERROR: send ioctl failed with -12: Cannot allocate memory

This reproducer resulted in an infinite path build loop when building the path for inode 266 because the following circular dependency of delayed directory renames was created:

    ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261

Where the notation "X <- Y" means the rename of inode X is delayed by the rename of inode Y (X will be renamed after Y is renamed). This resulted in an infinite path build loop of inode 266 because that inode has inode 261 as an ancestor in the send root and inode 261 is in the circular dependency of delayed renames listed above.

Fix this by not delaying the rename of a directory inode if an ancestor of the inode in the send root, which has a delayed rename operation, is not also a descendent of the inode in the parent root.

Thanks to Robbie Ko for sending the reproducer example. A test case for xfstests follows soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-02  f2fs: fix to return exact trimmed size  (Jaegeuk Kim)
Now, we add all the candidates for trim commands and then finally issue discard commands. So, we should count the trimmed size in the back end. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-02  vfs: read file_handle only once in handle_to_path  (Sasha Levin)
We used to read file_handle twice: once to get the amount of extra bytes, and once to fetch the entire structure. This may be problematic since we do size verifications only after the first read, so if the number of extra bytes changes in userspace between the first and second calls, we'll have an incoherent view of file_handle. Instead, read the constant size once, and copy that over to the final structure without having to re-read it. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
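The fix follows the classic single-read pattern for variable-length user structures; a hedged, kernel-flavored sketch of that pattern (simplified from the description above, not a verbatim copy of the patch):

    /* Read the fixed-size header exactly once, validate it, then
     * copy only the variable payload -- never re-read the header,
     * so userspace cannot change handle_bytes between reads. */
    struct file_handle f_handle;
    struct file_handle *handle;

    if (copy_from_user(&f_handle, ufh, sizeof(f_handle)))
        return -EFAULT;
    if (f_handle.handle_bytes > MAX_HANDLE_SZ)
        return -EINVAL;

    handle = kmalloc(sizeof(*handle) + f_handle.handle_bytes, GFP_KERNEL);
    if (!handle)
        return -ENOMEM;
    *handle = f_handle;                 /* reuse the first read */
    if (copy_from_user(&handle->f_handle,
                       &ufh->f_handle, f_handle.handle_bytes)) {
        kfree(handle);
        return -EFAULT;
    }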
2015-06-02  f2fs: support FALLOC_FL_INSERT_RANGE  (Chao Yu)
The FALLOC_FL_INSERT_RANGE flag for ->fallocate was introduced in commit dd46c787788d ("fs: Add support FALLOC_FL_INSERT_RANGE for fallocate"). The effect of the FALLOC_FL_INSERT_RANGE command is the opposite of FALLOC_FL_COLLAPSE_RANGE: when this command is performed, all data from offset to EOF in our file will be shifted right by the given length, and then the range [offset, offset + length] becomes a hole. This command is useful for users who want to add some data in the middle of a file. For example, a video/music editor can insert a keyframe at a specified position of a media file; with this command we can easily create a hole for the insertion without removing original data. This patch introduces f2fs_insert_range() to support FALLOC_FL_INSERT_RANGE. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
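From userspace the command is reached through fallocate(2); a minimal usage sketch (offset and length typically must be aligned to the filesystem block size, or the call fails with EINVAL):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>

    /* Shift everything from 'offset' onward right by 'len' bytes,
     * leaving a hole of 'len' bytes at 'offset'. */
    static int insert_hole(int fd, off_t offset, off_t len)
    {
        return fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len);
    }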
2015-06-02  f2fs: hide common code in f2fs_replace_block  (Chao Yu)
This patch cleans up the code by: 1. renaming f2fs_replace_block() to __f2fs_replace_block(); 2. introducing a new f2fs_replace_block() that wraps __f2fs_replace_block() together with the common code around it. The newly introduced f2fs_replace_block() can then be used by a following patch. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-02  gfs2: limit quota log messages  (Abhi Das)
This patch makes the quota subsystem report only once that a particular user/group has exceeded their allotted quota. Previously, it was possible for a program to continuously try to exceed its quota (despite receiving EDQUOT) and in turn trigger gfs2 to issue a kernel log message about the quota being exceeded. In theory, this could get out of hand and flood the log and the filesystem hosting the log files. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
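The change amounts to a warn-once latch per quota; a standalone illustration of that pattern (not the gfs2 data structures):

    #include <stdbool.h>
    #include <stdio.h>

    struct quota {
        bool warned;    /* set after the first over-quota message */
    };

    static void note_quota_exceeded(struct quota *q, const char *id)
    {
        if (q->warned)
            return;     /* stay quiet on repeat offenders */
        q->warned = true;
        fprintf(stderr, "quota exceeded for %s\n", id);
    }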
2015-06-02  gfs2: fix quota updates on block boundaries  (Abhi Das)
For smaller block sizes (512B, 1K, 2K), some quotas straddle block boundaries such that the usage value is on one block and the rest of the quota is on the previous block. In such cases, the value does not get updated correctly. This patch fixes that by addressing the boundary conditions correctly. This patch also adds a (s64) cast that was missing in a call to gfs2_quota_change() in inode.c Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-06-02  buffer: remove unused 'ret' variable  (Jens Axboe)
Merge hiccup on my part, due to a clash between the writeback changes and the EOPNOTSUPP removal in _submit_bh(). Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: disassociate inodes from dying bdi_writebacks  (Tejun Heo)
For the purpose of foreign inode detection, wb's (bdi_writeback's) are identified by the associated memcg ID. As we create a separate wb for each memcg, this is enough to identify the active wb's; however, when blkcg is enabled or disabled higher up in the hierarchy, the mapping between memcg and blkcg changes, which in turn creates a new wb to service the new mapping. The old wb is unlinked from the index and released after all references are drained. The foreign inode detection logic can't detect this condition because both the old and new wb's point to the same memcg, and thus it never decides to move inodes attached to the old wb to the new one. This patch adds logic to initiate switching immediately in wbc_attach_and_unlock_inode() if the associated wb is dying. We could make the usual foreign detection logic distinguish the different wb's mapped to the memcg, but the dying wb is never going to be in active service again and there's no point in tracking the usage history and reaching the switch verdict after enough data points are collected. It's already known that the wb has to be switched. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: implement foreign cgroup inode bdi_writeback switching  (Tejun Heo)
As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on a first-use basis, severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (a single foreign page can lead to gigabytes of writeback being incorrectly attributed). To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. The previous patches implemented the foreign condition detection mechanism and laid the groundwork. This patch implements the actual switching. With the previously implemented [locked_]inode_to_wb_and_lock_list() and the wb stat transaction, grabbing wb->list_lock, inode->i_lock and mapping->tree_lock gives us full exclusion against all wb operations on the target inode. inode_switch_wb_work_fn() grabs all the locks and transfers the inode atomically along with its RECLAIMABLE and WRITEBACK stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: add lockdep annotation to inode_to_wb()  (Tejun Heo)
With the previous three patches, all operations which acquire wb from inode are either under one of inode->i_lock, mapping->tree_lock or wb->list_lock, or protected by the unlocked_inode_to_wb transaction. This will be depended upon by foreign inode wb switching.

This patch adds a lockdep assertion to inode_to_wb() so that usages outside the above list of locks can be caught easily. There are three exceptions.

* locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the wb may not be the inode's. Ensuring that is the function's role after all. Updated to deref inode->i_wb directly.

* inode_wb_stat_unlocked_begin() is usually protected by the combination of !I_WB_SWITCH and rcu_read_lock(). Updated to deref inode->i_wb directly.

* inode_congested() wants to test whether inode->i_wb is set before starting the transaction. Added inode_to_wb_is_valid() which tests inode->i_wb directly.

v5: might_lock() removed. It annotates that the lock is grabbed with irq enabled, which isn't the case, and triggers the lockdep warning spuriously.

v4: might_lock() added to unlocked_inode_to_wb_begin().

v3: inode_congested() conversion added.

v2: locked_inode_to_wb_and_lock_list() was missing in the first version.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: use unlocked_inode_to_wb transaction in inode_congested()  (Tejun Heo)
Similar to wb stat updates, inode_congested() accesses the associated wb of an inode locklessly, which will break with foreign inode wb switching. This patch updates inode_congested() to use the unlocked inode wb access transaction introduced by the previous patch. Combined with the previous two patches, this makes all wb list and access operations protected by either inode->i_lock, wb->list_lock, or mapping->tree_lock while wb switching is in progress. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates  (Tejun Heo)
The mechanism for detecting whether an inode should switch its wb (bdi_writeback) association is now in place. This patch builds the framework for the actual switching.

This patch adds a new inode flag I_WB_SWITCHING, which has two functions. First, the easy one, it ensures that there's only one switching in progress for a given inode. Second, it's used as a mechanism to synchronize wb stat updates.

The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters but track the current number of dirty pages and pages under writeback respectively. As such, when an inode is moved from one wb to another, the inode's portion of those stats has to be transferred together; unfortunately, this is a bit tricky as those stat updates are percpu operations which are performed without holding any lock in some places.

This patch solves the problem in a similar way as memcg. Each such lockless stat update is wrapped in a transaction surrounded by unlocked_inode_to_wb_begin/end(). During normal operation, they map to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted, mapping->tree_lock is grabbed across the transaction. In turn, the switching path sets I_WB_SWITCHING and waits for an RCU grace period to pass before actually starting to switch, which guarantees that all stat update paths are synchronizing against mapping->tree_lock. This patch still doesn't implement the actual switching.

v3: Updated on top of the recent cancel_dirty_page() updates. unlocked_inode_to_wb_begin() now nests inside mem_cgroup_begin_page_stat() to match the locking order.

v2: The i_wb access transaction will be used for !stat accesses too. Function names and comments updated accordingly. s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/ s/switch_wb/switch_wbs/

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
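A usage sketch of the transaction described above (kernel-flavored, simplified; the wb stat helper name is illustrative):

    /* The fast path is just an RCU read section; only while
     * I_WB_SWITCHING is set does the transaction also take
     * mapping->tree_lock, serializing against the switch. */
    bool locked;
    struct bdi_writeback *wb;

    wb = unlocked_inode_to_wb_begin(inode, &locked);
    inc_wb_stat(wb, WB_RECLAIMABLE);    /* lockless percpu update */
    unlocked_inode_to_wb_end(inode, locked);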
2015-06-02  writeback: implement [locked_]inode_to_wb_and_lock_list()  (Tejun Heo)
cgroup writeback currently assumes that inode to wb association doesn't change; however, with the planned foreign inode wb switching mechanism, the association will change dynamically. When an inode needs to be put on one of the IO lists of its wb, the current code simply calls inode_to_wb() and locks the returned wb; however, with the planned wb switching, the association may change before locking the wb and may even get released. This patch implements [locked_]inode_to_wb_and_lock_list() which pins the associated wb while holding i_lock, releases it, acquires wb->list_lock and verifies that the association hasn't changed in between. As the association will be protected by both locks among other things, this guarantees that the wb is the inode's associated wb until the list_lock is released. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
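The pin-and-verify loop at the heart of the new helper, in simplified form (close to, but not literally, the kernel code):

    /* Pin the wb under i_lock, take its list_lock, then re-check
     * that the inode still points to the same wb; retry if a
     * switch raced in between. */
    for (;;) {
        spin_lock(&inode->i_lock);
        wb = inode_to_wb(inode);
        wb_get(wb);                     /* pin across the unlock */
        spin_unlock(&inode->i_lock);

        spin_lock(&wb->list_lock);
        if (wb == inode_to_wb(inode)) {
            wb_put(wb);                 /* list_lock now protects it */
            return wb;
        }
        spin_unlock(&wb->list_lock);    /* lost a race: drop and retry */
        wb_put(wb);
    }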
2015-06-02  writeback: implement foreign cgroup inode detection  (Tejun Heo)
As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on a first-use basis, severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (a single foreign page can lead to gigabytes of writeback being incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. To avoid unnecessary oscillation, the detection mechanism keeps track of history and gives out the switch verdict only if the foreign usage pattern is stable over a certain amount of time and/or writeback attempts.

The detection mechanism has fairly low space and computation overhead. It adds 8 bytes to struct inode (one int and two u16's) and a minimal amount of calculation per IO. The detection mechanism converges to the correct answer usually in several seconds of IO time when there's a clear majority dirtier. Even when there isn't, it can reach an acceptable answer fairly quickly under most circumstances. Please see wb_detach_inode() for more details.

This patch only implements detection. Following patches will implement actual switching.

v2: wbc_account_io() now checks whether the wbc is associated with a wb before dereferencing it. This can happen when pageout() is writing pages directly without going through the usual writeback path. As the pageout() path is single-threaded, we don't want it to be blocked behind a slow cgroup and ultimately want it to delegate actual writing to the usual writeback path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
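For reference, the 8-byte per-inode cost mentioned above maps to three fields in struct inode (field names believed to match the upstream patch; the comments here are paraphrased, not the kernel's):

    /* Inside struct inode, under CONFIG_CGROUP_WRITEBACK: */
    int i_wb_frn_winner;    /* id of the recent majority dirtier candidate */
    u16 i_wb_frn_avg_time;  /* running average of writeback round time */
    u16 i_wb_frn_history;   /* bit history of foreign-vs-local verdicts */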
2015-06-02  writeback: make writeback_control track the inode being written back  (Tejun Heo)
Currently, for cgroup writeback, the IO submission paths directly associate the bio's with the blkcg from inode_to_wb_blkcg_css(); however, it'd be necessary to keep more writeback context to implement foreign inode writeback detection. wbc (writeback_control) is the natural fit for the extra context - it persists throughout the writeback of each inode and is passed all the way down to the IO submission paths.

This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and wbc_attach_fdatawrite_inode(), which are used to associate wbc with the inode being written back. IO submission paths now use wbc_init_bio() instead of directly associating bio's with the blkcg themselves. This leaves inode_to_wb_blkcg_css() without any user. The function is removed.

wbc currently only tracks the associated wb (bdi_writeback). Future patches will add more for foreign inode detection. The association is established under i_lock, which will be depended upon when migrating foreign inodes to other wb's. As, currently, the inode to wb association never changes once established, going through wbc when initializing bio's doesn't cause any behavior changes.

v2: submit_blk_blkcg() now checks whether the wbc is associated with a wb before dereferencing it. This can happen when pageout() is writing pages directly without going through the usual writeback path. As the pageout() path is single-threaded, we don't want it to be blocked behind a slow cgroup and ultimately want it to delegate actual writing to the usual writeback path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()  (Tejun Heo)
Currently, the majority of cgroup writeback support, including all the above functions, is implemented in include/linux/backing-dev.h and mm/backing-dev.c; however, the portion closely related to the writeback logic implemented in include/linux/writeback.h and mm/page-writeback.c will expand to support foreign writeback detection and correction. This patch moves wb[_try]_get() and wb_put() to include/linux/backing-dev-defs.h so that they can be used from writeback.h, and moves inode_{attach|detach}_wb() to writeback.h and page-writeback.c. This is pure reorganization and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: move over_bground_thresh() to mm/page-writeback.c  (Tejun Heo)
and rename it to wb_over_bg_thresh(). The function is closely tied to the dirty throttling mechanism implemented in page-writeback.c. This relocation will allow future updates necessary for cgroup writeback support. While at it, add a function comment. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: move global_dirty_limit into wb_domain  (Tejun Heo)
This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each other in. This will enable IO backpressure propagation for cgroup writeback. global_dirty_limit exists to regulate the global dirty threshold which is a property of the wb_domain. This patch moves hard_dirty_limit, dirty_lock, and update_time into wb_domain. This is pure reorganization and doesn't introduce any behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: reorganize [__]wb_update_bandwidth()  (Tejun Heo)
__wb_update_bandwidth() is called from two places - mm/page-writeback.c::balance_dirty_pages() and fs/fs-writeback.c::wb_writeback(). The latter updates only the write bandwidth while the former also deals with the dirty ratelimit. The two callsites are distinguished by whether the @thresh parameter is zero or not, which is cryptic. In addition, the two files define their own different versions of wb_update_bandwidth() on top of __wb_update_bandwidth(), which is confusing to say the least. This patch cleans up [__]wb_update_bandwidth() in the following ways.

* __wb_update_bandwidth() now takes an explicit @update_ratelimit parameter to gate dirty ratelimit handling.

* mm/page-writeback.c::wb_update_bandwidth() is flattened into its caller - balance_dirty_pages().

* fs/fs-writeback.c::wb_update_bandwidth() is moved to mm/page-writeback.c and __wb_update_bandwidth() is made static.

* While at it, add a lockdep assertion to __wb_update_bandwidth().

Except for the lockdep addition, this is pure reorganization and doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: clean up wb_dirty_limit()  (Tejun Heo)
The function name wb_dirty_limit(), its argument @dirty and the local variable @wb_dirty are mortally confusing given that the function calculates per-wb threshold value not dirty pages, especially given that @dirty and @wb_dirty are used elsewhere for dirty pages. Let's rename the function to wb_calc_thresh() and wb_dirty to wb_thresh. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  ext2: enable cgroup writeback support  (Tejun Heo)
Writeback now supports cgroup writeback and the generic writeback, buffer, libfs, and mpage helpers that ext2 uses are all updated to work with cgroup writeback. This patch enables cgroup writeback for ext2 by adding FS_CGROUP_WRITEBACK to its ->fs_flags. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  mpage: make __mpage_writepage() honor cgroup writeback  (Tejun Heo)
__mpage_writepage() is used to implement mpage_writepages() which in turn is used for ->writepages() of various filesystems. All writeback logic is now updated to handle cgroup writeback and the block cgroup to issue IOs for is encoded in writeback_control and can be retrieved from the inode; however, __mpage_writepage() currently ignores the blkcg indicated by the inode and issues all bio's without explicit blkcg association.

This patch updates __mpage_writepage() so that the issued bio's are associated with inode_to_wb_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  buffer, writeback: make __block_write_full_page() honor cgroup writeback  (Tejun Heo)
[__]block_write_full_page() is used to implement ->writepage in various filesystems. All writeback logic is now updated to handle cgroup writeback and the block cgroup to issue IOs for is encoded in writeback_control and can be retrieved from the inode; however, [__]block_write_full_page() currently ignores the blkcg indicated by the inode and issues all bio's without explicit blkcg association.

This patch adds submit_bh_blkcg() which associates the bio with the specified blkio cgroup before issuing and uses it in __block_write_full_page() so that the issued bio's are associated with inode_to_wb_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: dirty inodes against their matching cgroup bdi_writeback's  (Tejun Heo)
__mark_inode_dirty() always dirtied the inode against the root wb (bdi_writeback). The previous patches added all the infrastructure necessary to attribute an inode against the wb of the dirtying cgroup. This patch updates __mark_inode_dirty() so that it uses the wb associated with the inode instead of unconditionally using the root one. Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being dirtied against the root wb.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: make writeback initiation functions handle multiple bdi_writeback's  (Tejun Heo)
[try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only handle dirty inodes on the root wb (bdi_writeback) of the target bdi. This patch implements bdi_split_work_to_wbs() and uses it to make these functions handle multiple wb's.

bdi_split_work_to_wbs() takes a base wb_writeback_work and creates clones of it, issuing them to the wb's of the target bdi. The base work's nr_pages is distributed using wb_split_bdi_pages() - ie. according to each wb's write bandwidth's proportion in the bdi.

Cloning a work item involves memory allocation which may fail. In such cases, bdi_split_work_to_wbs() issues the base work directly and waits for its completion before proceeding to the next wb to guarantee forward progress and correctness under memory pressure.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: restructure try_writeback_inodes_sb[_nr]()  (Tejun Heo)
try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it handles s_umount locking and skips if writeback is already in progress. The in progress test is performed on the root wb (bdi_writeback) which isn't sufficient for cgroup writeback support. The test must be done per-wb. To prepare for the change, this patch factors out __writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds @skip_if_busy and moves the in progress test right before queueing the wb_writeback_work. try_writeback_inodes_sb_nr() now just grabs s_umount and invokes __writeback_inodes_sb_nr() with asserted @skip_if_busy. This way, later addition of multiple wb handling can skip only the wb's which already have writeback in progress. This swaps the order between in progress test and s_umount test which can flip the return value when writeback is in progress and s_umount is being held by someone else but this shouldn't cause any meaningful difference. It's a fringe condition and the return value is an unsynchronized hint anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: implement wb_wait_for_single_work()  (Tejun Heo)
For cgroup writeback, multiple wb_writeback_work items may need to be issued to accomplish a single task. The previous patch updated the waiting mechanism such that wb_wait_for_completion() can wait for multiple work items.

Issuing multiple work items involves memory allocation which may fail. As most writeback operations can't fail or block on memory allocation, in such cases we fall back to sequential issuing of an on-stack work item, which needs to be waited upon sequentially.

This patch implements wb_wait_for_single_work() which waits for a single work item independently from wb_completion waiting, so that such a fallback mechanism can be used without getting tangled with the usual issuing / completion operation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: implement bdi_wait_for_completion()  (Tejun Heo)
The completion of a wb_writeback_work can be waited upon by setting its ->done to a struct completion and waiting on it; however, for cgroup writeback support, it's necessary to issue multiple work items to multiple bdi_writebacks and wait for the completion of all.

This patch implements wb_completion which can wait for multiple work items and replaces the struct completion with it. It can be defined using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and waited for by wb_wait_for_completion().

Nobody currently issues multiple work items and this patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
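The multi-work completion reduces to a counter plus a wait queue; a condensed, kernel-flavored sketch (simplified, not the literal implementation):

    /* One waiter can cover N issued work items: the counter starts
     * at 1 (the waiter's own reference), each queued work adds 1,
     * each completed work drops 1, and the waiter drops its own
     * reference and sleeps until the count hits zero. */
    struct wb_completion {
        atomic_t cnt;
    };

    /* on queueing a work item:   atomic_inc(&done->cnt);            */
    /* on completing a work item: if (atomic_dec_and_test(&done->cnt))
     *                                wake_up_all(&bdi->wb_waitq);   */

    /* waiter: */
    atomic_dec(&done->cnt);     /* drop the initial reference */
    wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));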
2015-06-02  writeback: add wb_writeback_work->auto_free  (Tejun Heo)
Currently, a wb_writeback_work is freed automatically on completion if it doesn't have ->done set. Add wb_writeback_work->auto_free to make the switch explicit. This will help cgroup writeback support where waiting for completion and whether to free automatically don't necessarily move together. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's  (Tejun Heo)
wakeup_dirtytime_writeback() currently only starts writeback on the root wb (bdi_writeback). For cgroup writeback support, update the function to check all wbs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's  (Tejun Heo)
wakeup_flusher_threads() currently only starts writeback on the root wb (bdi_writeback). For cgroup writeback support, update the function to wake up all wbs and distribute the number of pages to write according to the proportion of each wb's write bandwidth, which is implemented in wb_split_bdi_pages(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info  (Tejun Heo)
bdi_start_background_writeback() currently takes @bdi and kicks the root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info  (Tejun Heo)
writeback_in_progress() currently takes @bdi and returns whether writeback is in progress on its root wb (bdi_writeback). In preparation for cgroup writeback support, make it take wb instead. While at it, make it an inline function. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02  writeback: remove bdi_start_writeback()  (Tejun Heo)
bdi_start_writeback() is a thin wrapper on top of __wb_start_writeback() which is used only by laptop_mode_timer_fn(). This patch removes bdi_start_writeback(), renames __wb_start_writeback() to wb_start_writeback() and makes laptop_mode_timer_fn() use it instead. This doesn't cause any functional difference and will ease making laptop_mode_timer_fn() cgroup writeback aware. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>