summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-05-17dax: Remove zeroing from dax_io()Jan Kara
All the filesystems are now zeroing blocks themselves for DAX IO to avoid races between dax_io() and dax_fault(). Remove the zeroing code from dax_io() and add warning to catch the case when somebody unexpectedly returns new or unwritten buffer. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17dax: Remove dead zeroing code from fault handlersJan Kara
Now that all filesystems zero out blocks allocated for a fault handler, we can just remove the zeroing from the handler itself. Also add checks that no filesystem returns to us unwritten or new buffer. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17ext2: Avoid DAX zeroing to corrupt dataJan Kara
Currently ext2 zeroes any data blocks allocated for DAX inode however it still returns them as BH_New. Thus DAX code zeroes them again in dax_insert_mapping() which can possibly overwrite the data that has been already stored to those blocks by a racing dax_io(). Avoid marking pre-zeroed buffers as new. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17ext2: Fix block zeroing in ext2_get_blocks() for DAXJan Kara
When zeroing allocated blocks for DAX, we accidentally zeroed only the first allocated block instead of all of them. So far this problem is hidden by the fact that page faults always need only a single block and DAX write code zeroes blocks again. But the zeroing in DAX code is racy and needs to be removed so fix the zeroing in ext2 to zero all allocated blocks. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17Merge branch 'ovl-fixes' into for-linusAl Viro
Backmerge to resolve a conflict in ovl_lookup_real(); "ovl_lookup_real(): use lookup_one_len_unlocked()" instead, but it was too late in the cycle to rebase.
2016-05-16dax: Remove complete_unwritten argumentJan Kara
Fault handlers currently take complete_unwritten argument to convert unwritten extents after PTEs are updated. However no filesystem uses this anymore as the code is racy. Remove the unused argument. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16DAX: move RADIX_DAX_ definitions to dax.cNeilBrown
These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do. Let's try to maintain the idea that radix-tree simply implements an abstract data type. Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16f2fs: manipulate dirty file inodes when DATA_FLUSH is setJaegeuk Kim
It needs to maintain dirty file inodes only if DATA_FLUSH is set. Otherwise, let's avoid its overhead. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16f2fs: add fault injection to sysfsSheng Yong
This patch introduces a new struct f2fs_fault_info and a global f2fs_fault to save fault injection status. Fault injection entries are created in /sys/fs/f2fs/fault_injection/ during initializing f2fs module. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16f2fs: no need inc dirty pages under inode lockYunlei He
No need inc dirty pages under inode lock Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16f2fs: fix incorrect error path handling in f2fs_move_rehashed_direntsChao Yu
Fix two bugs in error path of f2fs_move_rehashed_dirents: - release dir's inode page if fail to call kmalloc - recover i_current_depth if fail to converting Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16f2fs: fix i_current_depth during inline dentry conversionChao Yu
With below steps, we will see that dentry page becoming unaccessable later. This is because we forget updating i_current_depth in inode during inline dentry conversion, after that, once we failed at somewhere, it will leave i_current_depth as 0 in non-inline directory. Then, during ->lookup, the current_depth value makes all dentry pages in first level invisible. Fix it. 1) mount f2fs with inline_dentry option 2) mkdir dir 3) touch 180 files named [0-179] in dir 4) touch 180 in dir (fail after inline dir conversion) 5) ll dir ls: cannot access /mnt/f2fs/dir/0: No such file or directory ls: cannot access /mnt/f2fs/dir/1: No such file or directory ls: cannot access /mnt/f2fs/dir/2: No such file or directory ls: cannot access /mnt/f2fs/dir/3: No such file or directory ls: cannot access /mnt/f2fs/dir/4: No such file or directory drwxr-xr-x 2 root root 4096 may 13 21:47 ./ drwxr-xr-x 3 root root 4096 may 13 21:46 ../ -????????? ? ? ? ? ? 0 -????????? ? ? ? ? ? 1 -????????? ? ? ? ? ? 10 -????????? ? ? ? ? ? 100 -????????? ? ? ? ? ? 101 -????????? ? ? ? ? ? 102 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16f2fs: correct return value type of f2fs_fill_superSheng Yong
Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-16Merge branch 'efi-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The main changes in this cycle were: - Drop the unused EFI_SYSTEM_TABLES efi.flags bit and ensure the ARM/arm64 EFI System Table mapping is read-only (Ard Biesheuvel) - Add a comment to explain that one of the code paths in the x86/pat code is only executed for EFI boot (Matt Fleming) - Improve Secure Boot status checks on arm64 and handle unexpected errors (Linn Crosetto) - Remove the global EFI memory map variable 'memmap' as the same information is already available in efi::memmap (Matt Fleming) - Add EFI Memory Attribute table support for ARM/arm64 (Ard Biesheuvel) - Add EFI GOP framebuffer support for ARM/arm64 (Ard Biesheuvel) - Add EFI Bootloader Control driver for storing reboot(2) data in EFI variables for consumption by bootloaders (Jeremy Compostella) - Add Core EFI capsule support (Matt Fleming) - Add EFI capsule char driver (Kweh, Hock Leong) - Unify EFI memory map code for ARM and arm64 (Ard Biesheuvel) - Add generic EFI support for detecting when firmware corrupts CPU status register bits (like IRQ flags) when performing EFI runtime service calls (Mark Rutland) ... and other misc cleanups" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) efivarfs: Make efivarfs_file_ioctl() static efi: Merge boolean flag arguments efi/capsule: Move 'capsule' to the stack in efi_capsule_supported() efibc: Fix excessive stack footprint warning efi/capsule: Make efi_capsule_pending() lockless efi: Remove unnecessary (and buggy) .memmap initialization from the Xen EFI driver efi/runtime-wrappers: Remove ARCH_EFI_IRQ_FLAGS_MASK #ifdef x86/efi: Enable runtime call flag checking arm/efi: Enable runtime call flag checking arm64/efi: Enable runtime call flag checking efi/runtime-wrappers: Detect firmware IRQ flag corruption efi/runtime-wrappers: Remove redundant #ifdefs x86/efi: Move to generic {__,}efi_call_virt() arm/efi: Move to generic {__,}efi_call_virt() arm64/efi: Move to generic {__,}efi_call_virt() efi/runtime-wrappers: Add {__,}efi_call_virt() templates efi/arm-init: Reserve rather than unmap the memory map for ARM as well efi: Add misc char driver interface to update EFI firmware x86/efi: Force EFI reboot to process pending capsules efi: Add 'capsule' update support ...
2016-05-16namei: Improve hash mixing if CONFIG_DCACHE_WORD_ACCESSGeorge Spelvin
The hash mixing between adding the next 64 bits of name was just a bit weak. Replaced with a still very fast but slightly more effective mixing function. Signed-off-by: George Spelvin <linux@horizon.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-16Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516David Sterba
# Conflicts: # include/uapi/linux/btrfs.h
2016-05-16Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516David Sterba
2016-05-16Merge branch 'cleanups-4.7' into for-chris-4.7-20160516David Sterba
2016-05-16btrfs: fix memory leak during RAID 5/6 device replacementScott Talbert
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never freed anywhere. Signed-off-by: Scott Talbert <scott.talbert@hgst.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next' because we no longer have a per-netns conntrack hash. The ip_gre.c conflict as well as the iwlwifi ones were cases of overlapping changes. Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/tx.c net/ipv4/ip_gre.c net/netfilter/nf_conntrack_core.c Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Overlayfs fixes from Miklos, assorted fixes from me. Stable fodder of varying severity, all sat in -next for a while" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ovl: ignore permissions on underlying lookup vfs: add lookup_hash() helper vfs: rename: check backing inode being equal vfs: add vfs_select_inode() helper get_rock_ridge_filename(): handle malformed NM entries ecryptfs: fix handling of directory opening atomic_open(): fix the handling of create_error fix the copy vs. map logics in blk_rq_map_user_iov() do_splice_to(): cap the size before passing to ->splice_read()
2016-05-13Merge branch 'for-4.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "During v4.6-rc1 cgroup namespace support was merged. There is an issue where it's impossible to tell whether a given cgroup mount point is bind mounted or namespaced. Serge has been working on the issue but it took longer than expected to resolve, so the late pull request. Given that it's a completely new feature and the patches don't touch anything else, the risk seems acceptable. However, if this is too late, an alternative is plugging new cgroup ns creation for v4.6 and retrying for v4.7" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix compile warning kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces kernfs_path_from_node_locked: don't overwrite nlen
2016-05-13svcrdma: Do not add XDR padding to xdr_buf page vectorChuck Lever
An xdr_buf has a head, a vector of pages, and a tail. Each RPC request is presented to the NFS server contained in an xdr_buf. The RDMA transport would like to supply the NFS server with only the NFS WRITE payload bytes in the page vector. In some common cases, that would allow the NFS server to swap those pages right into the target file's page cache. Have the transport's RDMA Read logic put XDR pad bytes in the tail iovec, and not in the pages that hold the data payload. The NFSv3 WRITE XDR decoder is finicky about the lengths involved, so make sure it is looking in the correct places when computing the total length of the incoming NFS WRITE request. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13nfsd: handle seqid wraparound in nfsd4_preprocess_layout_stateidJeff Layton
Move the existing static function to an inline helper, and call it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-05-13crash_dump: Add vmcore_elf32_check_archDaniel Wagner
parse_crash_elf{32|64}_headers will check the headers via the elf_check_arch respectively vmcore_elf64_check_arch macro. The MIPS architecture implements those two macros differently. In order to make the differentiation more explicit, let's introduce an vmcore_elf32_check_arch to allow the archs to overwrite it. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Suggested-by: Maciej W. Rozycki <macro@imgtec.com> Reviewed-by: Maciej W. Rozycki <macro@imgtec.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/12535/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2016-05-13ext4: pre-zero allocated blocks for DAX IOJan Kara
Currently ext4 treats DAX IO the same way as direct IO. I.e., it allocates unwritten extents before IO is done and converts unwritten extents afterwards. However this way DAX IO can race with page fault to the same area: ext4_ext_direct_IO() dax_fault() dax_io() get_block() - allocates unwritten extent copy_from_iter_pmem() get_block() - converts unwritten block to written and zeroes it out ext4_convert_unwritten_extents() So data written with DAX IO gets lost. Similarly dax_new_buf() called from dax_io() can overwrite data that has been already written to the block via mmap. Fix the problem by using pre-zeroed blocks for DAX IO the same way as we use them for DAX mmap. The downside of this solution is that every allocating write writes each block twice (once zeros, once data). Fixing the race with locking is possible as well however we would need to lock-out faults for the whole range written to by DAX IO. And that is not easy to do without locking-out faults for the whole file which seems too aggressive. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13ext4: refactor direct IO codeJan Kara
Currently ext4 direct IO handling is split between ext4_ext_direct_IO() and ext4_ind_direct_IO(). However the extent based function calls into the indirect based one for some cases and for example it is not able to handle file extending. Previously it was not also properly handling retries in case of ENOSPC errors. With DAX things would get even more contrieved so just refactor the direct IO code and instead of indirect / extent split do the split to read vs writes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13ext4: fix race in transient ENOSPC detectionJan Kara
When there are blocks to free in the running transaction, block allocator can return ENOSPC although the filesystem has some blocks to free. We use ext4_should_retry_alloc() to force commit of the current transaction and return whether anything was committed so that it makes sense to retry the allocation. However the transaction may get committed after block allocation fails but before we call ext4_should_retry_alloc(). So ext4_should_retry_alloc() returns false because there is nothing to commit and we wrongly return ENOSPC. Fix the race by unconditionally returning 1 from ext4_should_retry_alloc() when we tried to commit a transaction. This should not add any unnecessary retries since we had a transaction running a while ago when trying to allocate blocks and we want to retry the allocation once that transaction has committed anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13ext4: handle transient ENOSPC properly for DAXJan Kara
ext4_dax_get_blocks() was accidentally omitted fixing get blocks handlers to properly handle transient ENOSPC errors. Fix it now to use ext4_get_blocks_trans() helper which takes care of these errors. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13dax: call get_blocks() with create == 1 for write faults to unwritten extentsJan Kara
Currently, __dax_fault() does not call get_blocks() callback with create argument set, when we got back unwritten extent from the initial get_blocks() call during a write fault. This is because originally filesystems were supposed to convert unwritten extents to written ones using complete_unwritten() callback. Later this was abandoned in favor of using pre-zeroed blocks however the condition whether get_blocks() needs to be called with create == 1 remained. Fix the condition so that filesystems are not forced to zero-out and convert unwritten extents when get_blocks() is called with create == 0 (which introduces unnecessary overhead for read faults and can be problematic as the filesystem may possibly be read-only). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-12jfs: Switch to generic xattr handlersAndreas Gruenbacher
This is mostly the same as on other filesystems except for attribute names with an "os2." prefix: for those, the prefix is not stored on disk, and on-attribute names without a prefix have "os2." added. As on several other filesystems, the underlying function for setting/removing xattrs (__jfs_setxattr) removes attributes when the value is NULL, so the set xattr handlers will work as expected. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12jfs: Clean up xattr name mappingAndreas Gruenbacher
Instead of stripping "os2." prefixes in __jfs_setxattr, make callers strip them, as __jfs_getxattr already does. With that change, use the same name mapping function in jfs_{get,set,remove}xattr. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12gfs2: Switch to generic xattr handlersAl Viro
Switch to the generic xattr handlers and take the necessary glocks at the layer below. The following are the new xattr "entry points"; they are called with the glock held already in the following cases: gfs2_xattr_get: From SELinux, during lookups. gfs2_xattr_set: The glock is never held. gfs2_get_acl: From gfs2_create_inode -> posix_acl_create and gfs2_setattr -> posix_acl_chmod. gfs2_set_acl: From gfs2_setattr -> posix_acl_chmod. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-13Btrfs: add semaphore to synchronize direct IO writes with fsyncFilipe Manana
Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: implement unlocked dio write"), we started having races between such writes with concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents". The races happened because the direct IO path, like every other write path, does create extent maps followed by the corresponding ordered extents while the fast fsync path collected first ordered extents and then it collected extent maps. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done). The two fixes mentioned before added a solution that consists of making the direct IO path create first the ordered extents and then the extent maps, while the fsync path attempts to collect any new ordered extents once it collects the extent maps. This was simple and did not require adding any synchonization primitive to any data structure (struct btrfs_inode for example) but it makes things more fragile for future development endeavours and adds an exceptional approach compared to the other write paths. This change adds a read-write semaphore to the btrfs inode structure and makes the direct IO path create the extent maps and the ordered extents while holding read access on that semaphore, while the fast fsync path collects extent maps and ordered extents while holding write access on that semaphore. The logic for direct IO write path is encapsulated in a new helper function that is used both for cow and nocow direct IO writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13Btrfs: fix race between block group relocation and nocow writesFilipe Manana
Relocation of a block group waits for all existing tasks flushing dellaloc, starting direct IO writes and any ordered extents before starting the relocation process. However for direct IO writes that end up doing nocow (inode either has the flag nodatacow set or the write is against a prealloc extent) we have a short time window that allows for a race that makes relocation proceed without waiting for the direct IO write to complete first, resulting in data loss after the relocation finishes. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) direct IO write starts against an extent in block group X using nocow mode (inode has the nodatacow flag or the write is for a prealloc extent) btrfs_direct_IO() btrfs_get_blocks_direct() --> can_nocow_extent() returns 1 btrfs_inc_block_group_ro(bg X) --> turns block group into RO mode btrfs_wait_ordered_roots() --> returns and does not know about the DIO write happening at CPU 2 (the task there has not created yet an ordered extent) relocate_block_group(bg X) --> rc->stage == MOVE_DATA_EXTENTS find_next_extent() --> returns extent that the DIO write is going to write to relocate_data_extent() relocate_file_extent_cluster() --> reads the extent from disk into pages belonging to the relocation inode and dirties them --> creates DIO ordered extent btrfs_submit_direct() --> submits bio against a location on disk obtained from an extent map before the relocation started btrfs_wait_ordered_range() --> writes all the pages read before to disk (belonging to the relocation inode) relocation finishes bio completes and wrote new data to the old location of the block group So fix this by tracking the number of nocow writers for a block group and make sure relocation waits for that number to go down to 0 before starting to move the extents. The same race can also happen with buffered writes in nocow mode since the patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes when relocating", because we are no longer flushing all delalloc which served as a synchonization mechanism (due to page locking) and ensured the ordered extents for nocow buffered writes were created before we called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow mode existed before that patch (no pages are locked or used during direct IO) and that fixed only races with direct IO writes that do cow. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13Btrfs: fix race between fsync and direct IO writes for prealloc extentsFilipe Manana
When we do a direct IO write against a preallocated extent (fallocate) that does not go beyond the i_size of the inode, we do the write operation without holding the inode's i_mutex (an optimization that landed in commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows for a very tiny time window where a race can happen with a concurrent fsync using the fast code path, as the direct IO write path creates first a new extent map (no longer flagged as a prealloc extent) and then it creates the ordered extent, while the fast fsync path first collects ordered extents and then it collects extent maps. This allows for the possibility of the fast fsync path to collect the new extent map without collecting the new ordered extent, and therefore logging an extent item based on the extent map without waiting for the ordered extent to be created and complete. This can result in a situation where after a log replay we end up with an extent not marked anymore as prealloc but it was only partially written (or not written at all), exposing random, stale or garbage data corresponding to the unwritten pages and without any checksums in the csum tree covering the extent's range. This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix race between fsync and lockless direct IO writes"). So fix this by creating first the ordered extent and then the extent map, so that this way if the fast fsync patch collects the new extent map it also collects the corresponding ordered extent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-13Btrfs: fix number of transaction units for renames with whiteoutFilipe Manana
When we do a rename with the whiteout flag, we need to create the whiteout inode, which in the worst case requires 5 transaction units (1 inode item, 1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the number of transaction units from 11 to 16 if the whiteout flag is set. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: pin logs earlier when doing a rename exchange operationFilipe Manana
The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(), which had a race fixed by my previous patch titled "Btrfs: pin log earlier when renaming", and so it suffers from the same problem. We pin the logs of the affected roots after we insert the new inode references, leaving a time window where concurrent tasks logging the inodes can end up logging both the new and old references, resulting in log trees that when replayed can turn the metadata into inconsistent states. This behaviour was added to btrfs_rename() in 2009 without any explanation about why not pinning the logs earlier, just leaving a comment about the posibility for the race. As of today it's perfectly safe and sane to pin the logs before we start doing any of the steps involved in the rename operation. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: unpin logs if rename exchange operation failsFilipe Manana
If rename exchange operations fail at some point after we pinned any of the logs, we end up aborting the current transaction but never unpin the logs, which leaves concurrent tasks that are trying to sync the logs (as part of an fsync request from user space) blocked forever and preventing the filesystem from being unmountable. Fix this by safely unpinning the log. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: fix inode leak on failure to setup whiteout inode in renameFilipe Manana
If we failed to fully setup the whiteout inode during a rename operation with the whiteout flag, we ended up leaking the inode, not decrementing its link count nor removing all its items from the fs/subvol tree. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUTDan Fuhry
Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new behavior in the renameat2() syscall. This behavior is primarily used by overlayfs. This patch adds support for these flags to btrfs, enabling it to be used as a fully functional upper layer for overlayfs. RENAME_EXCHANGE support was written by Davide Italiano originally submitted on 2 April 2015. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Dan Fuhry <dfuhry@datto.com> [ remove unlikely ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: pin log earlier when renamingFilipe Manana
We were pinning the log right after the first step in the rename operation (inserting inode ref for the new name in the destination directory) instead of doing it before. This behaviour was introduced in 2009 for some reason that was not mentioned neither on the changelog nor any comment, with the drawback of a small time window where concurrent log writers can end up logging the new inode reference for the inode we are renaming while the rename operation is in progress (so that we can end up with a log containing both the new and old references). As of today there's no reason to not pin the log before that first step anymore, so just fix this. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: unpin log if rename operation failsFilipe Manana
If rename operations fail at some point after we pinned the log, we end up aborting the current transaction but never unpin the log, which leaves concurrent tasks that are trying to sync the log (as part of an fsync request from user space) blocked forever and preventing the filesystem from being unmountable. Fix this by safely unpinning the log. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: don't do unnecessary delalloc flushes when relocatingFilipe Manana
Before we start the actual relocation process of a block group, we do calls to flush delalloc of all inodes and then wait for ordered extents to complete. However we do these flush calls just to make sure we don't race with concurrent tasks that have actually already started to run delalloc and have allocated an extent from the block group we want to relocate, right before we set it to readonly mode, but have not yet created the respective ordered extents. The flush calls make us wait for such concurrent tasks because they end up calling filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() -> btrfs_run_delalloc_work()) which ends up serializing us with those tasks due to attempts to lock the same pages (and the delalloc flush procedure calls the allocator and creates the ordered extents before unlocking the pages). These flushing calls not only make us waste time (cpu, IO) but also reduce the chances of writing larger extents (applications might be writing to contiguous ranges and we flush before they finish dirtying the whole ranges). So make sure we don't flush delalloc and just wait for concurrent tasks that have already started flushing delalloc and have allocated an extent from the block group we are about to relocate. This change also ends up fixing a race with direct IO writes that makes relocation not wait for direct IO ordered extents. This race is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) starts direct IO write, target inode currently has no ordered extents ongoing nor dirty pages (delalloc regions), therefore the root for our inode is not in the list fs_info->ordered_roots btrfs_direct_IO() __blockdev_direct_IO() btrfs_get_blocks_direct() btrfs_lock_extent_direct() locks range in the io tree btrfs_new_extent_direct() btrfs_reserve_extent() --> extent allocated from bg X btrfs_inc_block_group_ro(bg X) btrfs_start_delalloc_roots() __start_delalloc_inodes() --> does nothing, no dealloc ranges in the inode's io tree so the inode's root is not in the list fs_info->delalloc_roots btrfs_wait_ordered_roots() --> does not find the inode's root in the list fs_info->ordered_roots --> ends up not waiting for the direct IO write started by the task at CPU 2 relocate_block_group(rc->stage == MOVE_DATA_EXTENTS) prepare_to_relocate() btrfs_commit_transaction() iterates the extent tree, using its commit root and moves extents into new locations btrfs_add_ordered_extent_dio() --> now a ordered extent is created and added to the list root->ordered_extents and the root added to the list fs_info->ordered_roots --> this is too late and the task at CPU 1 already started the relocation btrfs_commit_transaction() btrfs_finish_ordered_io() btrfs_alloc_reserved_file_extent() --> adds delayed data reference for the extent allocated from bg X relocate_block_group(rc->stage == UPDATE_DATA_PTRS) prepare_to_relocate() btrfs_commit_transaction() --> delayed refs are run, so an extent item for the allocated extent from bg X is added to extent tree --> commit roots are switched, so the next scan in the extent tree will see the extent item sees the extent in the extent tree When this happens the relocation produces the following warning when it finishes: [ 7260.832836] ------------[ cut here ]------------ [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1 [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000 [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000 [ 7260.852998] Call Trace: [ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90 [ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2 [ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c [ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] [ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs] [ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19 [ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs] [ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs] [ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63 [ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44 [ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc [ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34 [ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be [ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71 [ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79 [ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- This is because at the end of the first stage, in relocate_block_group(), we commit the current transaction, which makes delayed refs run, the commit roots are switched and so the second stage will find the extent item that the ordered extent added to the delayed refs. But this extent was not moved (ordered extent completed after first stage finished), so at the end of the relocation our block group item still has a positive used bytes counter, triggering a warning at the end of btrfs_relocate_block_group(). Later on when trying to read the extent contents from disk we hit a BUG_ON() due to the inability to map a block with a logical address that belongs to the block group we relocated and is no longer valid, resulting in the following trace: [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096 [ 7344.887518] ------------[ cut here ]------------ [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1 [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000 [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001 [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000 [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000 [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0 [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000 [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0 [ 7344.888431] Stack: [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950 [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000 [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000 [ 7344.888431] Call Trace: [ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs] [ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs] [ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf [ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs] [ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10 [ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd [ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs] [ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc [ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207 [ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207 [ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154 [ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f [ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1 [ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d [ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2 [ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e [ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0 [ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP <ffff8802046878f0> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13Btrfs: don't wait for unrelated IO to finish before relocationFilipe Manana
Before the relocation process of a block group starts, it sets the block group to readonly mode, then flushes all delalloc writes and then finally it waits for all ordered extents to complete. This last step includes waiting for ordered extents destinated at extents allocated in other block groups, making us waste unecessary time. So improve this by waiting only for ordered extents that fall into the block group's range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2016-05-13Btrfs: fix empty symlink after creating symlink and fsync parent dirFilipe Manana
If we create a symlink, fsync its parent directory, crash/power fail and mount the filesystem, we end up with an empty symlink, which not only is useless it's also not allowed in linux (the man page symlink(2) is well explicit about that). So we just need to make sure to fully log an inode if it's a symlink, to ensure its inline extent gets logged, ensuring the same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ sync $ ln -s /mnt/foo /mnt/testdir/bar $ xfs_io -c fsync /mnt/testdir <power fail> $ mount /dev/sdb /mnt $ readlink /mnt/testdir/bar <empty string> A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-13Btrfs: fix for incorrect directory entries after fsync log replayFilipe Manana
If we move a directory to a new parent and later log that parent and don't explicitly log the old parent, when we replay the log we can end up with entries for the moved directory in both the old and new parent directories. Besides being ilegal to have directories with multiple hard links in linux, it also resulted in the leaving the inode item with a link count of 1. A similar issue also happens if we move a regular file - after the log tree is replayed the file has a link in both the old and new parent directories, when it should be only at the new directory. Sample reproducer: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/x $ mkdir /mnt/y $ touch /mnt/x/foo $ mkdir /mnt/y/z $ sync $ ln /mnt/x/foo /mnt/x/bar $ mv /mnt/y/z /mnt/x/z < power fail > $ mount /dev/sdc /mnt $ ls -1Ri /mnt /mnt: 257 x 258 y /mnt/x: 259 bar 259 foo 260 z /mnt/x/z: /mnt/y: 260 z /mnt/y/z: $ umount /dev/sdc $ btrfs check /dev/sdc Checking filesystem on /dev/sdc UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be checking extents checking free space cache checking fs roots root 5 inode 260 errors 2000, link count wrong unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0 unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0 (...) Attempting to remove the directory becomes impossible: $ mount /dev/sdc /mnt $ rmdir /mnt/y/z $ ls -lh /mnt/y ls: cannot access /mnt/y/z: No such file or directory total 0 d????????? ? ? ? ? ? z $ rmdir /mnt/x/z rmdir: failed to remove ‘/mnt/x/z’: Stale file handle $ ls -lh /mnt/x ls: cannot access /mnt/x/z: Stale file handle total 0 -rw-r--r-- 2 root root 0 Apr 6 18:06 bar -rw-r--r-- 2 root root 0 Apr 6 18:06 foo d????????? ? ? ? ? ? z So make sure that on rename we set the last_unlink_trans value for our inode, even if it's a directory, to the value of the current transaction's ID and that if the new parent directory is logged that we fallback to a transaction commit. A test case for fstests is being submitted as well. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-05-12ext4: switch to ->iterate_shared()Al Viro
Note that we need relax_dir() equivalent for directories locked shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12hfs: switch to ->iterate_shared()Al Viro
exact parallel of hfsplus analogue Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-12hfsplus: switch to ->iterate_shared()Al Viro
We need to protect the list of hfsplus_readdir_data against parallel insertions (in readdir) and removals (in release). Add a spinlock for that. Note that it has nothing to do with protection of hfsplus_readdir_data->key - we have an exclusion between hfsplus_readdir() and hfsplus_delete_cat() on directory lock and between several hfsplus_readdir() for the same struct file on ->f_pos_lock. The spinlock is strictly for list changes. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>