summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-05-09nfs: don't share mounts between network namespacesJ. Bruce Fields
There's no guarantee that an IP address in a different network namespace actually represents the same endpoint. Also, if we allow unprivileged nfs mounts some day then this might allow an unprivileged user in another network namespace to misdirect somebody else's nfs mounts. If sharing between containers is really what's wanted then that could still be arranged explicitly, for example with bind mounts. Reported-by: "Eric W. Biederman" <ebiederm@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09NFS: Fix an LOCK/OPEN race when unlinking an open fileChuck Lever
At Connectathon 2016, we found that recent upstream Linux clients would occasionally send a LOCK operation with a zero stateid. This appeared to happen in close proximity to another thread returning a delegation before unlinking the same file while it remained open. Earlier, the client received a write delegation on this file and returned the open stateid. Now, as it is getting ready to unlink the file, it returns the write delegation. But there is still an open file descriptor on that file, so the client must OPEN the file again before it returns the delegation. Since commit 24311f884189 ('NFSv4: Recovery of recalled read delegations is broken'), nfs_open_delegation_recall() clears the NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a racing LOCK on the same inode to be put on the wire before the OPEN operation has returned a valid open stateid. To eliminate this race, serialize delegation return with the acquisition of a file lock on the same file. Adopt the same approach as is used in the unlock path. This patch also eliminates a similar race seen when sending a LOCK operation at the same time as returning a delegation on the same file. Fixes: 24311f884189 ('NFSv4: Recovery of recalled read ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna: Add sentence about LOCK / delegation race] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: have flexfiles mirror keep creds for both ro and rw layoutsJeff Layton
A mirror can be shared between multiple layouts, even with different iomodes. That makes stats gathering simpler, but it causes a problem when we get different creds in READ vs. RW layouts. The current code drops the newer credentials onto the floor when this occurs. That's problematic when you fetch a READ layout first, and then a RW. If the READ layout doesn't have the correct creds to do a write, then writes will fail. We could just overwrite the READ credentials with the RW ones, but that would break the ability for the server to fence the layout for reads if things go awry. We need to be able to revert to the earlier READ creds if the RW layout is returned afterward. The simplest fix is to just keep two sets of creds per mirror. One for READ layouts and one for RW, and then use the appropriate set depending on the iomode of the layout segment. Also fix up some RCU nits that sparse found. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: get a reference to the credential in ff_layout_alloc_lsegJeff Layton
We're just as likely to have allocation problems here as we would if we delay looking up the credential like we currently do. Fix the code to get a rpc_cred reference early, as soon as the mirror is set up. This allows us to eliminate the mirror early if there is a problem getting an rpc credential. This also allows us to drop the uid/gid from the layout_mirror struct as well. In the event that we find an existing mirror where this one would go, we swap in the new creds unconditionally, and drop the reference to the old one. Note that the old ff_layout_update_mirror_cred function wouldn't set this pointer unless the DS version was 3, but we don't know what the DS version is at this point. I'm a little unclear on why it did that as you still need creds to talk to v4 servers as well. I have the code set it regardless of the DS version here. Also note the change to using generic creds instead of calling lookup_cred directly. With that change, we also need to populate the group_info pointer in the acred as some functions expect that to never be NULL. Instead of allocating one every time however, we can allocate one when the module is loaded and share it since the group_info is refcounted. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: have ff_layout_get_ds_cred take a reference to the credJeff Layton
In later patches, we're going to want to allow the creds to be updated when we get a new layout with updated creds. Have this function take a reference to the cred that is later put once the call has been dispatched. Also, prepare for this change by ensuring we follow RCU rules when getting a reference to the cred as well. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: don't call nfs4_ff_layout_prepare_ds from ff_layout_get_ds_credJeff Layton
All the callers already call that function before calling into here, so it ends up being a no-op anyway. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09NFS: Save struct inode * inside nfs_commit_info to clarify usage of i_lockDave Wysochanski
Commit ea2cf22 created nfs_commit_info and saved &inode->i_lock inside this NFS specific structure. This obscures the usage of i_lock. Instead, save struct inode * so later it's clear the spinlock taken is i_lock. Should be no functional change. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: add debug to directio "good_bytes" countingWeston Andros Adamson
This will pop a warning if we count too many good bytes. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09pnfs: set NFS_IOHDR_REDO in pnfs_read_resend_pnfsWeston Andros Adamson
Like other resend paths, mark the (old) hdr as NFS_IOHDR_REDO. This ensures the hdr completion function will not count the (old) hdr as good bytes. Also, vector the error back through the hdr->task.tk_status like other retry calls. This fixes a bug with the FlexFiles layout where libaio was reporting more bytes read than requested. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09btrfs: fix int32 overflow in shrink_delalloc().Adam Borowski
UBSAN: Undefined behaviour in fs/btrfs/extent-tree.c:4623:21 signed integer overflow: 10808 * 262144 cannot be represented in type 'int [8]' If 8192<=items<16384, we request a writeback of an insane number of pages which is benign (everything will be written). But if items>=16384, the space reservation won't be enough. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-07get_rock_ridge_filename(): handle malformed NM entriesAl Viro
Payloads of NM entries are not supposed to contain NUL. When we run into such, only the part prior to the first NUL goes into the concatenation (i.e. the directory entry name being encoded by a bunch of NM entries). We do stop when the amount collected so far + the claimed amount in the current NM entry exceed 254. So far, so good, but what we return as the total length is the sum of *claimed* sizes, not the actual amount collected. And that can grow pretty large - not unlimited, since you'd need to put CE entries in between to be able to get more than the maximum that could be contained in one isofs directory entry / continuation chunk and we are stop once we'd encountered 32 CEs, but you can get about 8Kb easily. And that's what will be passed to readdir callback as the name length. 8Kb __copy_to_user() from a buffer allocated by __get_free_page() Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-07f2fs: do not preallocate block unaligned to 4KBJaegeuk Kim
Previously f2fs_preallocate_blocks() tries to allocate unaligned blocks. In f2fs_write_begin(), however, prepare_write_begin() does not skip its allocation due to (len != 4KB). So, it needs locking node page twice unexpectedly. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: read node blocks ahead when truncating blocksJaegeuk Kim
This patch enables reading node blocks in advance when truncating large data blocks. > time rm $MNT/testfile (500GB) after drop_cachees Before : 9.422 s After : 4.821 s Reported-by: Stephen Bates <stephen.bates@microsemi.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: fallocate data blocks in single locked node pageJaegeuk Kim
This patch is to improve the expand_inode speed in fallocate by allocating data blocks as many as possible in single locked node page. In SSD, # time fallocate -l 500G $MNT/testfile Before : 1m 33.410 s After : 24.758 s Reported-by: Stephen Bates <stephen.bates@microsemi.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: fix inode cache leakChao Yu
When testing f2fs with inline_dentry option, generic/342 reports: VFS: Busy inodes after unmount of dm-0. Self-destruct in 5 seconds. Have a nice day... After rmmod f2fs module, kenrel shows following dmesg: ============================================================================= BUG f2fs_inode_cache (Tainted: G O ): Objects remaining in f2fs_inode_cache on __kmem_cache_shutdown() ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Slab 0xf51ca0e0 objects=22 used=1 fp=0xd1e6fc60 flags=0x40004080 CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 00000086 00000086 d062fe18 c13a83a0 f51ca0e0 d062fe38 d062fea4 c11c7276 c1981040 f51ca0e0 00000016 00000001 d1e6fc60 40004080 656a624f 20737463 616d6572 6e696e69 6e692067 66326620 6e695f73 5f65646f 68636163 6e6f2065 Call Trace: [<c13a83a0>] dump_stack+0x5f/0x8f [<c11c7276>] slab_err+0x76/0x80 [<c11cbfc0>] ? __kmem_cache_shutdown+0x100/0x2f0 [<c11cbfc0>] ? __kmem_cache_shutdown+0x100/0x2f0 [<c11cbfe5>] __kmem_cache_shutdown+0x125/0x2f0 [<c1198a38>] kmem_cache_destroy+0x158/0x1f0 [<c176b43d>] ? mutex_unlock+0xd/0x10 [<f8f15aa3>] exit_f2fs_fs+0x4b/0x5a8 [f2fs] [<c10f596c>] SyS_delete_module+0x16c/0x1d0 [<c1001b10>] ? do_fast_syscall_32+0x30/0x1c0 [<c13c59bf>] ? __this_cpu_preempt_check+0xf/0x20 [<c10afa7d>] ? trace_hardirqs_on_caller+0xdd/0x210 [<c10ad50b>] ? trace_hardirqs_off+0xb/0x10 [<c1001b81>] do_fast_syscall_32+0xa1/0x1c0 [<c176d888>] sysenter_past_esp+0x45/0x74 INFO: Object 0xd1e6d9e0 @offset=6624 kmem_cache_destroy f2fs_inode_cache: Slab cache still has objects CPU: 3 PID: 7455 Comm: rmmod Tainted: G B O 4.6.0-rc4+ #16 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 00000286 00000286 d062fef4 c13a83a0 f174b000 d062ff14 d062ff28 c1198ac7 c197fe18 f3c5b980 d062ff20 000d04f2 d062ff0c d062ff0c d062ff14 d062ff14 f8f20dc0 fffffff5 d062e000 d062ff30 f8f15aa3 d062ff7c c10f596c 73663266 Call Trace: [<c13a83a0>] dump_stack+0x5f/0x8f [<c1198ac7>] kmem_cache_destroy+0x1e7/0x1f0 [<f8f15aa3>] exit_f2fs_fs+0x4b/0x5a8 [f2fs] [<c10f596c>] SyS_delete_module+0x16c/0x1d0 [<c1001b10>] ? do_fast_syscall_32+0x30/0x1c0 [<c13c59bf>] ? __this_cpu_preempt_check+0xf/0x20 [<c10afa7d>] ? trace_hardirqs_on_caller+0xdd/0x210 [<c10ad50b>] ? trace_hardirqs_off+0xb/0x10 [<c1001b81>] do_fast_syscall_32+0xa1/0x1c0 [<c176d888>] sysenter_past_esp+0x45/0x74 The reason is: in recovery flow, we use delayed iput mechanism for directory which has recovered dentry block. It means the reference of inode will be held until last dirty dentry page being writebacked. But when we mount f2fs with inline_dentry option, during recovery, dirent may only be recovered into dir inode page rather than dentry page, so there are no chance for us to release inode reference in ->writepage when writebacking last dentry page. We can call paired iget/iput explicityly for inline_dentry case, but for non-inline_dentry case, iput will call writeback_single_inode to write all data pages synchronously, but during recovery, ->writepages of f2fs skips writing all pages, result in losing dirent. This patch fixes this issue by obsoleting old mechanism, and introduce a new dir_list to hold all directory inodes which has recovered datas until finishing recovery. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07fscrypto/f2fs: allow fs-specific key prefix for fs encryptionJaegeuk Kim
This patch allows fscrypto to handle a second key prefix given by filesystem. The main reason is to provide backward compatibility, since previously f2fs used "f2fs:" as a crypto prefix instead of "fscrypt:". Later, ext4 should also provide key_prefix() to give "ext4:". One concern decribed by Ted would be kinda double check overhead of prefixes. In x86, for example, validate_user_key consumes 8 ms after boot-up, which turns out derive_key_aes() consumed most of the time to load specific crypto module. After such the cold miss, it shows almost zero latencies, which treats as a negligible overhead. Note that request_key() detects wrong prefix in prior to derive_key_aes() even. Cc: Ted Tso <tytso@mit.edu> Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: avoid panic when truncating to max filesizeChao Yu
The following panic occurs when truncating inode which has inline xattr to max filesize. [<ffffffffa013d3be>] get_dnode_of_data+0x4e/0x580 [f2fs] [<ffffffffa013aca1>] ? read_node_page+0x51/0x90 [f2fs] [<ffffffffa013ad99>] ? get_node_page.part.34+0xb9/0x170 [f2fs] [<ffffffffa01235b1>] truncate_blocks+0x131/0x3f0 [f2fs] [<ffffffffa01238e3>] f2fs_truncate+0x73/0x100 [f2fs] [<ffffffffa01239d2>] f2fs_setattr+0x62/0x2a0 [f2fs] [<ffffffff811a72c8>] notify_change+0x158/0x300 [<ffffffff8118a42b>] do_truncate+0x6b/0xa0 [<ffffffff8118e539>] ? __sb_start_write+0x49/0x100 [<ffffffff8118a798>] do_sys_ftruncate.constprop.12+0x118/0x170 [<ffffffff8118a82e>] SyS_ftruncate+0xe/0x10 [<ffffffff8169efcf>] tracesys+0xe1/0xe6 [<ffffffffa0139ae0>] get_node_path+0x210/0x220 [f2fs] <ffff880206a89ce8> --[ end trace 5fea664dfbcc6625 ]--- The reason is truncate_blocks tries to truncate all node and data blocks start from specified block offset with value of (max filesize / block size), but actually, our valid max block offset is (max filesize / block size) - 1, so f2fs detects such invalid block offset with BUG_ON in truncation path. This patch lets f2fs skip truncating data which is exceeding max filesize. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: fix incorrect mapping in ->bmapChao Yu
Currently, generic_block_bmap is used in f2fs_bmap, its semantics is when the mapping is been found, return position of target physical block, otherwise return zero. But, previously, when there is no mapping info for specified logical block, f2fs_bmap will map target physical block to a uninitialized variable, which should be wrong. Fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: remove an obsolete variableJaegeuk Kim
This patch removes an obsolete variable used in add_free_nid. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: don't worry about inode leak in evict_inodeJaegeuk Kim
Even if an inode failed to release its blocks, it should be kept in an orphan inode list, so it will be released later. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: shrink size of struct seg_entryChao Yu
Restructure struct seg_entry to eliminate holes in it, after that, in 32-bits machine, it reduces size from 32 bytes to 24 bytes; in 64-bits machine, it reduces size from 56 bytes to 40 bytes. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: reuse get_extent_infoChao Yu
Reuse get_extent_info for readability. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: remove unneeded memset when updating xattrChao Yu
Each of fields in struct f2fs_xattr_entry will be assigned later, so previously we don't need to memset the struct. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: remove unneeded readahead in find_fsync_dnodesChao Yu
In find_fsync_dnodes, get_tmp_page will read dnode page synchronously, previously, ra_meta_page did the same work, which is redundant, remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: retry to truncate blocks in -ENOMEM caseJaegeuk Kim
This patch modifies to retry truncating node blocks in -ENOMEM case. Signed-off-by: Hou Pengyang <houpengyang@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: fix leak of orphan inode objectsJaegeuk Kim
When unmounting filesystem, we should release all the ino entries. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: revisit error handling flowsJaegeuk Kim
This patch fixes a couple of bugs regarding to orphan inodes when handling errors. This tries to - call alloc_nid_done with add_orphan_inode in handle_failed_inode - let truncate blocks in f2fs_evict_inode - not make a bad inode due to i_mode change Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: inject ENOSPC failuresJaegeuk Kim
This patch injects ENOSPC failures. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: inject page allocation failuresJaegeuk Kim
This patch adds page allocation failures. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: inject kmalloc failureJaegeuk Kim
This patch injects kmalloc failure given a fault injection rate. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: add mount option to select fault injection ratioJaegeuk Kim
This patch adds a mount option to select fault ratio. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: use f2fs_grab_cache_page instead of grab_cache_pageJaegeuk Kim
This patch converts grab_cache_page to f2fs_grab_cache_page. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: introduce f2fs_kmalloc to wrap kmallocJaegeuk Kim
This patch adds f2fs_kmalloc. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: add proc entry to show valid block bitmapJaegeuk Kim
This patch adds a new proc entry to show segment information in more detail. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07f2fs: introduce macros for proc entriesJaegeuk Kim
This adds macros to be used multiple proc entries. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07efivarfs: Make efivarfs_file_ioctl() staticPeter Jones
There are no callers except through the file_operations struct below this, so it should be static like everything else here. Signed-off-by: Peter Jones <pjones@redhat.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1462570771-13324-6-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07efi: Merge boolean flag argumentsJulia Lawall
The parameters atomic and duplicates of efivar_init always have opposite values. Drop the parameter atomic, replace the uses of !atomic with duplicates, and update the call sites accordingly. The code using duplicates is slightly reorganized with an 'else', to avoid duplicating the lock code. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Saurabh Sengar <saurabh.truth@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vaishali Thakkar <vaishali.thakkar@oracle.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1462570771-13324-5-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-06GFS2: Refactor gfs2_remove_from_journalBob Peterson
This patch makes two simple changes to function gfs2_remove_from_journal. First, it removes the parameter that specifies the transaction. Since it's always passed in as current->journal_info, we might as well set that in the function rather than passing it in. Second, it changes the meta parameter to use an enum to make the code more clear. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2016-05-06btrfs: don't force mounts to wait for cleaner_kthread to delete one or more ↵Zygo Blaxell
subvolumes During a mount, we start the cleaner kthread first because the transaction kthread wants to wake up the cleaner kthread. We start the transaction kthread next because everything in btrfs wants transactions. We do reloc recovery in the thread that was doing the original mount call once the transaction kthread is running. This means that the cleaner kthread could already be running when reloc recovery happens (e.g. if a snapshot delete was started before a crash). Relocation does not play well with the cleaner kthread, so a mutex was added in commit 5f3164813b90f7dbcb5c3ab9006906222ce471b7 "Btrfs: fix race between balance recovery and root deletion" to prevent both from being active at the same time. If the cleaner kthread is already holding the mutex by the time we get to btrfs_recover_relocation, the mount will be blocked until at least one deleted subvolume is cleaned (possibly more if the mount process doesn't get the lock right away). During this time (which could be an arbitrarily long time on a large/slow filesystem), the mount process is stuck and the filesystem is unnecessarily inaccessible. Fix this by locking cleaner_mutex before we start cleaner_kthread, and unlocking the mutex after mount no longer requires it. This ensures that the mounting process will not be blocked by the cleaner kthread. The cleaner kthread is already prepared for mutex contention and will just go to sleep until the mutex is available. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: ioctl: reorder exclusive op check in RM_DEVDavid Sterba
Move the op exclusivity check before the other code (same as in ADD_DEV). Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: add write protection to SET_FEATURES ioctlDavid Sterba
Perform the want_write check if we get far enough to do any writes. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: fix lock dep warning move scratch super outside of chunk_mutexAnand Jain
Move scratch super outside of the chunk lock to avoid below lockdep warning. The better place to scratch super is in the function btrfs_rm_dev_replace_free_srcdev() just before free_device, which is outside of the chunk lock as well. To reproduce: (fresh boot) mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde mount /dev/sdc /btrfs dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 (get devmgt from https://github.com/asj/devmgt.git) devmgt detach /dev/sde dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 sync btrfs replace start -Brf 3 /dev/sdf /btrfs <-- devmgt attach host7 ====================================================== [ INFO: possible circular locking dependency detected ] 4.6.0-rc2asj+ #1 Not tainted --------------------------------------------------- btrfs/2174 is trying to acquire lock: (sb_writers){.+.+.+}, at: [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0 but task is already holding lock: (&fs_info->chunk_mutex){+.+.+.}, at: [<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs] which lock already depends on the new lock. Chain exists of: sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); lock(&fs_info->chunk_mutex); lock(sb_writers); *** DEADLOCK *** -> #0 (sb_writers){.+.+.+}: [<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0 [<ffffffff810e707e>] lock_acquire+0xbe/0x210 [<ffffffff810df49a>] percpu_down_read+0x4a/0xa0 [<ffffffff812449b4>] __sb_start_write+0xb4/0xf0 [<ffffffff81265534>] mnt_want_write+0x24/0x50 [<ffffffff812508a2>] path_openat+0x952/0x1190 [<ffffffff81252451>] do_filp_open+0x91/0x100 [<ffffffff8123f5cc>] file_open_name+0xfc/0x140 [<ffffffff8123f643>] filp_open+0x33/0x60 [<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs] [<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs] [<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs] [<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs] [<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs] Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()Ashish Samant
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK. page_index should be checked with the same to trigger BUG_ON(). Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06Btrfs: remove BUG_ON()'s in btrfs_map_blockJosef Bacik
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree to not be assholes and panic at any possible chance things are all fucked up. Signed-off-by: Josef Bacik <jbacik@fb.com> [ removed type casts ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06Btrfs: fix divide error upon chunk's stripe_lenLiu Bo
The struct 'map_lookup' uses type int for @stripe_len, while btrfs_chunk_stripe_len() can return a u64 value, and it may end up with @stripe_len being undefined value and it can lead to 'divide error' in __btrfs_map_block(). This changes 'map_lookup' to use type u64 for stripe_len, also right now we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for BTRFS_STRIPE_LEN. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ folded division fix to scrub_raid56_parity ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: sysfs: protect reading label by lockDavid Sterba
If the label setting ioctl races with sysfs label handler, we could get mixed result in the output, part old part new. We should either get the old or new label. The chances to hit this race are low. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: add check to sysfs handler of labelDavid Sterba
Add a sanity check for the fs_info as we will dereference it, similar to what the 'store features' handler does. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: add read-only check to sysfs handler of featuresDavid Sterba
We don't want to trigger the change on a read-only filesystem, similar to what the label handler does. Signed-off-by: David Sterba <dsterba@suse.cz>
2016-05-06btrfs: reuse existing variable in scrub_stripe, reduce stack usageDavid Sterba
The key variable occupies 17 bytes, the key_start is used once, we can simply reuse existing 'key' for that purpose. As the key is not a simple type, compiler doest not do it on itself. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06btrfs: use dynamic allocation for root item in create_subvolDavid Sterba
The size of root item is more than 400 bytes, which is quite a lot of stack space. As we do IO from inside the subvolume ioctls, we should keep the stack usage low in case the filesystem is on top of other layers (NFS, device mapper, iscsi, etc). Reviewed-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>