summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-05-11f2fs: avoid f2fs_bug_on during recoveryJaegeuk Kim
We don't need to use f2fs_bug_on() to treat with any error case when allocating a block during recovery. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11f2fs: show # of orphan inodesJaegeuk Kim
This adds debug information for # of orphan inodes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11f2fs: support in batch fzero in dnode pageChao Yu
This patch tries to speedup fzero_range by making space preallocation and address removal of blocks in one dnode page as in batch operation. In virtual machine, with zram driver: dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=4096 time xfs_io -f /mnt/f2fs/file -c "fzero 0 4096M" Before: real 0m3.276s user 0m0.008s sys 0m3.260s After: real 0m1.568s user 0m0.000s sys 0m1.564s Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: consider ENOSPC case] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11f2fs: support in batch multi blocks preallocationChao Yu
This patch introduces reserve_new_blocks to make preallocation of multi blocks as in batch operation, so it can avoid lots of redundant operation, result in better performance. In virtual machine, with rotational device: time fallocate -l 32G /mnt/f2fs/file Before: real 0m4.584s user 0m0.000s sys 0m4.580s After: real 0m0.292s user 0m0.000s sys 0m0.272s In x86, with SSD: time fallocate -l 500G $MNT/testfile Before : 24.758 s After : 1.604 s Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix bugs and add performance numbers measured in x86.] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11f2fs: make atomic/volatile operation exclusiveChao Yu
atomic/volatile ioctl interfaces are exposed to user like other file operation interface, it needs to make them getting exclusion against to each other to avoid potential conflict among these operations in concurrent scenario. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11f2fs: use mnt_{want,drop}_write_file in ioctlChao Yu
In interfaces of ioctl, mnt_{want,drop}_write_file should be used for: - get exclusion against file system freezing which may used by lvm snapshot. - do telling filesystem that a write is about to be performed on it, and make sure that the writes are permitted. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-11Merge branch 'ovl-fixes' into for-linusAl Viro
2016-05-10ovl: ignore permissions on underlying lookupMiklos Szeredi
Generally permission checking is not necessary when overlayfs looks up a dentry on one of the underlying layers, since search permission on base directory was already checked in ovl_permission(). More specifically using lookup_one_len() causes a problem when the lower directory lacks search permission for a specific user while the upper directory does have search permission. Since lookups are cached, this causes inconsistency in behavior: success depends on who did the first lookup. So instead use lookup_hash() which doesn't do the permission check. Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-10vfs: add lookup_hash() helperMiklos Szeredi
Overlayfs needs lookup without inode_permission() and already has the name hash (in form of dentry->d_name on overlayfs dentry). It also doesn't support filesystems with d_op->d_hash() so basically it only needs the actual hashed lookup from lookup_one_len_unlocked() So add a new helper that does unlocked lookup of a hashed name. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-10vfs: rename: check backing inode being equalMiklos Szeredi
If a file is renamed to a hardlink of itself POSIX specifies that rename(2) should do nothing and return success. This condition is checked in vfs_rename(). However it won't detect hard links on overlayfs where these are given separate inodes on the overlayfs layer. Overlayfs itself detects this condition and returns success without doing anything, but then vfs_rename() will proceed as if this was a successful rename (detach_mounts(), d_move()). The correct thing to do is to detect this condition before even calling into overlayfs. This patch does this by calling vfs_select_inode() to get the underlying inodes. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
2016-05-10vfs: add vfs_select_inode() helperMiklos Szeredi
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
2016-05-10f2fs: switch to ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10afs: switch to ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10befs: switch to ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10befs: constify stuff a bitAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-10Btrfs: fix fspath error deallocationVincent Stehlé
Make sure to deallocate fspath with vfree() in case of error in init_ipath(). fspath is allocated with vmalloc() in init_data_container() since commit 425d17a290c0 ("Btrfs: use larger limit for translation of logical to inode"). Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10btrfs: make find_workspace warn if there are no workspacesDavid Sterba
Be verbose if there are no workspaces at all, ie. the module init time preallocation failed. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10btrfs: make find_workspace always succeedDavid Sterba
With just one preallocated workspace we can guarantee forward progress even if there's no memory available for new workspaces. The cost is more waiting but we also get rid of several error paths. On average, there will be several idle workspaces, so the waiting penalty won't be so bad. In the worst case, all cpus will compete for one workspace until there's some memory. Attempts to allocate a new one are done each time the waiters are woken up. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10btrfs: preallocate compression workspacesDavid Sterba
Preallocate one workspace for each compression type so we can guarantee forward progress in the worst case. A failure cannot be a hard error as we might not use compression at all on the filesystem. If we can't allocate the workspaces later when need them, it might actually deadlock, but in such situation the system has effectively not enough memory to operate properly. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10btrfs: rename and document compression workspace membersDavid Sterba
The names are confusing, pick more fitting names and add comments. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10btrfs: GFP_NOFS does not GFP_HIGHMEMDavid Sterba
Masking HIGHMEM out of NOFS does not make sense. Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-10btrfs: switch to common message helpers in open_ctree, adjust messagesDavid Sterba
Currently we lack the identification of the filesystem in most if not all mount messages, done via printk/pr_* functions. We can use the btrfs_* helpers in open_ctree, as the fs_info <-> sb link is established at the beginning of the function. The messages have been updated at the same time to be more consistent: * dropped sb->s_id, as it's not available via btrfs_* * added %d for return code where appropriate * wording changed * %Lx replaced by %llx Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-09Revert "proc/base: make prompt shell start from new line after executing ↵Robin Humble
"cat /proc/$pid/wchan"" This reverts the 4.6-rc1 commit 7e2bc81da333 ("proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan") because it breaks /proc/$PID/whcan formatting in ps and top. Revert also because the patch is inconsistent - it adds a newline at the end of only the '0' wchan, and does not add a newline when /proc/$PID/wchan contains a symbol name. eg. $ ps -eo pid,stat,wchan,comm PID STAT WCHAN COMMAND ... 1189 S - dbus-launch 1190 Ssl 0 dbus-daemon 1198 Sl 0 lightdm 1299 Ss ep_pol systemd 1301 S - (sd-pam) 1304 Ss wait sh Signed-off-by: Robin Humble <plaguedbypenguins@gmail.com> Cc: Minfei Huang <mnfhuang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
In netdevice.h we removed the structure in net-next that is being changes in 'net'. In macsec.c and rtnetlink.c we have overlaps between fixes in 'net' and the u64 attribute changes in 'net-next'. The mlx5 conflicts have to do with vxlan support dependencies. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09isofs: switch to ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespacesSerge E. Hallyn
Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09get_acorn_filename(): deobfuscate a bitAl Viro
Lots of Idiotic Silly Parentheses is -> that way... What that condition checks is that there's exactly 32 bytes between the end of name and the end of entire drectory record. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09btrfs: switch to ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09logfs: no need to lock directory in lseekAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09switch ecryptfs to ->iterate_sharedAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09Merge branch 'for-linus' into work.lookupsAl Viro
2016-05-099p: switch to ->iterate_shared()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09fat: switch to ->iterate_shared()Al Viro
... and make that weird ioctl lock directory only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09romfs, squashfs: switch to ->iterate_shared()Al Viro
don't need to lock directory in ->llseek(), either Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09more trivial ->iterate_shared conversionsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09kernfs: no point locking directory around that generic_file_llseek()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09configfs_readdir(): make safe under shared lockAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09nfs: per-name sillyunlink exclusionAl Viro
use d_alloc_parallel() for sillyunlink/lookup exclusion and explicit rwsem (nfs_rmdir() being a writer and nfs_call_unlink() - a reader) for rmdir/sillyunlink one. That ought to make lookup/readdir/!O_CREAT atomic_open really parallel on NFS. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09nfs: don't share mounts between network namespacesJ. Bruce Fields
There's no guarantee that an IP address in a different network namespace actually represents the same endpoint. Also, if we allow unprivileged nfs mounts some day then this might allow an unprivileged user in another network namespace to misdirect somebody else's nfs mounts. If sharing between containers is really what's wanted then that could still be arranged explicitly, for example with bind mounts. Reported-by: "Eric W. Biederman" <ebiederm@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09NFS: Fix an LOCK/OPEN race when unlinking an open fileChuck Lever
At Connectathon 2016, we found that recent upstream Linux clients would occasionally send a LOCK operation with a zero stateid. This appeared to happen in close proximity to another thread returning a delegation before unlinking the same file while it remained open. Earlier, the client received a write delegation on this file and returned the open stateid. Now, as it is getting ready to unlink the file, it returns the write delegation. But there is still an open file descriptor on that file, so the client must OPEN the file again before it returns the delegation. Since commit 24311f884189 ('NFSv4: Recovery of recalled read delegations is broken'), nfs_open_delegation_recall() clears the NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a racing LOCK on the same inode to be put on the wire before the OPEN operation has returned a valid open stateid. To eliminate this race, serialize delegation return with the acquisition of a file lock on the same file. Adopt the same approach as is used in the unlock path. This patch also eliminates a similar race seen when sending a LOCK operation at the same time as returning a delegation on the same file. Fixes: 24311f884189 ('NFSv4: Recovery of recalled read ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [Anna: Add sentence about LOCK / delegation race] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: have flexfiles mirror keep creds for both ro and rw layoutsJeff Layton
A mirror can be shared between multiple layouts, even with different iomodes. That makes stats gathering simpler, but it causes a problem when we get different creds in READ vs. RW layouts. The current code drops the newer credentials onto the floor when this occurs. That's problematic when you fetch a READ layout first, and then a RW. If the READ layout doesn't have the correct creds to do a write, then writes will fail. We could just overwrite the READ credentials with the RW ones, but that would break the ability for the server to fence the layout for reads if things go awry. We need to be able to revert to the earlier READ creds if the RW layout is returned afterward. The simplest fix is to just keep two sets of creds per mirror. One for READ layouts and one for RW, and then use the appropriate set depending on the iomode of the layout segment. Also fix up some RCU nits that sparse found. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: get a reference to the credential in ff_layout_alloc_lsegJeff Layton
We're just as likely to have allocation problems here as we would if we delay looking up the credential like we currently do. Fix the code to get a rpc_cred reference early, as soon as the mirror is set up. This allows us to eliminate the mirror early if there is a problem getting an rpc credential. This also allows us to drop the uid/gid from the layout_mirror struct as well. In the event that we find an existing mirror where this one would go, we swap in the new creds unconditionally, and drop the reference to the old one. Note that the old ff_layout_update_mirror_cred function wouldn't set this pointer unless the DS version was 3, but we don't know what the DS version is at this point. I'm a little unclear on why it did that as you still need creds to talk to v4 servers as well. I have the code set it regardless of the DS version here. Also note the change to using generic creds instead of calling lookup_cred directly. With that change, we also need to populate the group_info pointer in the acred as some functions expect that to never be NULL. Instead of allocating one every time however, we can allocate one when the module is loaded and share it since the group_info is refcounted. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: have ff_layout_get_ds_cred take a reference to the credJeff Layton
In later patches, we're going to want to allow the creds to be updated when we get a new layout with updated creds. Have this function take a reference to the cred that is later put once the call has been dispatched. Also, prepare for this change by ensuring we follow RCU rules when getting a reference to the cred as well. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: don't call nfs4_ff_layout_prepare_ds from ff_layout_get_ds_credJeff Layton
All the callers already call that function before calling into here, so it ends up being a no-op anyway. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09NFS: Save struct inode * inside nfs_commit_info to clarify usage of i_lockDave Wysochanski
Commit ea2cf22 created nfs_commit_info and saved &inode->i_lock inside this NFS specific structure. This obscures the usage of i_lock. Instead, save struct inode * so later it's clear the spinlock taken is i_lock. Should be no functional change. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09nfs: add debug to directio "good_bytes" countingWeston Andros Adamson
This will pop a warning if we count too many good bytes. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09pnfs: set NFS_IOHDR_REDO in pnfs_read_resend_pnfsWeston Andros Adamson
Like other resend paths, mark the (old) hdr as NFS_IOHDR_REDO. This ensures the hdr completion function will not count the (old) hdr as good bytes. Also, vector the error back through the hdr->task.tk_status like other retry calls. This fixes a bug with the FlexFiles layout where libaio was reporting more bytes read than requested. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-09btrfs: fix int32 overflow in shrink_delalloc().Adam Borowski
UBSAN: Undefined behaviour in fs/btrfs/extent-tree.c:4623:21 signed integer overflow: 10808 * 262144 cannot be represented in type 'int [8]' If 8192<=items<16384, we request a writeback of an insane number of pages which is benign (everything will be written). But if items>=16384, the space reservation won't be enough. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-07get_rock_ridge_filename(): handle malformed NM entriesAl Viro
Payloads of NM entries are not supposed to contain NUL. When we run into such, only the part prior to the first NUL goes into the concatenation (i.e. the directory entry name being encoded by a bunch of NM entries). We do stop when the amount collected so far + the claimed amount in the current NM entry exceed 254. So far, so good, but what we return as the total length is the sum of *claimed* sizes, not the actual amount collected. And that can grow pretty large - not unlimited, since you'd need to put CE entries in between to be able to get more than the maximum that could be contained in one isofs directory entry / continuation chunk and we are stop once we'd encountered 32 CEs, but you can get about 8Kb easily. And that's what will be passed to readdir callback as the name length. 8Kb __copy_to_user() from a buffer allocated by __get_free_page() Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-07f2fs: do not preallocate block unaligned to 4KBJaegeuk Kim
Previously f2fs_preallocate_blocks() tries to allocate unaligned blocks. In f2fs_write_begin(), however, prepare_write_begin() does not skip its allocation due to (len != 4KB). So, it needs locking node page twice unexpectedly. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>