summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2016-01-09fs: xattr: Use kvfree()Richard Weinberger
... instead of open coding it. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-08nbd: use ->compat_ioctl()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-08Merge branch 'for-linus' into work.miscAl Viro
2016-01-08compat_ioctl: don't call do_ioctl under set_fs(KERNEL_DS)Jann Horn
This replaces all code in fs/compat_ioctl.c that translated ioctl arguments into a in-kernel structure, then performed do_ioctl under set_fs(KERNEL_DS), with code that allocates data on the user stack and can call the VFS ioctl handler under USER_DS. This is done as a hardening measure because the caller does not know what kind of ioctl handler will be invoked, only that no corresponding compat_ioctl handler exists and what the ioctl command number is. The accidental invocation of an unlocked_ioctl handler that unexpectedly calls copy_to_user could be a severe security issue. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-08compat_ioctl: don't pass fd around when not neededAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-08compat_ioctl: don't look up the fd twiceJann Horn
In code in fs/compat_ioctl.c that translates ioctl arguments into a in-kernel structure, then performs sys_ioctl, possibly under set_fs(KERNEL_DS), this commit changes the sys_ioctl calls to do_ioctl calls. do_ioctl is a new function that does the same thing as sys_ioctl, but doesn't look up the fd again. This change is made to avoid (potential) security issues because of ioctl handlers that accept one of the ioctl commands I2C_FUNCS, VIDEO_GET_EVENT, MTIOCPOS, MTIOCGET, TIOCGSERIAL, TIOCSSERIAL, RTC_IRQP_READ, RTC_EPOCH_READ. This can happen for multiple reasons: - The ioctl command number could be reused. - The ioctl handler might not check the full ioctl command. This is e.g. true for drm_ioctl. - The ioctl handler is very special, e.g. cuse_file_ioctl The real issue is that set_fs(KERNEL_DS) is used here, but that's fixed in a separate commit "compat_ioctl: don't call do_ioctl under set_fs(KERNEL_DS)". This change mitigates potential security issues by preventing a race that permits invocation of unlocked_ioctl handlers under KERNEL_DS through compat code even if a corresponding compat_ioctl handler exists. So far, no way has been identified to use this to damage kernel memory without having CAP_SYS_ADMIN in the init ns (with the capability, doing reads/writes at arbitrary kernel addresses should be easy through CUSE's ioctl handler with FUSE_IOCTL_UNRESTRICTED set). [AV: two missed sys_ioctl() taken care of] Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-08nfsd: give up on CB_LAYOUTRECALLs after two lease periodsJeff Layton
Have the CB_LAYOUTRECALL code treat NFS4_OK and NFS4ERR_DELAY returns equivalently. Change the code to periodically resend CB_LAYOUTRECALLS until the ls_layouts list is empty or the client returns a different error code. If we go for two lease periods without the list being emptied or the client sending a hard error, then we give up and clean out the list anyway. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-08ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface supportLi Xi
This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface support for ext4. The interface is kept consistent with XFS_IOC_FSGETXATTR/XFS_IOC_FSGETXATTR. Signed-off-by: Li Xi <lixi@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08ext4: add project quota supportLi Xi
This patch adds mount options for enabling/disabling project quota accounting and enforcement. A new specific inode is also used for project quota accounting. [ Includes fix from Dan Carpenter to crrect error checking from dqget(). ] Signed-off-by: Li Xi <lixi@ddn.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08ext4: adds project ID supportLi Xi
Signed-off-by: Li Xi <lixi@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08ext4 crypto: simplify interfaces to directory entry insert functionsTheodore Ts'o
A number of functions include ext4_add_dx_entry, make_indexed_dir, etc. are being passed a dentry even though the only thing they use is the containing parent. We can shrink the code size slightly by making this replacement. This will also be useful in cases where we don't have a dentry as the argument to the directory entry insert functions. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-01-08f2fs: skip releasing nodes in chindless extent treeChao Yu
If there are no nodes in extent tree, let's skip releasing step to avoid any overhead of grabbing/releasing extent tree lock. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: use atomic type for node count in extent treeChao Yu
1. rename field in struct extent_tree from count to node_cnt for readability. 2. alter to use atomic type for node_cnt. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: recognize encrypted data in f2fs_fiemapChao Yu
This patch fixes to teach f2fs_fiemap to recognize encrypted data. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: clean up f2fs_balance_fsJaegeuk Kim
This patch adds one parameter to clean up all the callers of f2fs_balance_fs. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: remove redundant callsJaegeuk Kim
This patch removes redundant calls. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: avoid unnecessary f2fs_balance_fs callsJaegeuk Kim
Only when node page is newly dirtied, it needs to check whether we need to do f2fs_gc. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: check the page status filled from diskJaegeuk Kim
After reading a page, we need to check whether there is any error. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: introduce __get_node_page to reuse common codeChao Yu
There are duplicated code in between get_node_page and get_node_page_ra, introduce __get_node_page to includes common parts of these two, and export get_node_page and get_node_page_ra by reusing __get_node_page. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08f2fs: check node id earily when readaheading node pageChao Yu
Add node id check in ra_node_page and get_node_page_ra like get_node_page. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-08locks: rename __posix_lock_file to posix_lock_inodeJeff Layton
...a more descriptive name and we can drop the double underscore prefix. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
2016-01-08locks: prink more detail when there are leaked locksJeff Layton
Right now, we just get WARN_ON_ONCE, which is not particularly helpful. Have it dump some info about the locks and the inode to make it easier to track down leaked locks in the future. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
2016-01-08locks: pass inode pointer to locks_free_lock_contextJeff Layton
...so we can print information about it if there are leaked locks. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
2016-01-08locks: sprinkle some tracepoints around the file locking codeJeff Layton
Add some tracepoints around the POSIX locking code. These were useful when tracking down problems when handling the race between setlk and close. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
2016-01-08locks: don't check for race with close when setting OFD lockJeff Layton
We don't clean out OFD locks on close(), so there's no need to check for a race with them here. They'll get cleaned out at the same time that flock locks are. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
2016-01-08NFS: Fix a compile warning about unused variable in nfs_generic_pg_pgios()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-08NFSv4: Fix a compile warning about no prototype for nfs4_ioctl()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-07locks: fix unlock when fcntl_setlk races with a closeJeff Layton
Dmitry reported that he was able to reproduce the WARN_ON_ONCE that fires in locks_free_lock_context when the flc_posix list isn't empty. The problem turns out to be that we're basically rebuilding the file_lock from scratch in fcntl_setlk when we discover that the setlk has raced with a close. If the l_whence field is SEEK_CUR or SEEK_END, then we may end up with fl_start and fl_end values that differ from when the lock was initially set, if the file position or length of the file has changed in the interim. Fix this by just reusing the same lock request structure, and simply override fl_type value with F_UNLCK as appropriate. That ensures that we really are unlocking the lock that was initially set. While we're there, make sure that we do pop a WARN_ON_ONCE if the removal ever fails. Also return -EBADF in this event, since that's what we would have returned if the close had happened earlier. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Fixes: c293621bbf67 (stale POSIX lock handling) Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
2016-01-08xfs: bmapbt checking on debug kernels too expensiveDave Chinner
For large sparse or fragmented files, checking every single entry in the bmapbt on every operation is prohibitively expensive. Especially as such checks rarely discover problems during normal operations on high extent coutn files. Our regression tests don't tend to exercise files with hundreds of thousands to millions of extents, so mostly this isn't noticed. However, trying to run things like xfs_mdrestore of large filesystem dumps on a debug kernel quickly becomes impossible as the CPU is completely burnt up repeatedly walking the sparse file bmapbt that is generated for every allocation that is made. Hence, if the file has more than 10,000 extents, just don't bother with walking the tree to check it exhaustively. The btree code has checks that ensure that the newly inserted/removed/modified record is correctly ordered, so the entrie tree walk in thses cases has limited additional value. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-08xfs: add tracepoints to readpage callsDave Chinner
This allows us to see page cache driven readahead in action as it passes through XFS. This helps to understand buffered read throughput problems such as readahead IO IO sizes being too small for the underlying device to reach max throughput. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-07Merge branch 'bugfixes'Trond Myklebust
* bugfixes: SUNRPC: Fixup socket wait for memory SUNRPC: Fix a missing break in rpc_anyaddr() pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh() NFS: Fix attribute cache revalidation NFS: Ensure we revalidate attributes before using execute_ok() NFS: Flush reclaim writes using FLUSH_COND_STABLE NFS: Background flush should not be low priority NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn NFSv4: Don't perform cached access checks before we've OPENed the file NFS: Allow the combination pNFS and labeled NFS NFS42: handle layoutstats stateid error nfs: Fix race in __update_open_stateid() nfs: fix missing assignment in nfs4_sequence_done tracepoint
2016-01-07NFS: Use wait_on_atomic_t() for unlock after readaheadBenjamin Coddington
The use of wait_on_atomic_t() for waiting on I/O to complete before unlocking allows us to git rid of the NFS_IO_INPROGRESS flag, and thus the nfs_iocounter's flags member, and finally the nfs_iocounter altogether. The count of I/O is moved to the lock context, and the counter increment/decrement functions become simple enough to open-code. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> [Trond: Fix up conflict with existing function nfs_wait_atomic_killable()] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-07Btrfs: fix fitrim discarding device area reserved for boot loader's useFilipe Manana
As of the 4.3 kernel release, the fitrim ioctl can now discard any region of a disk that is not allocated to any chunk/block group, including the first megabyte which is used for our primary superblock and by the boot loader (grub for example). Fix this by not allowing to trim/discard any region in the device starting with an offset not greater than min(alloc_start_mount_option, 1Mb), just as it was not possible before 4.3. A reproducer test case for xfstests follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are # reserved for a boot loader to use (GRUB for example) and btrfs should never # use them - neither for allocating metadata/data nor should trim/discard them. # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem. $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io # Now mount the filesystem and perform a fitrim against it. _scratch_mount _require_batched_discard $SCRATCH_MNT $FSTRIM_PROG $SCRATCH_MNT # Now unmount the filesystem and verify the content of the ranges was not # modified (no trim/discard happened on them). _scratch_unmount echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:" od -t x1 -N $((64 * 1024)) $SCRATCH_DEV od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV status=0 exit Reported-by: Vincent Petry <PVince81@yahoo.fr> Reported-by: Andrei Borzenkov <arvidjaar@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341 Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM) Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-07nfsd: Fix nfsd leaks sunrpc module referencesKinglong Mee
Stefan Hajnoczi reports, nfsd leaks 3 references to the sunrpc module here: # echo -n "asdf 1234" >/proc/fs/nfsd/portlist bash: echo: write error: Protocol not supported Now stop nfsd and try unloading the kernel modules: # systemctl stop nfs-server # systemctl stop nfs # systemctl stop proc-fs-nfsd.mount # systemctl stop var-lib-nfs-rpc_pipefs.mount # rmmod nfsd # rmmod nfs_acl # rmmod lockd # rmmod auth_rpcgss # rmmod sunrpc rmmod: ERROR: Module sunrpc is in use # lsmod | grep rpc sunrpc 315392 3 It is caused by nfsd don't cleanup rpcb program for nfsd when destroying svc service after creating xprt fail. Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07lockd: constify nlmsvc_binding structureJulia Lawall
The nlmsvc_binding structure is never modified, so declare it as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07lockd: use to_delayed_workGeliang Tang
Use to_delayed_work() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07nfsd: use to_delayed_workGeliang Tang
Use to_delayed_work() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07Btrfs: Check metadata redundancy on balanceSam Tygier
When converting a filesystem via balance check that metadata mode is at least as redundant as the data mode. For example give warning when: -dconvert=raid1 -mconvert=single Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk> [ minor message reformatting ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: statfs: report zero available if metadata are exhaustedDavid Sterba
There is one ENOSPC case that's very confusing. There's Available greater than zero but no file operation succeds (besides removing files). This happens when the metadata are exhausted and there's no possibility to allocate another chunk. In this scenario it's normal that there's still some space in the data chunk and the calculation in df reflects that in the Avail value. To at least give some clue about the ENOSPC situation, let statfs report zero value in Avail, even if there's still data space available. Current: /dev/sdb1 4.0G 3.3G 719M 83% /mnt/test New: /dev/sdb1 4.0G 3.3G 0 100% /mnt/test We calculate the remaining metadata space minus global reserve. If this is (supposedly) smaller than zero, there's no space. But this does not hold in practice, the exhausted state happens where's still some positive delta. So we apply some guesswork and compare the delta to a 4M threshold. (Practically observed delta was 2M.) We probably cannot calculate the exact threshold value because this depends on the internal reservations requested by various operations, so some operations that consume a few metadata will succeed even if the Avail is zero. But this is better than the other way around. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: preallocate path for snapshot creation at ioctl timeDavid Sterba
We can also preallocate btrfs_path that's used during pending snapshot creation and avoid another late ENOMEM failure. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: allocate root item at snapshot ioctl timeDavid Sterba
The actual snapshot creation is delayed until transaction commit. If we cannot get enough memory for the root item there, we have to fail the whole transaction commit which is bad. So we'll allocate the memory at the ioctl call and pass it along with the pending_snapshot struct. The potential ENOMEM will be returned to the caller of snapshot ioctl. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: do an allocation earlier during snapshot creationDavid Sterba
We can allocate pending_snapshot earlier and do not have to do cleanup in case of failure. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use smaller type for btrfs_path locksDavid Sterba
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see: * overall size of btrfs_path drops down from 136 to 112 (-24 bytes), * better packing in a slab page +6 objects * the whole structure now fits to 2 cachelines * slight decrease in code size: text data bss dec hex filename 938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before 938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after (and the generated assembly does not change much) The main purpose is to decrease the size of the structure without affecting performance. The byte access is usually well behaving accross arches, the locks are not accessed frequently and sometimes just compared to zero. Note for further size reduction attempts: the slots could be made u16 but this might generate worse code on some arches (non-byte and non-int access). Also the range of operations on slots is wider compared to locks and the potential performance drop should be evaluated first. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use smaller type for btrfs_path lowest_levelDavid Sterba
The level is 0..7, we can use smaller type. The size of btrfs_path is now 136 bytes from 144, which is +2 objects that fit into a 4k slab. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use smaller type for btrfs_path readaDavid Sterba
The possible values for reada are all positive and bounded, we can later save some bytes by storing it in u8. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: cleanup, use enum values for btrfs_path readaDavid Sterba
Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: constify static arraysDavid Sterba
There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: constify remaining structs with function pointersDavid Sterba
* struct extent_io_ops * struct btrfs_free_space_op Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs tests: replace whole ops structure for free space testsDavid Sterba
Preparatory work for making btrfs_free_space_op constant. In test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with own version thus preventing constification. We can rework it so we replace the whole structure with the correct function pointers. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: use list_for_each_entry* in backref.cGeliang Tang
Use list_for_each_entry*() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>