path: root/fs
2016-09-26 btrfs: add dynamic debug support (Jeff Mahoney)
We can re-use the dynamic debugging descriptor to make use of the dynamic debugging mechanism but still use our own printk interface. Defining the DEBUG macro works as it did before. When it's defined, all of the messages default to print. We can also enable all debug messages at boot or module-load time using the 'dyndbg' and 'btrfs.dyndbg' options. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
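A minimal sketch of the mechanism this describes, not the btrfs macros themselves, assuming the standard descriptor macros from <linux/dynamic_debug.h>: route a custom debug printk through a per-callsite descriptor when CONFIG_DYNAMIC_DEBUG is set, and fall back to the old DEBUG behaviour otherwise.

    #include <linux/printk.h>
    #include <linux/dynamic_debug.h>

    #if defined(CONFIG_DYNAMIC_DEBUG)
    /* one descriptor per callsite; the 'dyndbg' / module dyndbg options flip its flags */
    #define my_debug(fmt, args...)                                        \
    do {                                                                  \
            DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt);               \
            if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))        \
                    printk(KERN_DEBUG fmt, ##args);                       \
    } while (0)
    #elif defined(DEBUG)
    /* DEBUG defined: everything prints, as before */
    #define my_debug(fmt, args...) printk(KERN_DEBUG fmt, ##args)
    #else
    /* type-check the arguments but emit nothing */
    #define my_debug(fmt, args...) no_printk(KERN_DEBUG fmt, ##args)
    #endif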
2016-09-26 btrfs: Fix warning "variable ‘gen’ set but not used" (Luis Henriques)
Variable 'gen' in reada_for_search() is not used since commit 58dc4ce43251 ("btrfs: remove unused parameter from readahead_tree_block"). This patch simply removes this variable. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 btrfs: Fix warning "variable ‘blocksize’ set but not used" (Luis Henriques)
Variable 'blocksize' in reada_walk_down() is not used since commit d3e46fea1b1e ("btrfs: sink blocksize parameter to readahead_tree_block"). This patch simply removes this variable. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 btrfs: let btrfs_delete_unused_bgs() clean relocated bgs (Naohiro Aota)
Currently, btrfs_relocate_chunk() removes the relocated BG by itself. But the work can be done by btrfs_delete_unused_bgs() (and it's better, since it trims the BG). Let's dedupe the code. While btrfs_delete_unused_bgs() already hits the relocated BG, it skips the BG because the BG has the "ro" flag set (to keep the balancing BG intact). On the other hand, btrfs cannot drop the "ro" flag here, to prevent additional writes. So this patch makes use of the "removed" flag. btrfs_delete_unused_bgs() now checks that flag to distinguish whether a read-only BG is being relocated or not. Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 Btrfs: bail out if block group has different mixed flag (Liu Bo)
Currently we allow an inconsistency in the mixed flag (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA): a block group can carry the mixed flag while the filesystem does not, and we'd then get ENOSPC. If that happens, we have one space_info with the mixed flag and another space_info with only BTRFS_BLOCK_GROUP_METADATA, and global_block_rsv.space_info points to the latter one, but all bytes from the block group contribute to the mixed space_info, thus all allocations fail with ENOSPC. This adds a check for the above case. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ updated message ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 Btrfs: fix memory leak in reading btree blocks (Liu Bo)
We can read a btree block via readahead or an intentional read, and we can end up with a memory leak when the following happens: 1) readahead starts to read block A but does not wait for read completion, 2) btree_readpage_end_io_hook finds that block A is corrupted and needs to clear the uptodate bit on all of block A's pages, 3) meanwhile an intentional read kicks in and checks the uptodate bit of block A's pages to decide which pages need to be read, 4) some pages still carry the uptodate bit during 3)'s check, so 3) doesn't count them into eb->io_pages, but they are later cleared by 2) and have to be read anyway, so we get the wrong eb->io_pages, which results in a memory leak of this block. This fixes the problem by first taking all the page locks and only then checking the pages' uptodate bit.

The race, reconstructed from the original three-column diagram:
  t1 (readahead): read_extent_buffer_pages checks each page's uptodate bit, counts the not-uptodate pages into eb->io_pages, and submits __extent_read_full_page for them.
  t2 (readahead endio): end_bio_extent_readpage runs btree_readpage_end_io_hook on pages 0, 1 and 2 and marks them uptodate; on page 3 the sanity check reports something wrong, so clear_extent_buffer_uptodate(eb) clears the pages' uptodate bits again.
  t3 (the following read): read_extent_buffer_pages samples the uptodate bits in between, so the eb->io_pages it records does not match the number of pages it actually ends up reading.

So t3's eb->io_pages is not consistent with the number of pages it is reading, and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative number so that we are not able to free the eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
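A hedged sketch of the ordering the fix relies on (generic helper, not the btrfs function): take every page lock first, and only then sample the uptodate bits, so a concurrent readahead endio cannot clear uptodate between the count and the reads.

    #include <linux/pagemap.h>

    /* Count the pages that still need reading; the caller submits the reads
     * and unlocks the pages afterwards. Locking everything up front keeps
     * the count consistent with what is actually read. */
    static unsigned int count_pages_to_read(struct page **pages, unsigned int nr)
    {
            unsigned int i, to_read = 0;

            for (i = 0; i < nr; i++)
                    lock_page(pages[i]);

            for (i = 0; i < nr; i++)
                    if (!PageUptodate(pages[i]))
                            to_read++;

            return to_read;
    }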
2016-09-26 Btrfs: remove BUG() in raid56 (Liu Bo)
This BUG() has been triggered by a fuzz-testing image, which contains an invalid chunk type, i.e. a single-stripe chunk has the raid6 type. Btrfs can handle this gracefully by returning -EIO, so besides using btrfs_warn to give us more debugging information rather than a bare BUG(), we can return the error properly. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 btrfs: fix check_shared for fiemap ioctl (Lu Fengqi)
check_shared only identified an extent as shared when the root_id or the object_id differed. However, if an extent is referenced from different offsets of the same file, it should also be identified as shared. In addition, check_shared's loop is at least O(n^3), so an extent with too many references can even cause a soft lockup. First, add all delayed refs to the ref_tree and calculate the unique_refs; if unique_refs is greater than one, return BACKREF_FOUND_SHARED. Then individually add the on-disk references (inline/keyed) to the ref_tree and recalculate unique_refs to check whether it is greater than one. Because we return SHARED as soon as there are two references, the time complexity is close to constant. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
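For context, check_shared's verdict ultimately surfaces to user space as the FIEMAP_EXTENT_SHARED flag. A hedged userspace sketch (file path taken from argv, 32-extent buffer chosen arbitrarily) that dumps the shared bit per extent:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
            struct fiemap *fm;
            unsigned int i;
            int fd;

            if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return 1;

            fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
            fm->fm_length = ~0ULL;          /* map the whole file */
            fm->fm_extent_count = 32;

            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                    return 1;

            for (i = 0; i < fm->fm_mapped_extents; i++)
                    printf("extent %u: shared=%d\n", i,
                           !!(fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED));
            return 0;
    }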
2016-09-26 btrfs: create example debugfs file only in debugging build (David Sterba)
Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 btrfs: fix perms on demonstration debugfs interface (Eric Sandeen)
btrfs provides a helpful demonstration of how to export a global variable via debugfs; however, it is unique among other debugfs files in that it is world-writable, which causes some concern to people who are not familiar with its purpose. Fix it so that it is only user-writable. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
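A hedged sketch of the pattern being fixed (a generic module, not the btrfs code): exporting a u64 through debugfs with owner-writable permissions (0644) rather than world-writable 0666.

    #include <linux/module.h>
    #include <linux/debugfs.h>

    static u64 example_u64;
    static struct dentry *example_dir;

    static int __init example_debugfs_init(void)
    {
            example_dir = debugfs_create_dir("example", NULL);
            /* 0644: readable by everyone, writable only by the owner */
            debugfs_create_u64("test_u64", 0644, example_dir, &example_u64);
            return 0;
    }

    static void __exit example_debugfs_exit(void)
    {
            debugfs_remove_recursive(example_dir);
    }

    module_init(example_debugfs_init);
    module_exit(example_debugfs_exit);
    MODULE_LICENSE("GPL");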
2016-09-26 Btrfs: fix memory leak of block group cache (Liu Bo)
While processing delayed refs, we may update a block group's statistics and attach it to cur_trans->dirty_bgs, and writing dirty block groups later processes that list, which happens during btrfs_commit_transaction(). If, for whatever reason, the transaction is aborted and dirty_bgs is not processed in cleanup_transaction(), we end up with a memory leak of these dirty block group caches. Since btrfs_start_dirty_block_groups() doesn't make it go to the commit critical section, this also adds the cleanup work inside it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 xfs: log recovery tracepoints to track current lsn and buffer submission (Brian Foster)
Log recovery has particular rules around buffer submission along with tricky corner cases where independent transactions can share an LSN. As such, it can be difficult to follow when/why buffers are submitted during recovery. Add a couple tracepoints to post the current LSN of a record when a new record is being processed and when a buffer is being skipped due to LSN ordering. Also, update the recover item class to include the LSN of the current transaction for the item being processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 xfs: update metadata LSN in buffers during log recovery (Brian Foster)
Log recovery is currently broken for v5 superblocks in that it never updates the metadata LSN of buffers written out during recovery. The metadata LSN is recorded in various bits of metadata to provide recovery ordering criteria that prevent transient corruption states reported by buffer write verifiers. Without such ordering logic, buffer updates can be replayed out of order and lead to false positive transient corruption states. This is generally not a corruption vector on its own, but corruption detection shuts down the filesystem and ultimately prevents a mount if it occurs during log recovery. This requires an xfs_repair run that clears the log and potentially loses filesystem updates. This problem is avoided in most cases as metadata writes during normal filesystem operation update the metadata LSN appropriately. The problem with log recovery not updating metadata LSNs manifests if the system happens to crash shortly after log recovery itself. In this scenario, it is possible for log recovery to complete all metadata I/O such that the filesystem is consistent. If a crash occurs after that point but before the log tail is pushed forward by subsequent operations, however, the next mount performs the same log recovery over again. If a buffer is updated multiple times in the dirty range of the log, an earlier update in the log might not be valid based on the current state of the associated buffer after all of the updates in the log had been replayed (before the previous crash). If a verifier happens to detect such a problem, the filesystem claims corruption and immediately shuts down. This commonly manifests in practice as directory block verifier failures such as the following, likely because directory verifiers are particularly detailed in their checks as compared to most others:

  ...
  Mounting V5 Filesystem
  XFS (dm-0): Starting recovery (logdev: internal)
  XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of file fs/xfs/libxfs/xfs_dir2_data.c. Caller xfs_dir3_data_verify ...
  ...

Update log recovery to update the metadata LSN of recovered buffers. Since metadata LSNs are already updated by write verifier functions via attached log items, attach a dummy log item to the buffer during validation and explicitly set the LSN of the current transaction. This ensures that the metadata LSN of a buffer is updated based on whether the recovery I/O actually completes, and if so, that subsequent recovery attempts identify that the buffer is already up to date with respect to the current transaction. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 xfs: don't warn on buffers not being recovered due to LSN (Brian Foster)
The log recovery buffer validation function is invoked in cases where a buffer update may be skipped due to LSN ordering. If the validation function happens to come across directory conversion situations (e.g., a dir3 block to data conversion), it may warn about seeing a buffer log format of one type and a buffer with a magic number of another. This warning is not valid as the buffer update is ultimately skipped. This is indicated by a current_lsn of NULLCOMMITLSN provided by the caller. As such, update xlog_recover_validate_buf_type() to only warn in such cases when a buffer update is expected. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 xfs: pass current lsn to log recovery buffer validation (Brian Foster)
The current LSN must be available to the buffer validation function to provide the ability to update the metadata LSN of the buffer. Pass the current_lsn value down to xlog_recover_validate_buf_type() in preparation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 xfs: rework log recovery to submit buffers on LSN boundaries (Brian Foster)
The fix to log recovery to update the metadata LSN in recovered buffers introduces the requirement that a buffer is submitted only once per current LSN. Log recovery currently submits buffers on transaction boundaries. This is not sufficient as the abstraction between log records and transactions allows for various scenarios where multiple transactions can share the same current LSN. If independent transactions share an LSN and both modify the same buffer, log recovery can incorrectly skip updates and leave the filesystem in an inconsistent state. In preparation for proper metadata LSN updates during log recovery, update log recovery to submit buffers for write on LSN change boundaries rather than transaction boundaries. Explicitly track the current LSN in a new struct xlog field to handle the various corner cases of when the current LSN may or may not change. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 xfs: quiesce the filesystem after recovery on readonly mount (Dave Chinner)
Recently we've had a number of reports where log recovery on a v5 filesystem has reported corruptions that looked to be caused by recovery being re-run over the top of already-recovered metadata. This has uncovered a bug in recovery (fixed elsewhere) but the vector that caused this was largely unknown. A kdump test started tripping over this problem - the system would be crashed, the kdump kernel and environment would boot and dump the kernel core image, and then the system would reboot. After reboot, the root filesystem was triggering log recovery and corruptions were being detected. The metadumps indicated the above log recovery issue. What is happening is that the kdump kernel and environment are mounting the root device read-only to find the binaries needed to do their work. The result of this is that it is running log recovery. However, because there were unlinked files and EFIs to be processed by recovery, the completion of phase 1 of log recovery could not mark the log clean. And because it's a read-only mount, the unmount process does not write records to the log to mark it clean, either. Hence on the next mount of the filesystem, log recovery was run again across all the metadata that had already been recovered, and this is what triggered the corruption warnings. To avoid this problem, we need to ensure that a read-only mount always updates the log when it completes the second phase of recovery. We already handle this sort of issue with rw->ro remount transitions, so the solution is as simple as quiescing the filesystem at the appropriate time during the mount process. This results in the log being marked clean, so the mount behaviour recorded in the logs on repeated RO mounts will change (i.e. log recovery will no longer be run on every mount until a RW mount is done). This is a user-visible change in behaviour, but it is harmless. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26 xfs: remote attribute blocks aren't really userdata (Dave Chinner)
When adding a new remote attribute, we write the attribute to the new extent before the allocation transaction is committed. This means we cannot reuse busy extents as that violates crash consistency semantics. Hence we currently treat remote attribute extent allocation like userdata because it has the same overwrite ordering constraints as userdata. Unfortunately, this also allows the allocator to incorrectly apply extent size hints to the remote attribute extent allocation. This results in interesting failures, such as transaction block reservation overruns and in-memory inode attribute fork corruption. To fix this, we need to separate the busy extent reuse configuration from the userdata configuration. This changes the definition of XFS_BMAPI_METADATA slightly - it now means that the allocation is metadata and reuse of busy extents is acceptable due to the metadata ordering semantics of the journal. If this flag is not set, it means the allocation has unordered data writeback, and hence busy extent reuse is not allowed. It no longer implies the allocation is for user data, just that the data write will not be strictly ordered. This matches the semantics for both user data and remote attribute block allocation. As such, this patch changes the "userdata" field to a "datatype" field, and adds a "no busy reuse" flag to the field. When we detect an unordered data extent allocation, we immediately set the no reuse flag. We then set the "user data" flags based on the inode fork we are allocating the extent to. Hence we only set userdata flags on data fork allocations now and consider attribute fork remote extents to be an unordered metadata extent. The result is that remote attribute extents now have the expected allocation semantics, and the data fork allocation behaviour is completely unchanged. It should be noted that there may be other ways to fix this (e.g. use ordered metadata buffers for the remote attribute extent data write) but they are more invasive and difficult to validate both from a design and implementation POV. Hence this patch takes the simple, obvious route to fixing the problem... Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-24 fs: compat_ioctl: add pretimeout functions for watchdogs (Wolfram Sang)
The watchdog core now handles these ioctls centrally, so we want 64-bit (compat) support, too. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
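A hedged userspace sketch of the ioctls this makes available to 32-bit programs on 64-bit kernels (device path assumed to be /dev/watchdog0; note that opening the device arms the watchdog, hence the magic-close write):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    int main(void)
    {
            int pretimeout = 10;
            int fd = open("/dev/watchdog0", O_RDWR);  /* arms the watchdog */

            if (fd < 0)
                    return 1;

            if (ioctl(fd, WDIOC_SETPRETIMEOUT, &pretimeout) == 0 &&
                ioctl(fd, WDIOC_GETPRETIMEOUT, &pretimeout) == 0)
                    printf("pretimeout is now %d seconds\n", pretimeout);

            write(fd, "V", 1);  /* magic close: disarm before closing */
            close(fd);
            return 0;
    }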
2016-09-23 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs (Linus Torvalds)
Pull btrfs fixes from Chris Mason: "Josef fixed a problem when quotas are enabled with his latest ENOSPC rework, and Jeff added more checks into the subvol ioctls to avoid tripping up lookup_one_len"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: ensure that file descriptor used with subvol ioctls is a dir
  Btrfs: handle quota reserve failure properly
2016-09-23 Merge tag 'configfs-for-4.8-2' of git://git.infradead.org/users/hch/configfs (Linus Torvalds)
Pull configfs fix from Christoph Hellwig: "One more trivial fix for the binary attribute code from Phil Turnbull"

* tag 'configfs-for-4.8-2' of git://git.infradead.org/users/hch/configfs:
  configfs: Return -EFBIG from configfs_write_bin_file.
2016-09-23 nfsd: fix dprintk in nfsd4_encode_getdeviceinfo (Jeff Layton)
nfserr is big-endian, so we should convert it to host-endian before printing it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23 NFS: cache_lib: use complete() instead of complete_all() (Daniel Wagner)
There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). The generic caching code from sunrpc calls revisit() only once. The usage pattern of the completion is:

  waiter context                          waker context

  do_cache_lookup_wait()
    nfs_cache_defer_req_alloc()
      init_completion()
    do_cache_lookup()
    nfs_cache_wait_for_upcall()
      wait_for_completion_timeout()
                                          nfs_dns_cache_revisit()
                                            complete()
    nfs_cache_defer_req_put()

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
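A hedged, generic sketch of the single-waiter pattern described above (not the nfs cache code; names are illustrative): one task blocks with a timeout, the upcall side signals it exactly once, so complete() is sufficient.

    #include <linux/completion.h>
    #include <linux/jiffies.h>
    #include <linux/errno.h>

    static DECLARE_COMPLETION(upcall_done);

    /* waiter context */
    static int wait_for_upcall(void)
    {
            if (!wait_for_completion_timeout(&upcall_done, 5 * HZ))
                    return -ETIMEDOUT;
            return 0;
    }

    /* waker context: runs once when the deferred request is revisited */
    static void upcall_revisit(void)
    {
            complete(&upcall_done);
    }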
2016-09-23 NFS: direct: use complete() instead of complete_all() (Daniel Wagner)
There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). nfs_file_direct_write() or nfs_file_direct_read() allocate a request object via nfs_direct_req_alloc(), which initializes the completion. The request object is then freed later in the exit path. Between the initialization and the release, either nfs_direct_write_schedule_iovec() or nfs_direct_read_schedule_iovec() is called, which will asynchronously process the request. The calling function waits via nfs_direct_wait() until the async work has been done. Thus there is only one waiter on the completion. nfs_direct_pgio_init() and nfs_direct_read_completion() are passed via function pointers to nfs pageio. The first function does reference counting (get_dreq() and put_dreq()), which ensures that nfs_direct_read_completion() and nfs_direct_read_schedule_iovec() only call the completion path once. The usage pattern of the completion is:

  waiter context                          waker context

  nfs_file_direct_write()
    dreq = nfs_direct_req_alloc()
      init_completion()
    nfs_direct_write_schedule_iovec()
    nfs_direct_wait()
      wait_for_completion_killable()
                                          nfs_direct_write_schedule_work()
                                            nfs_direct_complete()
                                              complete()

  nfs_file_direct_read()
    dreq = nfs_direct_req_alloc()
      init_completion()
    nfs_direct_read_schedule_iovec()
    nfs_direct_wait()
      wait_for_completion_killable()
                                          nfs_direct_read_schedule_iovec()
                                            nfs_direct_complete()
                                              complete()
                                          nfs_direct_read_completion()
                                            nfs_direct_complete()
                                              complete()

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller)
2016-09-23 devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts (Eric W. Biederman)
In 99.99% of the cases only root in a user namespace can mount /dev/pts, and in those cases the owner of /dev/pts/ptmx will remain root:root. In the oddball case where someone else has CAP_SYS_ADMIN, this code modifies the /dev/pts mount code to use current_fsuid and current_fsgid as the values to use when creating the /dev/ptmx inode, as is done when any other file is created. This is a code simplification, and it allows running without a root user entirely. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
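A hedged sketch of the ownership change described above (illustrative helper, not the devpts patch itself): set the new inode's uid/gid from the mounter's credentials instead of hard-coding root.

    #include <linux/fs.h>
    #include <linux/cred.h>

    /* Give a freshly created inode the mounter's fsuid/fsgid. */
    static void own_inode_by_mounter(struct inode *inode)
    {
            inode->i_uid = current_fsuid();
            inode->i_gid = current_fsgid();
    }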
2016-09-23 devpts: Remove sync_filesystems (Eric W. Biederman)
devpts does not and never will have anything to sync so don't bother calling sync_filesystems on remount. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 devpts: Make devpts_kill_sb safe if fsi is NULL (Eric W. Biederman)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 devpts: Simplify devpts_mount by using mount_nodev (Eric W. Biederman)
Now that all of the work of setting up a superblock has been moved to devpts_fill_super simplify devpts_mount by calling mount_nodev instead of rolling mount_nodev by hand. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 devpts: Move the creation of /dev/pts/ptmx into fill_super (Eric W. Biederman)
The code makes more sense here and things are just clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-23 devpts: Move parse_mount_options into fill_super (Eric W. Biederman)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-22 nsfs: Simplify __ns_get_path (Eric W. Biederman)
Move mntget from the very beginning of __ns_get_path to the success path of __ns_get_path, and remove the mntget calls. This removes the possibility that there will be a mntget/mntput pair if __ns_get_path has to retry, and generally simplifies the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 Merge branch 'nsfs-ioctls' into HEAD (Eric W. Biederman)
From: Andrey Vagin <avagin@openvz.org> Each namespace has an owning user namespace, and there is currently no way to discover these relationships. Pid and user namespaces are hierarchical, and there is no way to discover parent-child relationships either. Why might we want to know the relationships between namespaces? One use would be visualization, in order to understand the running system. Another would be to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? One more use-case (which is usually called abnormal) is checkpoint/restart. In CRIU we are going to dump and restore nested namespaces. There [1] was a discussion about which interface to choose to determine relationships between namespaces. Eric suggested adding two ioctls [2]:

> Grumble, Grumble. I think this may actually be a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementation of these ioctls.

$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is:

  fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

  NS_GET_USERNS   Returns a file descriptor that refers to an owning user namespace.
  NS_GET_PARENT   Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning.

In addition to generic ioctl(2) errors, the following specific ones can occur:

  EINVAL   NS_GET_PARENT was called for a nonhierarchical namespace.
  EPERM    The requested namespace is outside of the current namespace scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing outside of the init namespace, so we can return EPERM in this case too.
  > The fewer special cases the easier the code is to get
  > correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: "W. Trevor King" <wking@tremily.us> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@canonical.com>
2016-09-22 nsfs: add ioctl to get a parent namespace (Andrey Vagin)
Pid and user namespaces are hierarchical, but there is no way to discover their parent-child relationships. In the future we will use this interface to dump and restore nested namespaces. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 nsfs: add ioctl to get an owning user namespace for ns file descriptor (Andrey Vagin)
Each namespace has an owning user namespace, and there is currently no way to discover these relationships. Understanding namespace relationships allows us to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? After a long discussion, Eric W. Biederman proposed to use ioctls for this purpose. The NS_GET_USERNS ioctl returns a file descriptor that refers to the owning user namespace. It returns EPERM if the target namespace is outside of the current user namespace. v2: rename parent to relative v3: Add a missing mntput when returning -EAGAIN --EWB Acked-by: Serge Hallyn <serge@hallyn.com> Link: https://lkml.org/lkml/2016/7/6/158 Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
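A hedged userspace sketch of both new ioctls (assumes a kernel and uapi headers that ship <linux/nsfs.h> with NS_GET_USERNS and NS_GET_PARENT):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/nsfs.h>

    int main(void)
    {
            int ns = open("/proc/self/ns/pid", O_RDONLY);
            if (ns < 0)
                    return 1;

            /* each call returns a new namespace fd, or -1 with EPERM/EINVAL */
            int owner = ioctl(ns, NS_GET_USERNS);   /* owning user namespace */
            int parent = ioctl(ns, NS_GET_PARENT);  /* parent pid namespace */

            printf("owning userns fd: %d, parent pidns fd: %d\n", owner, parent);
            return 0;
    }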
2016-09-22 kernel: add a helper to get an owning user namespace for a namespace (Andrey Vagin)
Return -EPERM if the owning user namespace is outside of the process's current user namespace. v2: In the first version ns_get_owner returned ENOENT for init_user_ns. This special case was removed in this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 NFS: nfs_prime_dcache must validate the filename (Trond Myklebust)
Before we try to stash it in the dcache, we need to at least check that the filename passed to us by the server is non-empty and doesn't contain any illegal '\0' or '/' characters. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
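A hedged sketch of the kind of check described above (hypothetical helper name, not the exact code added by the patch): reject empty names and names containing '\0' or '/'.

    #include <linux/string.h>
    #include <linux/types.h>

    static bool readdir_name_is_sane(const char *name, unsigned int len)
    {
            if (len == 0)
                    return false;
            /* no embedded NULs, no path separators */
            return !memchr(name, '\0', len) && !memchr(name, '/', len);
    }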
2016-09-22 nfs: allow blocking locks to be awoken by lock callbacks (Jeff Layton)
Add a waitqueue head to the client structure. Have clients set a wait on that queue prior to requesting a lock from the server. If the lock is blocked, then we can use that to wait for wakeups. Note that we do need to do this "manually" since we need to set the wait on the waitqueue prior to requesting the lock, but requesting a lock can involve activities that can block. However, only do that for NFSv4.1 locks, either by compiling out all of the waitqueue handling when CONFIG_NFS_V4_1 is disabled, or skipping all of it at runtime if we're dealing with v4.0, or v4.1 servers that don't send lock callbacks. Note too that even when we expect to get a lock callback, RFC5661 section 20.11.4 is pretty clear that we still need to poll for them, so we do still sleep on a timeout. We do however always poll at the longest interval in that case. Signed-off-by: Jeff Layton <jlayton@redhat.com> [Anna: nfs4_retry_setlk() "status" should default to -ERESTARTSYS] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-22 f2fs: preallocate blocks for encrypted file (Yunlei He)
This patch allows preallocating data blocks for buffered AIO writes to an encrypted file. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix to avoid BUG_ON] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 f2fs: show dirty inode number (Chao Yu)
This patch enables showing dirty inode number in procfs. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 f2fs: support IO error injection (Chao Yu)
This patch adds support for IO error injection, for testing the IO error tolerance of f2fs. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
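A hedged, generic sketch of ratio-based fault injection (illustrative only, not the f2fs implementation): fail roughly one in every 'rate' calls when injection is enabled.

    #include <linux/random.h>
    #include <linux/types.h>

    static unsigned int io_error_inject_rate;       /* 0 disables injection */

    static bool should_inject_io_error(void)
    {
            return io_error_inject_rate &&
                   prandom_u32() % io_error_inject_rate == 0;
    }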
2016-09-22 f2fs: fix to return error number of read_all_xattrs correctly (Chao Yu)
We treat every error in read_all_xattrs as an out-of-memory error, which hides the real reason for the failure. Fix it by returning the correct errno in order to reflect the real cause. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 f2fs: make f2fs_filetype_table static (Chao Yu)
There is no user of f2fs_filetype_table outside of dir.c anymore, so make it static. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-09-22 devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts (Eric W. Biederman)
In 99.99% of the cases only root in a user namespace can mount /dev/pts, and in those cases the owner of /dev/pts/ptmx will remain root:root. In the oddball case where someone else has CAP_SYS_ADMIN, this code modifies the /dev/pts mount code to use current_fsuid and current_fsgid as the values to use when creating the /dev/ptmx inode, as is done when any other file is created. This is a code simplification, and it allows running without a root user entirely. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 devpts: Remove sync_filesystems (Eric W. Biederman)
devpts does not and never will have anything to sync so don't bother calling sync_filesystems on remount. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 devpts: Make devpts_kill_sb safe if fsi is NULL (Eric W. Biederman)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 devpts: Simplify devpts_mount by using mount_nodev (Eric W. Biederman)
Now that all of the work of setting up a superblock has been moved to devpts_fill_super simplify devpts_mount by calling mount_nodev instead of rolling mount_nodev by hand. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 devpts: Move the creation of /dev/pts/ptmx into fill_super (Eric W. Biederman)
The code makes more sense here and things are just clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 devpts: Move parse_mount_options into fill_super (Eric W. Biederman)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 userns: When the per user per user namespace limit is reached return ENOSPC (Eric W. Biederman)
The current error codes returned when the per user per user namespace limit is hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for advice on linux-api and it was made clear that those were the wrong error codes, but a correct error code was not suggested. The best general error code I have found for hitting a resource limit is ENOSPC. It is not perfect, but as it is unambiguous it will serve until someone comes up with a better error code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The current error codes returned when a the per user per user namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for advice on linux-api and it we made clear that those were the wrong error code, but a correct effor code was not suggested. The best general error code I have found for hitting a resource limit is ENOSPC. It is not perfect but as it is unambiguous it will serve until someone comes up with a better error code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>