summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-07-25gfs2: fallocate_chunk: Always initialize struct iomapAndreas Gruenbacher
In fallocate_chunk, always initialize the iomap before calling gfs2_iomap_get_alloc: future changes could otherwise cause things like iomap.flags to leak across calls. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-24fs/cifs: Simplify ib_post_(send|recv|srq_recv)() callsBart Van Assche
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL as third argument to ib_post_(send|recv|srq_recv)(). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-25GFS2: Fix recovery issues for spectatorsBob Peterson
This patch fixes a couple problems dealing with spectators who remain with gfs2 mounts after the last non-spectator node fails. Before this patch, spectator mounts would try to acquire the dlm's mounted lock EX as part of its normal recovery sequence. The mounted lock is only used to determine whether the node is the first mounter, the first node to mount the file system, for the purposes of file system recovery and journal replay. It's not necessary for spectators: they should never do journal recovery. If they acquire the lock it will prevent another "real" first-mounter from acquiring the lock in EX mode, which means it also cannot do journal recovery because it doesn't think it's the first node to mount the file system. This patch checks if the mounter is a spectator, and if so, avoids grabbing the mounted lock. This allows a secondary mounter who is really the first non-spectator mounter, to do journal recovery: since the spectator doesn't acquire the lock, it can grab it in EX mode, and therefore consider itself to be the first mounter both as a "real" first mount, and as a first-real-after-spectator. Note that the control lock still needs to be taken in PR mode in order to fetch the lvb value so it has the current status of all journal's recovery. This is used as it is today by a first mounter to replay the journals. For spectators, it's merely used to fetch the status bits. All recovery is bypassed and the node waits until recovery is completed by a non-spectator node. I also improved the cryptic message given by control_mount when a spectator is waiting for a non-spectator to perform recovery. It also fixes a problem in gfs2_recover_set whereby spectators were never queueing recovery work for their own journal. They cannot do recovery themselves, but they still need to queue the work so they can check the recovery bits and clear the DFL_BLOCK_LOCKS bit once the recovery happens on another node. When the work queue runs on a spectator, it bypasses most of the work so it won't print a bunch of annoying messages. All it will print is a bunch of messages that look like this until recovery completes on the non-spectator node: GFS2: fsid=mycluster:scratch.s: recover generation 3 jid 0 GFS2: fsid=mycluster:scratch.s: recover jid 0 result busy These continue every 1.5 seconds until the recovery is done by the non-spectator, at which time it says: GFS2: fsid=mycluster:scratch.s: recover generation 4 done Then it proceeds with its mount. If the file system is mounted in spectator node and the last remaining non-spectator is fenced, any IO to the file system is blocked by dlm and the spectator waits until recovery is performed by a non-spectator. If a spectator tries to mount the file system before any non-spectators, it blocks and repeatedly gives this kernel message: GFS2: fsid=mycluster:scratch: Recovery is required. Waiting for a non-spectator to mount. GFS2: fsid=mycluster:scratch: Recovery is required. Waiting for a non-spectator to mount. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24exofs: use bio_clone_fast in _write_mirrorChristoph Hellwig
The mirroring code never changes the bio data or biovecs. This means we can reuse the biovec allocation easily instead of duplicating it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by Boaz Harrosh <ooo@electrozaur.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24xfs: properly handle free inodes in extent hint validatorsEric Sandeen
When inodes are freed in xfs_ifree(), di_flags is cleared (so extent size hints are removed) but the actual extent size fields are left intact. This causes the extent hint validators to fail on freed inodes which once had extent size hints. This can be observed (for example) by running xfs/229 twice on a non-crc xfs filesystem, or presumably on V5 with ikeep. Fixes: 7d71a67 ("xfs: verify extent size hint is valid in inode verifier") Fixes: 02a0fda ("xfs: verify COW extent size hint is valid in inode verifier") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-24Merge branch 'iomap-write' into linux-gfs2/for-nextAndreas Gruenbacher
Pull in the gfs2 iomap-write changes: Tweak the existing code to properly support iomap write and eliminate an unnecessary special case in gfs2_block_map. Implement iomap write support for buffered and direct I/O. Simplify some of the existing code and eliminate code that is no longer used: gfs2: Remove gfs2_write_{begin,end} gfs2: iomap direct I/O support gfs2: gfs2_extent_length cleanup gfs2: iomap buffered write support gfs2: Further iomap cleanups This is based on the following changes on the xfs 'iomap-4.19-merge' branch: iomap: add private pointer to struct iomap iomap: add a page_done callback iomap: generic inline data handling iomap: complete partial direct I/O writes synchronously iomap: mark newly allocated buffer heads as new fs: factor out a __generic_write_end helper Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24fs: gfs2: Adding new return type vm_fault_tSouptick Joarder
Use new return type vm_fault_t for gfs2_page_mkwrite handler. see commit 1c8f422059ae ("mm: change return type to vm_fault_t") for reference. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24gfs2: using posix_acl_xattr_size instead of posix_acl_to_xattrChengguang Xu
It seems better to get size by calling posix_acl_xattr_size() instead of calling posix_acl_to_xattr() with NULL buffer argument. posix_acl_xattr_size() never returns 0, so remove the unnecessary check. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24gfs2: Don't reject a supposedly full bitmap if we have blocks reservedBob Peterson
Before this patch, you could get into situations like this: 1. Process 1 searches for X free blocks, finds them, makes a reservation 2. Process 2 searches for free blocks in the same rgrp, but now the bitmap is full because process 1's reservation is skipped over. So it marks the bitmap as GBF_FULL. 3. Process 1 tries to allocate blocks from its own reservation, but since the GBF_FULL bit is set, it skips over the rgrp and searches elsewhere, thus not using its own reservation. This patch adds an additional check to allow processes to use their own reservations. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-23filesystem-dax: Introduce dax_lock_mapping_entry()Dan Williams
In preparation for implementing support for memory poison (media error) handling via dax mappings, implement a lock_page() equivalent. Poison error handling requires rmap and needs guarantees that the page->mapping association is maintained / valid (inode not freed) for the duration of the lookup. In the device-dax case it is sufficient to simply hold a dev_pagemap reference. In the filesystem-dax case we need to use the entry lock. Export the entry lock via dax_lock_mapping_entry() that uses rcu_read_lock() to protect against the inode being freed, and revalidates the page->mapping association under xa_lock(). Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-23xfs: force summary counter recalc at next mountDarrick J. Wong
Use the "bad summary count" mount flag from the previous patch to skip writing the unmount record to force log recovery at the next mount, which will recalculate the summary counters for us. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: refactor unmount record writeDarrick J. Wong
Refactor the writing of the unmount record into a separate helper. No functionality changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: detect and fix bad summary counts at mountDarrick J. Wong
Filippo Giunchedi complained that xfs doesn't even perform basic sanity checks of the fs summary counters at mount time. Therefore, recalculate the summary counters from the AGFs after log recovery if the counts were bad (or we had to recover the fs). Enhance the recalculation routine to fail the mount entirely if the new values are also obviously incorrect. We use a mount state flag to record the "bad summary count" state so that the (subsequent) online fsck patches can detect subtlely incorrect counts and set the flag; clear it userspace asks for a repair; or force a recalculation at the next mount if nobody fixes it by unmount time. Reported-by: Filippo Giunchedi <fgiunchedi@wikimedia.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: fix indentation and other whitespace problems in scrub/repairDarrick J. Wong
Now that we've shortened everything, fix up all the indentation and whitespace problems. There are no functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23xfs: shorten struct xfs_scrub_context to struct xfs_scrubDarrick J. Wong
Shorten the name of the online fsck context structure. Whitespace damage will be fixed by a subsequent patch. There are no functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23xfs: shorten xfs_repair_ prefix to xrep_Darrick J. Wong
Shorten all the metadata repair xfs_repair_* symbols to xrep_. Whitespace damage will be fixed by a subsequent patch. There are no functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23xfs: shorten xfs_scrub_ prefixDarrick J. Wong
Shorten all the metadata checking xfs_scrub_ prefixes to xchk_. After this, the only xfs_scrub* symbols are the ones that pertain to both scrub and repair. Whitespace damage will be fixed in a subsequent patch. There are no functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23xfs: clean up xfs_btree_del_cursor callersDarrick J. Wong
Less trivial cleanups of the error argument to xfs_btree_del_cursor; these require some minor code refactoring. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: trivial xfs_btree_del_cursor cleanupsDarrick J. Wong
The error argument to xfs_btree_del_cursor already understands the "nonzero for error" semantics, so remove pointless error testing in the callers and pass it directly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: return from _defer_finish with a clean transactionDarrick J. Wong
The following assertion was seen on generic/051: XFS: Assertion failed: tp->t_firstblock == NULLFSBLOCK, file: fs/xfs/libxfs5 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:102! invalid opcode: 0000 [#1] SMP PTI CPU: 2 PID: 20757 Comm: fsstress Not tainted 4.18.0-rc4+ #3969 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/4 RIP: 0010:assfail+0x23/0x30 Code: c3 66 0f 1f 44 00 00 48 89 f1 41 89 d0 48 c7 c6 88 e0 8c 82 48 89 fa RSP: 0018:ffff88012dc43c08 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88012dc43ca0 RCX: 0000000000000000 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828480eb RBP: ffff88012aa92758 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000000 R13: ffff88012dc43d48 R14: ffff88013092e7e8 R15: 0000000000000014 FS: 00007f8d689b8e80(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8d689c7000 CR3: 000000012ba6a000 CR4: 00000000000006e0 Call Trace: xfs_defer_init+0xff/0x160 xfs_reflink_remap_extent+0x31b/0xa00 xfs_reflink_remap_blocks+0xec/0x4a0 xfs_reflink_remap_range+0x3a1/0x650 xfs_file_dedupe_range+0x39/0x50 vfs_dedupe_file_range+0x218/0x260 do_vfs_ioctl+0x262/0x6a0 ? __se_sys_newfstat+0x3c/0x60 ksys_ioctl+0x35/0x60 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x4b/0x190 entry_SYSCALL_64_after_hwframe+0x49/0xbe The root cause of the assertion failure is that xfs_defer_finish doesn't roll the transaction after processing all the deferred items. Therefore it returns a dirty transaction to the caller, which leaves the caller at risk of exceeding the transaction reservation if it logs more items. Brian Foster's patchset to move the defer_ops firstblock into the transaction requires t_firstblock == NULLFSBLOCK upon defer_ops initialization, which is how this was noticed at all. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: check leaf attribute block freemap in verifierDarrick J. Wong
Check the leaf attribute freemap when we're verifying the block. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-22Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "Fix several places that screw up cleanups after failures halfway through opening a file (one open-coding filp_clone_open() and getting it wrong, two misusing alloc_file()). That part is -stable fodder from the 'work.open' branch. And Christoph's regression fix for uapi breakage in aio series; include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel definition of sigset_t, the reason for doing so in the first place had been bogus - there's no need to expose struct __aio_sigset in aio_abi.h at all" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: don't expose __aio_sigset in uapi ocxlflash_getfile(): fix double-iput() on alloc_file() failures cxl_getfile(): fix double-iput() on alloc_file() failures drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
2018-07-22efivars: Call guid_parse() against guid_t type of variableAndy Shevchenko
uuid_le_to_bin() is deprecated API and take into consideration that variable, to where we store parsed data, is type of guid_t we switch to guid_parse() for sake of consistency. While here, add error checking to it. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukas Wunner <lukas@wunner.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20180720014726.24031-10-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-21Merge tag 'for-4.18-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fix of a corruption regarding fsync and clone, under some very specific conditions explained in the patch. The fix is marked for stable 3.16+ so I'd like to get it merged now given the impact" * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix file data corruption after cloning a range and fsync
2018-07-21mm: make vm_area_alloc() initialize core fieldsLinus Torvalds
Like vm_area_dup(), it initializes the anon_vma_chain head, and the basic mm pointer. The rest of the fields end up being different for different users, although the plan is to also initialize the 'vm_ops' field to a dummy entry. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21mm: use helper functions for allocating and freeing vm_area structsLinus Torvalds
The vm_area_struct is one of the most fundamental memory management objects, but the management of it is entirely open-coded evertwhere, ranging from allocation and freeing (using kmem_cache_[z]alloc and kmem_cache_free) to initializing all the fields. We want to unify this in order to end up having some unified initialization of the vmas, and the first step to this is to at least have basic allocation functions. Right now those functions are literally just wrappers around the kmem_cache_*() calls. This is a purely mechanical conversion: # new vma: kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc() # copy old vma kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old) # free vma kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma) to the point where the old vma passed in to the vm_area_dup() function isn't even used yet (because I've left all the old manual initialization alone). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21fat: fix memory allocation failure handling of match_strdup()OGAWA Hirofumi
In parse_options(), if match_strdup() failed, parse_options() leaves opts->iocharset in unexpected state (i.e. still pointing the freed string). And this can be the cause of double free. To fix, this initialize opts->iocharset always when freeing. Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21signal: Pass pid type into do_send_sig_infoEric W. Biederman
This passes the information we already have at the call sight into do_send_sig_info. Ultimately allowing for better handling of signals sent to a group of processes during fork. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21signal: Pass pid type into send_sigio_to_task & send_sigurg_to_taskEric W. Biederman
This information is already present and using it directly simplifies the logic of the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21signal: Use PIDTYPE_TGID to clearly store where file signals will be sentEric W. Biederman
When f_setown is called a pid and a pid type are stored. Replace the use of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread group. Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now is only for a thread. Update the users of __f_setown to use PIDTYPE_TGID instead of PIDTYPE_PID. For now the code continues to capture task_pid (when task_tgid would really be appropriate), and iterate on PIDTYPE_PID (even when type == PIDTYPE_TGID) out of an abundance of caution to preserve existing behavior. Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID for tgid lookup also be used to avoid taking the tasklist lock. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pid: Implement PIDTYPE_TGIDEric W. Biederman
Everywhere except in the pid array we distinguish between a tasks pid and a tasks tgid (thread group id). Even in the enumeration we want that distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid we almost have an implementation of PIDTYPE_TGID in struct signal_struct. Add PIDTYPE_TGID as a first class member of the pid_type enumeration and into the pids array. Then remove the __PIDTYPE_TGID special case and the leader_pid in signal_struct. The net size increase is just an extra pointer added to struct pid and an extra pair of pointers of an hlist_node added to task_struct. The effect on code maintenance is the removal of a number of special cases today and the potential to remove many more special cases as PIDTYPE_TGID gets used to it's fullest. The long term potential is allowing zombie thread group leaders to exit, which will remove a lot more special cases in the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pids: Move the pgrp and session pid pointers from task_struct to signal_structEric W. Biederman
To access these fields the code always has to go to group leader so going to signal struct is no loss and is actually a fundamental simplification. This saves a little bit of memory by only allocating the pid pointer array once instead of once for every thread, and even better this removes a few potential races caused by the fact that group_leader can be changed by de_thread, while signal_struct can not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21pids: Compute task_tgid using signal->leader_pidEric W. Biederman
The cost is the the same and this removes the need to worry about complications that come from de_thread and group_leader changing. __task_pid_nr_ns has been updated to take advantage of this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-20sysfs, kobject: allow creating kobject belonging to arbitrary usersDmitry Torokhov
Normally kobjects and their sysfs representation belong to global root, however it is not necessarily the case for objects in separate namespaces. For example, objects in separate network namespace logically belong to the container's root and not global root. This change lays groundwork for allowing network namespace objects ownership to be transferred to container's root user by defining get_ownership() callback in ktype structure and using it in sysfs code to retrieve desired uid/gid when creating sysfs objects for given kobject. Co-Developed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20kernfs: allow creating kernfs objects with arbitrary uid/gidDmitry Torokhov
This change allows creating kernfs files and directories with arbitrary uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments. The "simple" kernfs_create_file() and kernfs_create_dir() are left alone and always create objects belonging to the global root. When creating symlinks ownership (uid/gid) is taken from the target kernfs object. Co-Developed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20filesystem-dax: Set page->indexDan Williams
In support of enabling memory_failure() handling for filesystem-dax mappings, set ->index to the pgoff of the page. The rmap implementation requires ->index to bound the search through the vma interval tree. The index is set and cleared at dax_associate_entry() and dax_disassociate_entry() time respectively. Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20ovl: Enable metadata only featureVivek Goyal
All the bits are in patches before this. So it is time to enable the metadata only copy up feature. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Do not do metacopy only for ioctl modifying file attrVivek Goyal
ovl_copy_up() by default will only do metadata only copy up (if enabled). That means when ovl_real_ioctl() calls ovl_real_file(), it will still get the lower file (as ovl_real_file() opens data file and not metacopy). And that means "chattr +i" will end up modifying lower inode. There seem to be two ways to solve this. A. Open metacopy file in ovl_real_ioctl() and do operations on that B. Force full copy up when FS_IOC_SETFLAGS is called. I am resorting to option B for now as it feels little safer option. If there are performance issues due to this, we can revisit it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Do not do metadata only copy-up for truncate operationVivek Goyal
truncate should copy up full file (and not do metacopy only), otherwise it will be broken. For example, use truncate to increase size of a file so that any read beyong existing size will return null bytes. If we don't copy up full file, then we end up opening lower file and read from it only reads upto the old size (and not new size after truncate). Hence to avoid such situations, copy up data as well when file size changes. So far it was being done by d_real(O_WRONLY) call in truncate() path. Now that patch has been reverted. So force full copy up in ovl_setattr() if size of file is changing. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: add helper to force data copy-upVivek Goyal
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Check redirect on index as wellVivek Goyal
Right now we seem to check redirect only if upperdentry is found. But it is possible that there is no upperdentry but later we found an index. We need to check redirect on index as well and set it in ovl_inode->redirect. Otherwise link code can assume that dentry does not have redirect and place a new one which breaks things. In my testing overlay/033 test started failing in xfstests. Following are the details. For example do following. $ mkdir lower upper work merged - Make lower dir with 4 links. $ echo "foo" > lower/l0.txt $ ln lower/l0.txt lower/l1.txt $ ln lower/l0.txt lower/l2.txt $ ln lower/l0.txt lower/l3.txt - Mount with index on and metacopy on. $ mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,\ index=on,metacopy=on none merged - Link lower $ ln merged/l0.txt merged/l4.txt (This will metadata copy up of l0.txt and put an absolute redirect /l0.txt) $ echo 2 > /proc/sys/vm/drop/caches $ ls merged/l1.txt (Now l1.txt will be looked up. There is no upper dentry but there is lower dentry and index will be found. We don't check for redirect on index, hence ovl_inode->redirect will be NULL.) - Link Upper $ ln merged/l4.txt merged/l5.txt (Lookup of l4.txt will use inode from l1.txt lookup which is still in cache. It has ovl_inode->redirect NULL, hence link will put a new redirect and replace /l0.txt with /l4.txt - Drop caches. echo 2 > /proc/sys/vm/drop_caches - List l1.txt and it returns -ESTALE $ ls merged/l0.txt (It returns stale because, we found a metacopy of l0.txt in upper and it has redirect l4.txt but there is no file named l4.txt in lower layer. So lower data copy is not found and -ESTALE is returned.) So problem here is that we did not process redirect on index. Check redirect on index as well and then problem is fixed. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Set redirect on upper inode when it is linkedVivek Goyal
When we create a hardlink to a metacopy upper file, first the redirect on that inode. Path based lookup will not work with newly created link and redirect will solve that issue. Also use absolute redirect as two hardlinks could be in different directores and relative redirect will not work. I have not put any additional locking around setting redirects while introducing redirects for non-dir files. For now it feels like existing locking is sufficient. If that's not the case, we will have add more locking. Following is my rationale about why do I think current locking seems ok. Basic problem for non-dir files is that more than on dentry could be pointing to same inode and in theory only relying on dentry based locks (d->d_lock) did not seem sufficient. We set redirect upon rename and upon link creation. In both the paths for non-dir file, VFS locks both source and target inodes (->i_rwsem). That means vfs rename and link operations on same source and target can't he happening in parallel (Even if there are multiple dentries pointing to same inode). So that probably means that at a time on an inode, only one call of ovl_set_redirect() could be working and we don't need additional locking in ovl_set_redirect(). ovl_inode->redirect is initialized only when inode is created new. That means it should not race with any other path and setting ovl_inode->redirect should be fine. Reading of ovl_inode->redirect happens in ovl_get_redirect() path. And this called only in ovl_set_redirect(). And ovl_set_redirect() already seemed to be protected using ->i_rwsem. That means ovl_set_redirect() and ovl_get_redirect() on source/target inode should not make progress in parallel and is mutually exclusive. Hence no additional locking required. Now, only case where ovl_set_redirect() and ovl_get_redirect() could race seems to be case of absolute redirects where ovl_get_redirect() has to travel up the tree. In that case we already take d->d_lock and that should be sufficient as directories will not have multiple dentries pointing to same inode. So given VFS locking and current usage of redirect, current locking around redirect seems to be ok for non-dir as well. Once we have the logic to remove redirect when metacopy file gets copied up, then we probably will need additional locking. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Set redirect on metacopy files upon renameVivek Goyal
Set redirect on metacopy files upon rename. This will help find data dentry in lower dirs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Do not set dentry type ORIGIN for broken hardlinksVivek Goyal
If a dentry has copy up origin, we set flag OVL_PATH_ORIGIN. So far this decision was easy that we had to check only for oe->numlower and if it is non-zero, we knew there is copy up origin. (For non-dir we installed origin dentry in lowerstack[0]). But we don't create ORGIN xattr for broken hardlinks (index=off). And with metacopy feature it is possible that we will install lowerstack[0] but ORIGIN xattr is not there. It is data dentry of upper metacopy dentry which has been found using regular name based lookup or using REDIRECT. So with addition of this new case, just presence of oe->numlower is not sufficient to guarantee that ORIGIN xattr is present. So to differentiate between two cases, look at OVL_CONST_INO flag. If this flag is set and upperdentry is there, that means it can be marked as type ORIGIN. OVL_CONST_INO is not set if lower hardlink is broken or will be broken over copy up. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Add an inode flag OVL_CONST_INOVivek Goyal
Add an ovl_inode flag OVL_CONST_INO. This flag signifies if inode number will remain constant over copy up or not. This flag does not get updated over copy up and remains unmodifed after setting once. Next patch in the series will make use of this flag. It will basically figure out if dentry is of type ORIGIN or not. And this can be derived by this flag. ORIGIN = (upperdentry && ovl_test_flag(OVL_CONST_INO, inode)). Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Treat metacopy dentries as type OVL_PATH_MERGEVivek Goyal
Right now OVL_PATH_MERGE is used only for merged directories. But conceptually, a metacopy dentry (backed by a lower data dentry) is a merged entity as well. So mark metacopy dentries as OVL_PATH_MERGE and ovl_rename() makes use of this property later to set redirect on a metacopy file. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Check redirects for metacopy filesVivek Goyal
Right now we rely on path based lookup for data origin of metacopy upper. This will work only if upper has not been renamed. We solved this problem already for merged directories using redirect. Use same logic for metacopy files. This patch just goes on to check redirects for metacopy files. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Move some dir related ovl_lookup_single() code in else blockVivek Goyal
Move some directory related code in else block. This is pure code reorganization and no functionality change. Next patch enables redirect processing on metacopy files and needs this change. By keeping non-functional changes in a separate patch, next patch looks much smaller and cleaner. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Do not expose metacopy only dentry from d_real()Vivek Goyal
Metacopy dentry/inode is internal to overlay and is never exposed outside of it. Exception is metacopy upper file used for fsync(). Modify d_real() to look for dentries/inode which have data, but also allow matching upper inode without data for the fsync case. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20ovl: Open file with data except for the case of fsyncVivek Goyal
ovl_open() should open file which contains data and not open metacopy inode. With the introduction of metacopy inodes, with current implementaion we will end up opening metacopy inode as well. But there can be certain circumstances like ovl_fsync() where we want to allow opening a metacopy inode instead. Hence, change ovl_open_realfile() and and add extra parameter which specifies whether to allow opening metacopy inode or not. If this parameter is false, we look for data inode and open that. This should allow covering both the cases. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>