path: root/fs
Age / Commit message / Author
2020-12-13jffs2: Allow setting rp_size to zero during remountinglizhe
Setting rp_size to zero is ignored during remounting. The code decides whether an rp_size option was passed on remount by testing whether the parsed rp_size value is zero, which cannot work when "rp_size=0" is passed. This patch adds a bool variable "set_rp_size" to fix this problem. Reported-by: Jubin Zhong <zhongjubin@huawei.com> Signed-off-by: lizhe <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
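The "was the option given" flag described above can be sketched in userspace C. This is only an illustration under stated assumptions: the struct and function names are hypothetical and are not the jffs2 code; only the idea of a separate bool next to the value comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the jffs2 mount-option state. */
struct mount_opts_sketch {
    bool set_rp_size;   /* true only if "rp_size=" appeared in the options */
    uint32_t rp_size;   /* reserved pool size; 0 is a legitimate value */
};

/* Record a parsed "rp_size=<n>" option, including n == 0. */
static void parse_rp_size(struct mount_opts_sketch *parsed, uint32_t value)
{
    parsed->rp_size = value;
    parsed->set_rp_size = true;
}

/* On remount, copy rp_size only when the option was explicitly given. */
static void apply_remount(struct mount_opts_sketch *live,
                          const struct mount_opts_sketch *parsed)
{
    if (parsed->set_rp_size) {
        live->rp_size = parsed->rp_size;
        live->set_rp_size = true;
    }
}
```

Testing the flag instead of the value lets a remount with "rp_size=0" be told apart from a remount that does not mention rp_size at all.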
2020-12-13jffs2: Fix ignoring mounting options problem during remountinglizhe
The jffs2 mount options are ignored when remounting jffs2. This can be easily reproduced with the steps listed below. 1. mount -t jffs2 -o compr=none /dev/mtdblockx /mnt 2. mount -o remount compr=zlib /mnt Since ec10a24f10c8, option parsing happens before fill_super, and fc, which contains the option parsing results, is then passed to jffs2_reconfigure during remounting. But jffs2_reconfigure does not update c->mount_opts. This patch adds a function jffs2_update_mount_opts to fix this problem. By the way, I notice that tmpfs uses the same way to update remounting options. Is it necessary to unify them? Cc: <stable@vger.kernel.org> Fixes: ec10a24f10c8 ("vfs: Convert jffs2 to use the new mount API") Signed-off-by: lizhe <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13jffs2: Fix GC exit abnormallyZhe Li
The log of this problem is: jffs2: Error garbage collecting node at 0x***! jffs2: No space for garbage collection. Aborting GC thread This happens because GC believes that it did nothing, so it aborts. After going over the jffs2 image, I found a scenario that triggers this problem reliably: a dirent node is normal in the summary area, but the corresponding node outside the summary area is abnormal, with a bad name_crc. GC exits abnormally because it picks that abnormal dirent node to collect, but when it reaches jffs2_add_fd_to_list, the condition listed below is not met: if ((*prev)->nhash == new->nhash && !strcmp((*prev)->name, new->name)) So no node is marked obsolete and the erase block statistics do not change, which causes GC to exit abnormally. The root cause is that we do not check the name_crc of the abnormal dirent node when summary is enabled. Note that in jffs2_scan_dirent_node, we use jffs2_scan_dirty_space to deal with a dirent node with a bad name_crc. So this patch adds a check in read_direntry to ensure the correctness of the dirent node. If the check fails, the dirent node is marked obsolete, so GC will skip this node and the problem is fixed. Cc: <stable@vger.kernel.org> Signed-off-by: Zhe Li <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-13ubifs: Code cleanup by removing ifdef macro surroundingChengguang Xu
Define ubifs_listxattr and ubifs_xattr_handlers to NULL when CONFIG_UBIFS_FS_XATTR is not enabled, then we can remove many ugly ifdef macros in the code. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Richard Weinberger <richard@nod.at>
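The pattern of defining a symbol to NULL when a feature is compiled out can be sketched as follows. All names here are hypothetical: SKETCH_FS_XATTR stands in for CONFIG_UBIFS_FS_XATTR, and the struct is a toy operations table, not the kernel's inode_operations.

```c
#include <stddef.h>

/* Hypothetical callback type for a listxattr-style operation. */
typedef long (*listxattr_fn)(const char *path, char *buf, size_t size);

#ifdef SKETCH_FS_XATTR
long sketch_listxattr(const char *path, char *buf, size_t size);
#else
/* Feature compiled out: the name still exists, but expands to NULL. */
#define sketch_listxattr NULL
#endif

/* The operations table needs no #ifdef around the member initializer. */
struct inode_ops_sketch {
    listxattr_fn listxattr;
};

static const struct inode_ops_sketch sketch_file_ops = {
    .listxattr = sketch_listxattr,
};
```

The conditional lives in one place (typically a header), so every table that references the symbol stays free of preprocessor blocks.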
2020-12-13jffs2: Fix if/else empty body warningsRandy Dunlap
When debug (print) macros are not enabled, change them to use the no_printk() macro instead of <nothing>. This fixes gcc warnings when -Wextra is used: ../fs/jffs2/nodelist.c:255:37: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body] ../fs/jffs2/nodelist.c:278:38: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body] ../fs/jffs2/nodelist.c:558:52: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body] ../fs/jffs2/xattr.c:1247:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] ../fs/jffs2/xattr.c:1281:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] Builds without warnings on all 3 levels of CONFIG_JFFS2_FS_DEBUG. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: linux-mtd@lists.infradead.org Signed-off-by: Richard Weinberger <richard@nod.at>
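A userspace analogue of the fix is sketched below. It assumes printf in place of printk, and the macro and function names are hypothetical. The disabled variant still expands to a real statement, so the else branch is never empty and the arguments are still type-checked.

```c
#include <stdio.h>

#ifdef SKETCH_DEBUG
#define dbg_sketch(...) printf(__VA_ARGS__)
#else
/* Never prints, but is still a complete statement for the compiler. */
#define dbg_sketch(...) do { if (0) printf(__VA_ARGS__); } while (0)
#endif

/* Returns -1 for negative input, 0 otherwise. */
static int classify(int v)
{
    int result = 0;

    if (v < 0)
        result = -1;
    else
        dbg_sketch("non-negative value %d\n", v);  /* not an empty body */
    return result;
}
```

Had dbg_sketch() expanded to nothing, the else branch would consist of a lone semicolon, which is the construct -Wempty-body reports.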
2020-12-13ubifs: Delete duplicated words + other fixesRandy Dunlap
Delete repeated words in fs/ubifs/. {negative, is, of, and, one, it} where "it it" was changed to "if it". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> To: linux-fsdevel@vger.kernel.org Cc: Richard Weinberger <richard@nod.at> Cc: linux-mtd@lists.infradead.org Signed-off-by: Richard Weinberger <richard@nod.at>
2020-12-12fs/xfs: convert comma to semicolonZheng Yongjun
Replace a comma between expression statements by a semicolon. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
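Why a stray comma between statements is worth cleaning up can be shown with a small, self-contained sketch. It is not taken from the XFS code; the function name is hypothetical.

```c
/* Demonstrates that a comma joins two expressions into one statement. */
static int comma_vs_semicolon(void)
{
    int a = 0, b = 0;

    /* Comma operator: legal, both assignments run, but it reads like a typo. */
    a = 1, b = 2;

    /* Hazard: without braces, the comma keeps both assignments under the if. */
    if (a == 0)
        a = 10, b = 20;   /* neither runs here, because a == 1 */

    /* Had the comma above been a semicolon, b = 20 would run unconditionally. */
    return a + b;
}
```

In the simple case the behavior is identical, which is why such a conversion is a pure cleanup; the semicolon just removes the chance of the second form creeping in.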
2020-12-12xfs: open code updating i_mode in xfs_set_aclChristoph Hellwig
Rather than going through the big and hairy xfs_setattr_nonsize function, just open code a transactional i_mode and i_ctime update. This allows marking xfs_setattr_nonsize static and removing its flags argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: remove xfs_vn_setattr_nonsizeChristoph Hellwig
Merge xfs_vn_setattr_nonsize into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: kill ialloced in xfs_dialloc()Gao Xiang
It's enough to just use return code, and get rid of an argument. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: split xfs_dialloc() into 2 functionsDave Chinner
This patch explicitly separates free inode chunk allocation and inode allocation into two individual high level operations. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: move xfs_dialloc_roll() into xfs_dialloc()Dave Chinner
Get rid of the confusing ialloc_context and failure handling around xfs_dialloc() by moving xfs_dialloc_roll() into xfs_dialloc(). Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: move on-disk inode allocation out of xfs_ialloc()Dave Chinner
So xfs_ialloc() will only address in-core inode allocation. Also, rename xfs_ialloc() to xfs_dir_ialloc_init() in order to keep everything in xfs_inode.c under the same namespace. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: introduce xfs_dialloc_roll()Dave Chinner
Introduce a helper to make the on-disk inode allocation rolling logic clearer in preparation for the following cleanup. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12xfs: convert noroom, okalloc in xfs_dialloc() to boolGao Xiang
Boolean is preferred for such use. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-12-12Merge tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Two fixes in here, fixing issues introduced in this merge window" * tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-block: io_uring: fix file leak on error path of io ctx creation io_uring: fix mis-seting personality's creds
2020-12-12io_uring: remove 'twa_signal_ok' deadlock work-aroundJens Axboe
The TIF_NOTIFY_SIGNAL based implementation of TWA_SIGNAL is always safe to use, regardless of context, as we won't be recursing into the signal lock. So now that all archs are using that, we can drop this deadlock work-around as it's always safe to use TWA_SIGNAL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12io_uring: JOBCTL_TASK_WORK is no longer used by task_workJens Axboe
Remove the dead code, TWA_SIGNAL will never set JOBCTL_TASK_WORK at this point. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11Merge tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefsLinus Torvalds
Pull zonefs fix from Damien Le Moal: "A single patch in this pull request to fix a BIO and page reference leak when writing sequential zone files" * tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: fix page reference and BIO leak
2020-12-11proc: use untagged_addr() for pagemap_read addressesMiles Chen
When we try to visit the pagemap of a tagged userspace pointer, we find that the start_vaddr is not correct because of the tag. To fix it, we should untag the userspace pointers in pagemap_read().

I tested with 5.10-rc4 and the issue remains.

Explanation from Catalin in [1]:

"Arguably, that's a user-space bug since tagged file offsets were never supported. In this case it's not even a tag at bit 56 as per the arm64 tagged address ABI but rather down to bit 47. You could say that the problem is caused by the C library (malloc()) or whoever created the tagged vaddr and passed it to this function. It's not a kernel regression as we've never supported it. Now, pagemap is a special case where the offset is usually not generated as a classic file offset but rather derived by shifting a user virtual address. I guess we can make a concession for pagemap (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

My test code is based on [2]:

A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

userspace program:

    uint64 OsLayer::VirtualToPhysical(void *vaddr) {
        uint64 frame, paddr, pfnmask, pagemask;
        int pagesize = sysconf(_SC_PAGESIZE);
        off64_t off = ((uintptr_t)vaddr) / pagesize * 8;
        // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
        int fd = open(kPagemapPath, O_RDONLY);
        ...
        if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
            int err = errno;
            string errtxt = ErrorString(err);
            if (fd >= 0)
                close(fd);
            return 0;
        }
        ...
    }

kernel fs/proc/task_mmu.c:

    static ssize_t pagemap_read(struct file *file, char __user *buf,
                                size_t count, loff_t *ppos)
    {
        ...
        src = *ppos;
        svpfn = src / PM_ENTRY_BYTES;
        // svpfn == 0xb400007662f54
        start_vaddr = svpfn << PAGE_SHIFT;
        // start_vaddr == 0xb400007662f54000
        end_vaddr = mm->task_size;

        /* watch out for wraparound */
        // svpfn == 0xb400007662f54
        // (mm->task_size >> PAGE) == 0x8000000
        if (svpfn > mm->task_size >> PAGE_SHIFT)
            // the condition is true because of the tag 0xb4
            start_vaddr = end_vaddr;

        ret = 0;
        while (count && (start_vaddr < end_vaddr)) {
            // we cannot visit correct entry because start_vaddr is set to end_vaddr
            int len;
            unsigned long end;
            ...
        }
        ...
    }

[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158

Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
Cc: <stable@vger.kernel.org> [5.4-]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
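The arithmetic in this report can be replayed with a short userspace sketch. The assumptions are labeled in the code: 4 KiB pages, 8-byte pagemap entries, and a simplified untag_sketch() that just clears the top byte. The kernel's untagged_addr() is architecture-specific, and all function names below are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT     12  /* assumption: 4 KiB pages */
#define SKETCH_PM_ENTRY_BYTES 8   /* one 64-bit pagemap entry per page */

/* Userspace side: file offset into pagemap for a virtual address. */
static uint64_t pagemap_offset(uint64_t vaddr)
{
    return (vaddr >> SKETCH_PAGE_SHIFT) * SKETCH_PM_ENTRY_BYTES;
}

/* Kernel side: start address recovered from the file offset. */
static uint64_t offset_to_vaddr(uint64_t off)
{
    return (off / SKETCH_PM_ENTRY_BYTES) << SKETCH_PAGE_SHIFT;
}

/* Simplified untagging: clear bits 63..56. Assumes a user address with
 * bit 55 clear; this is an illustration, not the kernel implementation. */
static uint64_t untag_sketch(uint64_t addr)
{
    return addr & ((UINT64_C(1) << 56) - 1);
}
```

With the tagged pointer 0xb400007662f541c8, the recovered start address still carries the 0xb4 tag and compares above the task size, so the read loop never runs; after untagging it is an ordinary user address below 0x8000000000 (the task size implied by the values in the report).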
2020-12-11fsnotify: fix events reported to watching parent and childAmir Goldstein
fsnotify_parent() used to send two separate events to backends when a parent inode is watching children and the child inode is also watching. In an attempt to avoid duplicate events in fanotify, we unified the two backend callbacks to a single callback and handled the reporting of the two separate events for the relevant backends (inotify and dnotify). However the handling is buggy and can result in inotify and dnotify listeners receiving events of the type they never asked for or spurious events. The problem is the unified event callback with two inode marks (parent and child) is called when any of the parent and child inodes are watched and interested in the event, but the parent inode's mark that is interested in the event on the child is not necessarily the one we are currently reporting to (it could belong to a different group). So before reporting the parent or child event flavor to backend we need to check that the mark is really interested in that event flavor. The semantics of INODE and CHILD marks were hard to follow and made the logic more complicated than it should have been. Replace it with INODE and PARENT marks semantics to hopefully make the logic more clear. Thanks to Hugh Dickins for spotting a bug in the earlier version of this patch. Fixes: 497b0c5a7c06 ("fsnotify: send event to parent and child with single callback") CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201202120713.702387-4-amir73il@gmail.com Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-10Merge tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Here are a handful more bugfixes for 5.10. Unfortunately, we found some problems with the new READ_PLUS operation that aren't easy to fix. We've decided to disable this codepath through a Kconfig option for now, but a series of patches going into 5.11 will clean up the code and fix the issues at the same time. This seemed like the best way to go about it. Summary: - Fix array overflow when flexfiles mirroring is enabled - Fix rpcrdma_inline_fixup() crash with new LISTXATTRS - Fix 5 second delay when doing inter-server copy - Disable READ_PLUS by default" * tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Disable READ_PLUS by default NFSv4.2: Fix 5 seconds delay when doing inter server copy NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
2020-12-10Make sure that make_create_in_sticky() never sees uninitialized value of dir_modeAl Viro
make sure nd->dir_mode is always initialized after success exit from link_path_walk(); in case of empty path it did not happen. Reported-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is setHao Li
If DCACHE_REFERENCED is set, fast_dput() will return true, and then retain_dentry() has no chance to check DCACHE_DONTCACHE. As a result, the dentry won't be killed and the corresponding inode can't be evicted. In the following example, the DAX policy can't take effect unless we do a drop_caches manually.

    # DCACHE_LRU_LIST will be set
    echo abcdefg > test.txt
    # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything
    xfs_io -c 'chattr +x' test.txt
    # Drop caches to make the DAX change take effect
    echo 2 > /proc/sys/vm/drop_caches

This patch prevents fast_dput() from returning true if DCACHE_DONTCACHE is set. Then retain_dentry() will detect DCACHE_DONTCACHE and return false. As a result, the dentry will be killed and the inode will be evicted. In this way, if we change the per-file DAX policy, it will take effect automatically after the file is closed by all processes. I also add some comments to make the code clearer. Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
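The flag interaction described above can be modeled with a small sketch. It is a deliberately simplified model, not the dcache code: the SK_* constants stand in for the DCACHE_* flags, locking and reference counting are omitted, and the honor_dontcache parameter is a hypothetical switch for comparing the behavior before and after the patch.

```c
#include <stdbool.h>

#define SK_REFERENCED 0x1u  /* stand-in for DCACHE_REFERENCED */
#define SK_LRU_LIST   0x2u  /* stand-in for DCACHE_LRU_LIST */
#define SK_DONTCACHE  0x4u  /* stand-in for DCACHE_DONTCACHE */

/* Fast path: true means "nothing more to do, keep the dentry cached". */
static bool fast_dput_sketch(unsigned int flags, bool honor_dontcache)
{
    unsigned int mask = SK_REFERENCED | SK_LRU_LIST;

    if (honor_dontcache)
        mask |= SK_DONTCACHE;  /* patched behavior: DONTCACHE defeats the fast path */
    return (flags & mask) == (SK_REFERENCED | SK_LRU_LIST);
}

/* Slow path decision: a DONTCACHE dentry is not retained. */
static bool retain_dentry_sketch(unsigned int flags)
{
    return !(flags & SK_DONTCACHE);
}

/* True if the final dput ends up killing the dentry. */
static bool dput_kills_sketch(unsigned int flags, bool honor_dontcache)
{
    if (fast_dput_sketch(flags, honor_dontcache))
        return false;
    return !retain_dentry_sketch(flags);
}
```

With all three flags set, the unpatched model takes the fast path and keeps the dentry, while the patched model falls through to the retain check and kills it.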
2020-12-10fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()Hao Li
If generic_drop_inode() returns true, it means iput_final() can evict this inode regardless of whether it is dirty or not. If we check I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be evicted unconditionally. This is not the desired behavior because I_DONTCACHE only means the inode shouldn't be cached on the LRU list. As for whether we need to evict this inode, this is what generic_drop_inode() should do. This patch corrects the usage of I_DONTCACHE. This patch was proposed in [1]. [1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/ Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer") Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10fs/namespace.c: WARN if mnt_count has become negativeEric Biggers
Missing calls to mntget() (or equivalently, too many calls to mntput()) are hard to detect because mntput() delays freeing mounts using task_work_add(), then again using call_rcu(). As a result, mnt_count can often be decremented to -1 without getting a KASAN use-after-free report. Such cases are still bugs though, and they point to real use-after-frees being possible. For an example of this, see the bug fixed by commit 1b0b9cc8d379 ("vfs: fsmount: add missing mntget()"), discussed at https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u. This bug *should* have been trivial to find. But actually, it wasn't found until syzkaller happened to use fchdir() to manipulate the reference count just right for the bug to be noticeable. Address this by making mntput_no_expire() issue a WARN if mnt_count has become negative. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
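The idea of warning on a negative count can be sketched in userspace as below. The struct and function are hypothetical, a message on stderr stands in for the kernel's WARN, and the deferred freeing via task_work_add() and call_rcu() is not modeled.

```c
#include <stdio.h>

struct mount_sketch {
    int count;   /* stand-in for mnt_count */
    int warned;  /* set once an underflow has been detected */
};

/* Drop one reference. Returns 1 if this was the final put, 0 otherwise. */
static int mntput_sketch(struct mount_sketch *m)
{
    m->count--;
    if (m->count < 0) {
        /* An unbalanced put: report it instead of staying silent. */
        fprintf(stderr, "warning: mount refcount went negative (%d)\n",
                m->count);
        m->warned = 1;
        return 0;
    }
    return m->count == 0;
}
```

The check is cheap because the count is already read on this path, and it turns a latent imbalance into a visible report even when no use-after-free is observed.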
2020-12-10NFS: Disable READ_PLUS by defaultAnna Schumaker
We've been seeing failures with xfstests generic/091 and generic/263 when using READ_PLUS. I've made some progress on these issues, and the tests fail later on but still don't pass. Let's disable READ_PLUS by default until we can work out what is going on. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-10NFSv4.2: Fix 5 seconds delay when doing inter server copyDai Ngo
Since commit b4868b44c5628 ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE"), every inter-server copy operation suffers a 5-second delay regardless of the size of the copy. The delay comes from nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential fails because the seqids in both nfs4_state and nfs4_stateid are 0. Fix __nfs42_ssc_open to delay setting NFS_OPEN_STATE in nfs4_state until after the call to update_open_stateid, to indicate this is the first open. This fix is one of two patches; the other is the fix in the source server to return the stateid for a COPY_NOTIFY request with seqid 1 instead of 0. Fixes: ce0887ac96d3 ("NFSD add nfs4 inter ssc to nfsd4_copy") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-10NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operationChuck Lever
By switching to an XFS-backed export, I am able to reproduce the ibcomp worker crash on my client with xfstests generic/013. For the failing LISTXATTRS operation, xdr_inline_pages() is called with page_len=12 and buflen=128. - When ->send_request() is called, rpcrdma_marshal_req() does not set up a Reply chunk because buflen is smaller than the inline threshold. Thus rpcrdma_convert_iovs() does not get invoked at all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked on the receive buffer. - During reply processing, rpcrdma_inline_fixup() tries to copy received data into rq_rcv_buf->pages because page_len is positive. But there are no receive pages because rpcrdma_marshal_req() never allocated them. The result is that the ibcomp worker faults and dies. Sometimes that causes a visible crash, and sometimes it results in a transport hang without other symptoms. RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and should eventually be fixed or replaced. However, my preference is that upper-layer operations should explicitly allocate their receive buffers (using GFP_KERNEL) when possible, rather than relying on XDRBUF_SPARSE_PAGES. Reported-by: Olga kornievskaia <kolga@netapp.com> Suggested-by: Olga kornievskaia <kolga@netapp.com> Fixes: c10a75145feb ("NFSv4.2: add the extended attribute proc functions.") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Olga kornievskaia <kolga@netapp.com> Reviewed-by: Frank van der Linden <fllinden@amazon.com> Tested-by: Olga kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-10exec: Transform exec_update_mutex into a rw_semaphoreEric W. Biederman
Recently syzbot reported[0] that there is a deadlock amongst the users of exec_update_mutex. The problematic lock ordering found by lockdep was: perf_event_open (exec_update_mutex -> ovl_i_mutex) chown (ovl_i_mutex -> sb_writes) sendfile (sb_writes -> p->lock) by reading from a proc file and writing to overlayfs proc_pid_syscall (p->lock -> exec_update_mutex) While looking at possible solutions it occurred to me that all of the users and possible users involved only wanted the state of the given process to remain the same. They are all readers. The only writer is exec. There is no reason for readers to block on each other. So fix this deadlock by transforming exec_update_mutex into a rw_semaphore named exec_update_lock that only exec takes for writing. Cc: Jann Horn <jannh@google.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christopher Yeoh <cyeoh@au1.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex") [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10exec: Move io_uring_task_cancel after the point of no returnEric W. Biederman
Now that unshare_files happens in begin_new_exec after the point of no return, io_uring_task_cancel can also happen later. Effectively this means io_uring activities for a task are only canceled when exec succeeds. Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10coredump: Document coredump code exclusively used by cell spufsEric W. Biederman
Oleg Nesterov recently asked[1] why there is an unshare_files in do_coredump. After digging through all of the callers of lookup_fd it turns out that it is arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context that needs the unshare_files in do_coredump. Looking at the history[2], this code was also the only piece of coredump code that required the unshare_files when the unshare_files was added. Looking at that code it turns out that cell is also the only architecture that implements elf_coredump_extra_notes_size and elf_coredump_extra_notes_write. I looked at the gdb repo[3]; support for cell has been removed[4] in binutils 2.34. Geoff Levand reports he is still getting questions on how to run modern kernels on the PS3 from people using 3rd party firmware, so this code is not dead. According to Wikipedia the last PS3 shipped in Japan sometime in 2017. So it will probably be a little while before everyone's hardware dies. Add some comments briefly documenting the coredump code that exists only to support cell spufs to make it easier to understand the coredump code. Eventually the hardware will be dead, or there won't be userspace tools, or the coredump code will be refactored and it will be too difficult to update a dead architecture, and these comments make it easy to tell where to pull to remove cell spufs support. [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com [2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared") [3] git://sourceware.org/git/binutils-gdb.git [4] abf516c6931a ("Remove Cell Broadband Engine debugging support"). Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10file: Remove get_files_structEric W. Biederman
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that switching them to helper functions instead will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, which breaks posix locks, and can result in fget_light having to fall back to fget, reducing system performance. Now that get_files_struct has no more users and can not cause the problems for posix file locking and fget_light, remove get_files_struct so that it does not gain any new users. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-13-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-24-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10file: Rename __close_fd_get_file close_fd_get_fileEric W. Biederman
The function close_fd_get_file is explicitly a variant of __close_fd[1]. Now that __close_fd has been renamed close_fd, rename close_fd_get_file to be consistent with close_fd. When __alloc_fd, __close_fd and __fd_install were introduced the double underscore indicated that the function took a struct files_struct parameter. The function __close_fd_get_file never has, so the naming has always been inconsistent. This just cleans things up so there are not any lingering mentions of or references to __close_fd left in the code. [1] 80cd795630d6 ("binder: fix use-after-free due to ksys_close() during fdget()") Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10file: Replace ksys_close with close_fdEric W. Biederman
Now that ksys_close is exactly identical to close_fd replace the one caller of ksys_close with close_fd. [1] https://lkml.kernel.org/r/20200818112020.GA17080@infradead.org Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20201120231441.29911-22-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10file: Rename __close_fd to close_fd and remove the files parameterEric W. Biederman
The function __close_fd was added to support binder[1]. Now that binder has been fixed to no longer need __close_fd[2], all calls to __close_fd pass current->files. Therefore transform the files parameter into a local variable initialized to current->files, and rename __close_fd to close_fd to reflect this change and keep it in sync with the similar changes to __alloc_fd and __fd_install. This removes the need for callers to worry about the extra care that needs to be taken if anything except current->files is passed, by limiting the callers to only operating on current->files. [1] 483ce1d4b8c3 ("take descriptor-related part of close() to file.c") [2] 44d8047f1d87 ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
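The shape of this refactoring, turning a table parameter into a local variable bound to the caller's own table, can be sketched with a toy descriptor table. Everything here is hypothetical: files_sketch and current_files_sketch stand in for struct files_struct and current->files, and no locking is modeled.

```c
#include <errno.h>

#define SKETCH_MAX_FDS 8

/* Toy descriptor table: open[fd] is nonzero while fd is in use. */
struct files_sketch {
    int open[SKETCH_MAX_FDS];
};

/* Stand-in for current->files: the calling task's own table. */
static struct files_sketch current_files_sketch;

/* After the change: no table parameter, so a caller cannot pass a
 * foreign table by mistake. */
static int close_fd_sketch(unsigned int fd)
{
    struct files_sketch *files = &current_files_sketch;

    if (fd >= SKETCH_MAX_FDS || !files->open[fd])
        return -EBADF;
    files->open[fd] = 0;
    return 0;
}
```

The body of the function is unchanged by such a conversion; only the source of the table pointer moves from the argument list into the function.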
2020-12-10file: Merge __alloc_fd into alloc_fdEric W. Biederman
The function __alloc_fd was added to support binder[1]. With binder fixed[2] there are no more users. As alloc_fd just calls __alloc_fd with "files=current->files", merge them together by transforming the files parameter into a local variable initialized to current->files. [1] dcfadfa4ec5a ("new helper: __alloc_fd()") [2] 44d8047f1d87 ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-16-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-20-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10file: In f_dupfd read RLIMIT_NOFILE once.Eric W. Biederman
Simplify the code, and remove the chance of races by reading RLIMIT_NOFILE only once in f_dupfd. Pass the read value of RLIMIT_NOFILE into alloc_fd, which is the other location where the rlimit was read in f_dupfd. As f_dupfd is the only caller of alloc_fd, changing alloc_fd this way is trivially safe. Further, this causes alloc_fd to take all of the same arguments as __alloc_fd except for the files_struct argument. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-15-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-19-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
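Reading a limit once and passing the value down can be sketched as follows. This is a toy model under stated assumptions: rlimit_nofile_sketch is a plain variable standing in for the task's RLIMIT_NOFILE, the descriptor table is a byte array, and the function names are hypothetical.

```c
#include <errno.h>

#define SKETCH_TABLE_SIZE 16

/* Stand-in for the RLIMIT_NOFILE soft limit, which may change at any time. */
static unsigned long rlimit_nofile_sketch = SKETCH_TABLE_SIZE;

/* Lowest free slot in [start, end), or -EMFILE. The limit is an argument,
 * so this helper never re-reads it. */
static int alloc_fd_sketch(const unsigned char *used, unsigned int start,
                           unsigned long end)
{
    unsigned long fd;

    if (end > SKETCH_TABLE_SIZE)
        end = SKETCH_TABLE_SIZE;
    for (fd = start; fd < end; fd++)
        if (!used[fd])
            return (int)fd;
    return -EMFILE;
}

/* Reads the limit exactly once and uses that value for both checks. */
static int dupfd_sketch(const unsigned char *used, unsigned int from)
{
    unsigned long nofile = rlimit_nofile_sketch;  /* the single read */

    if (from >= nofile)
        return -EINVAL;
    return alloc_fd_sketch(used, from, nofile);
}
```

Because both the range check and the allocation see the same snapshot, a concurrent change of the limit cannot make them disagree.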
2020-12-10file: Merge __fd_install into fd_installEric W. Biederman
The function __fd_install was added to support binder[1]. With binder fixed[2] there are no more users. As fd_install just calls __fd_install with "files=current->files", merge them together by transforming the files parameter into a local variable initialized to current->files. [1] f869e8a7f753 ("expose a low-level variant of fd_install() for binder") [2] 44d8047f1d87 ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1:https://lkml.kernel.org/r/20200817220425.9389-14-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-18-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 proc/fd: In fdinfo seq_show don't use get_files_struct Eric W. Biederman
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, which breaks posix locks, and it can result in fget_light having to fall back to fget, reducing system performance. Instead hold task_lock for the duration that task->files needs to be stable in seq_show. The task_lock was already taken in get_files_struct, and so skipping get_files_struct performs less work overall, and avoids the problems with the files_struct reference count. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-12-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-17-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu Eric W. Biederman
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, which breaks posix locks, and it can result in fget_light having to fall back to fget, reducing system performance. Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving the checking for the maximum file descriptor into the generic code, and by removing the need for capturing and releasing a reference on files_struct. As task_lookup_next_fd_rcu may update the fd, ctx->pos has been changed to be the fd + 2 after task_lookup_next_fd_rcu returns. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Andy Lavr <andy.lavr@gmail.com> v1: https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-15-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 file: Implement task_lookup_next_fd_rcu Eric W. Biederman
As a companion to fget_task and task_lookup_fd_rcu implement task_lookup_next_fd_rcu that will return the struct file for the first file descriptor number that is equal to or greater than the fd argument value, or NULL if there is no such struct file. This allows file descriptors of foreign processes to be iterated through safely, without needing to increment the count on files_struct. Some concern[1] has been expressed that this function takes the task_lock for each iteration and thus for each file descriptor. The place where this function will be called in a commonly used code path is listing /proc/<pid>/fd. I did some small benchmarks and did not see any measurable performance differences. For ordinary users ls is likely to stat each of the directory entries and tid_fd_mode called from tid_fd_revalidate has always taken the task lock for each file descriptor. So this does not look like it will be a big change in practice. At some point it will probably be worth changing put_files_struct to free files_struct after an rcu grace period so that task_lock won't be needed at all. [1] https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com v1: https://lkml.kernel.org/r/20200817220425.9389-9-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-14-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
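The "first descriptor equal to or greater than fd" contract described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: struct file_model, struct fdtable_model, lookup_next_fd_model and count_open_model are hypothetical names, and neither task_lock nor RCU is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct file and fd table. */
struct file_model {
    int id;
};

struct fdtable_model {
    unsigned int max_fds;
    struct file_model **fd;     /* a NULL slot is a closed descriptor */
};

/*
 * Return the first open file whose descriptor is >= *fd and store that
 * descriptor back into *fd. Return NULL when no such descriptor exists.
 */
static struct file_model *lookup_next_fd_model(const struct fdtable_model *t,
                                               unsigned int *fd)
{
    assert(t != NULL && fd != NULL);
    for (; *fd < t->max_fds; (*fd)++) {
        if (t->fd[*fd] != NULL)
            return t->fd[*fd];
    }
    return NULL;
}

/* Count the open descriptors by walking the table with the helper. */
static unsigned int count_open_model(const struct fdtable_model *t)
{
    unsigned int fd = 0, n = 0;

    while (lookup_next_fd_model(t, &fd) != NULL) {
        n++;
        fd++;   /* resume the search just past the descriptor found */
    }
    return n;
}
```

Because the helper writes the descriptor it found back through the pointer, the caller only has to advance by one to continue the walk, which is the iteration pattern a directory listing would use.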
2020-12-10 proc/fd: In tid_fd_mode use task_lookup_fd_rcu Eric W. Biederman
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, which breaks posix locks, and it can result in fget_light having to fall back to fget, reducing system performance. Instead of manually coding finding the files struct for a task and then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu that combines those two steps, making the code simpler and removing the need to get a reference on a files_struct. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-7-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-12-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 file: Implement task_lookup_fd_rcu Eric W. Biederman
As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for querying an arbitrary process about a specific file. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein Link: https://lkml.kernel.org/r/20201120231441.29911-11-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 file: Rename fcheck lookup_fd_rcu Eric W. Biederman
Also remove the confusing comment about checking if a fd exists. I could not find one instance in the entire kernel that still matches the description or the reason for the name fcheck. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-10-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 file: Replace fcheck_files with files_lookup_fd_rcu Eric W. Biederman
This change renames fcheck_files to files_lookup_fd_rcu. All of the remaining callers take the rcu_read_lock before calling this function so the _rcu suffix is appropriate. This change also tightens up the debug check to verify that all callers hold the rcu_read_lock. All callers that used to call fcheck_files with the files->file_lock held have now been changed to call files_lookup_fd_locked. This change of name has helped remind me of which locks and which guarantees are in place, helping me to catch bugs later in the patchset. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 file: Factor files_lookup_fd_locked out of fcheck_files Eric W. Biederman
To make it easy to tell where files->file_lock protection is being used when looking up a file, create files_lookup_fd_locked. Only allow this function to be called with the file_lock held. Update the callers of fcheck and fcheck_files that are called with the files->file_lock held to call files_lookup_fd_locked instead. Hopefully this makes it easier to quickly understand what is going on. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-8-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 file: Rename __fcheck_files to files_lookup_fd_raw Eric W. Biederman
The function fcheck despite its comment is poorly named as it has no callers that only check its return value. All of fcheck's callers use the returned file descriptor. The same is true for fcheck_files and __fcheck_files. A new less confusing name is needed. In addition the names of these functions are confusing as they do not report the kind of locks that are needed to be held when these functions are called, making them error prone to use. To remedy this I am making the base function name lookup_fd and will add prefixes and suffixes to indicate the rest of the context. Name the function (previously called __fcheck_files) that proceeds from a struct files_struct, looks up the struct file of a file descriptor, and requires its callers to verify all of the appropriate locks are held files_lookup_fd_raw. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-7-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 proc/fd: In proc_fd_link use fget_task Eric W. Biederman
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, which breaks posix locks, and it can result in fget_light having to fall back to fget, reducing system performance. Simplifying proc_fd_link is a little bit tricky. It is necessary to know that there is a reference to fd_file while path_get is running. This reference can be guaranteed to exist either by locking the fdtable as the code currently does or by taking a reference on the file in question. Use fget_task to remove the need for get_files_struct and to take a reference to the file in question. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-8-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-6-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>