summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-10-13fs/binfmt_misc.c: node could be NULL when evicting inodeEryu Guan
inode->i_private is assigned by a Node pointer only after registering a new binary format, so it could be NULL if inode was created by bm_fill_super() (or iput() was called by the error path in bm_register_write()), and this could result in NULL pointer dereference when evicting such an inode. e.g. mount binfmt_misc filesystem then umount it immediately: mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc umount /proc/sys/fs/binfmt_misc will result in BUG: unable to handle kernel NULL pointer dereference at 0000000000000013 IP: bm_evict_inode+0x16/0x40 [binfmt_misc] ... Call Trace: evict+0xd3/0x1a0 iput+0x17d/0x1d0 dentry_unlink_inode+0xb9/0xf0 __dentry_kill+0xc7/0x170 shrink_dentry_list+0x122/0x280 shrink_dcache_parent+0x39/0x90 do_one_tree+0x12/0x40 shrink_dcache_for_umount+0x2d/0x90 generic_shutdown_super+0x1f/0x120 kill_litter_super+0x29/0x40 deactivate_locked_super+0x43/0x70 deactivate_super+0x45/0x60 cleanup_mnt+0x3f/0x70 __cleanup_mnt+0x12/0x20 task_work_run+0x86/0xa0 exit_to_usermode_loop+0x6d/0x99 syscall_return_slowpath+0xba/0xf0 entry_SYSCALL_64_fastpath+0xa3/0xa Fix it by making sure Node (e) is not NULL. Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com Fixes: 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()") Signed-off-by: Eryu Guan <eguan@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-13fs/mpage.c: fix mpage_writepage() for pages with buffersMatthew Wilcox
When using FAT on a block device which supports rw_page, we can hit BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we call clean_buffers() after unlocking the page we've written. Introduce a new clean_page_buffers() which cleans all buffers associated with a page and call it from within bdev_write_page(). [akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew] Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Tested-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-12Merge tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: - Fix a stale kernel memory exposure when logging inodes. - Fix some build problems with CONFIG_XFS_RT=n - Don't change inode mode if the acl write fails, leaving the file totally inaccessible. - Fix a dangling pointer problem when removing an attr fork under memory pressure. - Don't crash while trying to invalidate a null buffer associated with a corrupt metadata pointer. * tag 'xfs-4.14-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: handle error if xfs_btree_get_bufs fails xfs: reinit btree pointer on attr tree inactivation walk xfs: Fix bool initialization/comparison xfs: don't change inode mode if ACL update fails xfs: move more RT specific code under CONFIG_XFS_RT xfs: Don't log uninitialised fields in inode structures
2017-10-12Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota fix from Jan Kara: "A fix for a regression in handling of quota grace times and warnings" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations
2017-10-11xfs: handle error if xfs_btree_get_bufs failsEric Sandeen
Jason reported that a corrupted filesystem failed to replay the log with a metadata block out of bounds warning: XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000 _xfs_buf_find() and xfs_btree_get_bufs() return NULL if that happens, and then when xfs_alloc_fix_freelist() calls xfs_trans_binval() on that NULL bp, we oops with: BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8 We don't handle _xfs_buf_find errors very well, every caller higher up the stack gets to guess at why it failed. But we should at least handle it somehow, so return EFSCORRUPTED here. Reported-by: Jason L Tibbitts III <tibbs@math.uh.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11xfs: reinit btree pointer on attr tree inactivation walkBrian Foster
xfs_attr3_root_inactive() walks the attr fork tree to invalidate the associated blocks. xfs_attr3_node_inactive() recursively descends from internal blocks to leaf blocks, caching block address values along the way to revisit parent blocks, locate the next entry and descend down that branch of the tree. The code that attempts to reread the parent block is unsafe because it assumes that the local xfs_da_node_entry pointer remains valid after an xfs_trans_brelse() and re-read of the parent buffer. Under heavy memory pressure, it is possible that the buffer has been reclaimed and reallocated by the time the parent block is reread. This means that 'btree' can point to an invalid memory address, lead to a random/garbage value for child_fsb and cause the subsequent read of the attr fork to go off the rails and return a NULL buffer for an attr fork offset that is most likely not allocated. Note that this problem can be manufactured by setting XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers, creating a file with a multi-level attr fork and removing it to trigger inactivation. To address this problem, reinit the node/btree pointers to the parent buffer after it has been re-read. This ensures btree points to a valid record and allows the walk to proceed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11xfs: Fix bool initialization/comparisonThomas Meyer
Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11xfs: don't change inode mode if ACL update failsDave Chinner
If we get ENOSPC half way through setting the ACL, the inode mode can still be changed even though the ACL does not exist. Reorder the operation to only change the mode of the inode if the ACL is set correctly. Whilst this does not fix the problem with crash consistency (that requires attribute addition to be a deferred op) it does prevent ENOSPC and other non-fatal errors setting an xattr to be handled sanely. This fixes xfstests generic/449. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11xfs: move more RT specific code under CONFIG_XFS_RTDave Chinner
Various utility functions and interfaces that iterate internal devices try to reference the realtime device even when RT support is not compiled into the kernel. Make sure this code is excluded from the CONFIG_XFS_RT=n build, and where appropriate stub functions to return fatal errors if they ever get called when RT support is not present. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-11xfs: Don't log uninitialised fields in inode structuresDave Chinner
Prevent kmemcheck from throwing warnings about reading uninitialised memory when formatting inodes into the incore log buffer. There are several issues here - we don't always log all the fields in the inode log format item, and we never log the inode the di_next_unlinked field. In the case of the inode log format item, this is exacerbated by the old xfs_inode_log_format structure padding issue. Hence make the padded, 64 bit aligned version of the structure the one we always use for formatting the log and get rid of the 64 bit variant. This means we'll always log the 64-bit version and so recovery only needs to convert from the unpadded 32 bit version from older 32 bit kernels. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-119p: set page uptodate when required in write_end()Alexander Levin
Commit 77469c3f570 prevented setting the page as uptodate when we wrote the right amount of data, fix that. Fixes: 77469c3f570 ("9p: saner ->write_end() on failing copy into non-uptodate page") Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Alexander Levin <alexander.levin@verizon.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Fairly old DIO bug caught by Andreas (3.10+) and several slightly younger blk_rq_map_user_iov() bugs, both on map and copy codepaths (Vitaly and me)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: bio_copy_user_iov(): don't ignore ->iov_offset more bio_map_user_iov() leak fixes fix unbalanced page refcounting in bio_map_user_iov direct-io: Prevent NULL pointer access in submit_page_section
2017-10-10direct-io: Prevent NULL pointer access in submit_page_sectionAndreas Gruenbacher
In the code added to function submit_page_section by commit b1058b981, sdio->bio can currently be NULL when calling dio_bio_submit. This then leads to a NULL pointer access in dio_bio_submit, so check for a NULL bio in submit_page_section before trying to submit it instead. Fixes xfstest generic/250 on gfs2. Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-10Merge tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fix from Bruce Fields: "One fix for a 4.14 regression, and one minor fix to the MAINTAINERs file. (I was weirdly flattered by the idea that lots of random people suddenly seemed to think Jeff and I were VFS experts. Turns out it was just a typo)" * tag 'nfsd-4.14-1' of git://linux-nfs.org/~bfields/linux: nfsd4: define nfsd4_secinfo_no_name_release() MAINTAINERS: associate linux/fs.h with VFS instead of file locking
2017-10-10Merge tag 'f2fs-for-4.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fix from Jaegeuk Kim: "This contains one bug fix which causes a kernel panic during fstrim introduced in 4.14-rc1" * tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: fix potential panic during fstrim
2017-10-10quota: Generate warnings for DQUOT_SPACE_NOFAIL allocationsJan Kara
Eryu has reported that since commit 7b9ca4c61bc2 "quota: Reduce contention on dq_data_lock" test generic/233 occasionally fails. This is caused by the fact that since that commit we don't generate warning and set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set (these are for example some metadata allocations in ext4). We need these allocations to behave regularly wrt warning generation and grace time setting so fix the code to return to the original behavior. Reported-and-tested-by: Eryu Guan <eguan@redhat.com> CC: stable@vger.kernel.org Fixes: 7b9ca4c61bc278b771fb57d6290a31ab1fc7fdac Signed-off-by: Jan Kara <jack@suse.cz>
2017-10-09Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Hightlights include: stable fixes: - nfs/filelayout: fix oops when freeing filelayout segment - NFS: Fix uninitialized rpc_wait_queue bugfixes: - NFSv4/pnfs: Fix an infinite layoutget loop - nfs: RPC_MAX_AUTH_SIZE is in bytes" * tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4/pnfs: Fix an infinite layoutget loop nfs/filelayout: fix oops when freeing filelayout segment sunrpc: remove redundant initialization of sock NFS: Fix uninitialized rpc_wait_queue NFS: Cleanup error handling in nfs_idmap_request_key() nfs: RPC_MAX_AUTH_SIZE is in bytes
2017-10-06Merge tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: - fix a race between overlapping copy on write aio - fix cow fork swapping when we defragment reflinked files * tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: handle racy AIO in xfs_reflink_end_cow xfs: always swap the cow forks when swapping extents
2017-10-06Merge branch 'for-4.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two more fixes for bugs introduced in 4.13. The sector_t problem with 32bit architecture and !LBDAF config seems serious but the number of affected deployments is hopefully low. The clashing status bits could lead to a confusing in-memory state of the whole-filesystem operations if used with the quota override sysfs knob" * 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix overlap of fs_info::flags values btrfs: avoid overflow when sector_t is 32 bit
2017-10-06Merge tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Two fixups for CephFS snapshot-handling patches in -rc1" * tag 'ceph-for-4.14-rc4' of git://github.com/ceph/ceph-client: ceph: fix __choose_mds() for LSSNAP request ceph: properly queue cap snap for newly created snap realm
2017-10-06Merge branch 'overlayfs-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix a regression in 4.14 and one in 4.13. The latter is a case when Docker is doing something it really shouldn't and gets away with it. We now print a warning instead of erroring out. There are also fixes to several error paths" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix regression caused by exclusive upper/work dir protection ovl: fix missing unlock_rename() in ovl_do_copy_up() ovl: fix dentry leak in ovl_indexdir_cleanup() ovl: fix dput() of ERR_PTR in ovl_cleanup_index() ovl: fix error value printed in ovl_lookup_index() ovl: fix may_write_real() for overlayfs directories
2017-10-05nfsd4: define nfsd4_secinfo_no_name_release()Eryu Guan
Commit 34b1744c91cc ("nfsd4: define ->op_release for compound ops") defined a couple ->op_release functions and run them if necessary. But there's a problem with that is that it reused nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and caused a leak on struct nfsd4_secinfo_no_name in nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in struct nfsd4_secinfo_no_name. I found this because I was unable to umount an ext4 partition after exporting it via NFS & run fsstress on the nfs mount. A simplified reproducer would be: # mount a local-fs device at /mnt/test, and export it via NFS with # fsid=0 export option (this is required) mount /dev/sda5 /mnt/test echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports service nfs restart # locally mount the nfs export with all default, note that I have # nfsv4.1 configured as the default nfs version, because of the # fsid export option, v4 mount would fail and fall back to v3 mount localhost:/mnt/test /mnt/nfs # try to umount the underlying device, but got EBUSY umount /mnt/nfs service nfs stop umount /mnt/test <=== EBUSY here Fixed it by defining a separate nfsd4_secinfo_no_name_release() function as the op_release method of OP_SECINFO_NO_NAME that releases the correct nfsd4_secinfo_no_name structure. Fixes: 34b1744c91cc ("nfsd4: define ->op_release for compound ops") Signed-off-by: Eryu Guan <eguan@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-10-05ovl: fix regression caused by exclusive upper/work dir protectionAmir Goldstein
Enforcing exclusive ownership on upper/work dirs caused a docker regression: https://github.com/moby/moby/issues/34672. Euan spotted the regression and pointed to the offending commit. Vivek has brought the regression to my attention and provided this reproducer: Terminal 1: mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/ Terminal 2: unshare -m Terminal 1: umount merged mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none merged/ mount: /root/overlay-testing/merged: none already mounted or mount point busy To fix the regression, I replaced the error with an alarming warning. With index feature enabled, mount does fail, but logs a suggestion to override exclusive dir protection by disabling index. Note that index=off mount does take the inuse locks, so a concurrent index=off will issue the warning and a concurrent index=on mount will fail. Documentation was updated to reflect this change. Fixes: 2cac0c00a6cd ("ovl: get exclusive ownership on upper/work dirs") Cc: <stable@vger.kernel.org> # v4.13 Reported-by: Euan Kemp <euank@euank.com> Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05ovl: fix missing unlock_rename() in ovl_do_copy_up()Amir Goldstein
Use the ovl_lock_rename_workdir() helper which requires unlock_rename() only on lock success. Fixes: ("fd210b7d67ee ovl: move copy up lock out") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05ovl: fix dentry leak in ovl_indexdir_cleanup()Amir Goldstein
index dentry was not released when breaking out of the loop due to index verification error. Fixes: 415543d5c64f ("ovl: cleanup bad and stale index entries on mount") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05ovl: fix dput() of ERR_PTR in ovl_cleanup_index()Amir Goldstein
Fixes: caf70cb2ba5d ("ovl: cleanup orphan index entries") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05ovl: fix error value printed in ovl_lookup_index()Amir Goldstein
Fixes: 359f392ca53e ("ovl: lookup index entry for copy up origin") Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-05ovl: fix may_write_real() for overlayfs directoriesAmir Goldstein
Overlayfs directory file_inode() is the overlay inode whether the real inode is upper or lower. This fixes a regression in xfstest generic/158. Fixes: 7c6893e3c9ab ("ovl: don't allow writing ioctl on lower layer") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-10-04NFSv4/pnfs: Fix an infinite layoutget loopTrond Myklebust
Since we can now use a lock stateid or a delegation stateid, that differs from the context stateid, we need to change the test in nfs4_layoutget_handle_exception() to take this into account. This fixes an infinite layoutget loop in the NFS client whereby it keeps retrying the initial layoutget using the same broken stateid. Fixes: 70d2f7b1ea19b ("pNFS: Use the standard I/O stateid when...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-04Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "A lot of stuff, sorry about that. A week on a beach, then a bunch of time catching up then more time letting it bake in -next. Shan't do that again!" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (51 commits) include/linux/fs.h: fix comment about struct address_space checkpatch: fix ignoring cover-letter logic m32r: fix build failure lib/ratelimit.c: use deferred printk() version kernel/params.c: improve STANDARD_PARAM_DEF readability kernel/params.c: fix an overflow in param_attr_show kernel/params.c: fix the maximum length in param_get_string mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function kernel/kcmp.c: drop branch leftover typo memremap: add scheduling point to devm_memremap_pages mm, page_alloc: add scheduling point to memmap_init_zone mm, memory_hotplug: add scheduling point to __add_pages lib/idr.c: fix comment for idr_replace() mm: memcontrol: use vmalloc fallback for large kmem memcg arrays kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv() include/linux/bitfield.h: remove 32bit from FIELD_GET comment block lib/lz4: make arrays static const, reduces object code size exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array exec: binfmt_misc: fix race between load_misc_binary() and kill_node() ...
2017-10-04Btrfs: fix overlap of fs_info::flags valuesTsutomu Itoh
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap, we should change the value. First, BTRFS_FS_EXCL_OP was set to 14. commit 171938e52807 ("btrfs: track exclusive filesystem operation in flags") Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14. commit f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE") As a result, the value 14 overlapped, by accident. This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16, the flags are internal. Fixes: f29efe292198 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE") CC: stable@vger.kernel.org # 4.13+ Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minimize the change, update only BTRFS_FS_EXCL_OP ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04btrfs: avoid overflow when sector_t is 32 bitGoffredo Baroncelli
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/) introduces a regression on 32 bit machines. When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large (2TB+) block devices and files) sector_t is 32 bit on 32bit machines. In the function submit_extent_page, 'sector' (which is sector_t type) is multiplied by 512 to convert it from sectors to bytes, leading to an overflow when the disk is bigger than 4GB (!). I added a cast to u64 to avoid overflow. Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc") CC: stable@vger.kernel.org # 4.13+ Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it> Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04lsm: fix smack_inode_removexattr and xattr_getsecurity memleakCasey Schaufler
security_inode_getsecurity() provides the text string value of a security attribute. It does not provide a "secctx". The code in xattr_getsecurity() that calls security_inode_getsecurity() and then calls security_release_secctx() happened to work because SElinux and Smack treat the attribute and the secctx the same way. It fails for cap_inode_getsecurity(), because that module has no secctx that ever needs releasing. It turns out that Smack is the one that's doing things wrong by not allocating memory when instructed to do so by the "alloc" parameter. The fix is simple enough. Change the security_release_secctx() to kfree() because it isn't a secctx being returned by security_inode_getsecurity(). Change Smack to allocate the string when told to do so. Note: this also fixes memory leaks for LSMs which implement inode_getsecurity but not release_secctx, such as capabilities. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: stable@vger.kernel.org Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-10-03 xfs: handle racy AIO in xfs_reflink_end_cowChristoph Hellwig
If we got two AIO writes into a COW area the second one might not have any COW extents left to convert. Handle that case gracefully instead of triggering an assert or accessing beyond the bounds of the extent list. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-03xfs: always swap the cow forks when swapping extentsDarrick J. Wong
Since the CoW fork exists as a secondary data structure to the data fork, we must always swap cow forks during swapext. We also need to swap the extent counts and reset the cowblocks tags. Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-10-03exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] arrayOleg Nesterov
After the previous change "fmt" can't go away, we can kill iname/iname_addr and use fmt->interpreter. Link: http://lkml.kernel.org/r/20170922143653.GA17232@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Woodard <woodard@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jim Foraker <foraker1@llnl.gov> Cc: <tdhooge@llnl.gov> Cc: Travis Gummels <tgummels@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03exec: binfmt_misc: fix race between load_misc_binary() and kill_node()Oleg Nesterov
load_misc_binary() makes a local copy of fmt->interpreter under entries_lock to avoid the race with kill_node() but this is not enough; the whole Node can be freed after we drop entries_lock, not only the ->interpreter string. Add dget/dput(fmt->dentry) to ensure bm_evict_inode() can't destroy/free this Node. Link: http://lkml.kernel.org/r/20170922143650.GA17227@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Woodard <woodard@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jim Foraker <foraker1@llnl.gov> Cc: Travis Gummels <tgummels@redhat.com> Cc: <tdhooge@llnl.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03exec: binfmt_misc: remove the confusing e->interp_file != NULL checksOleg Nesterov
If MISC_FMT_OPEN_FILE flag is set e->interp_file must be valid or we have a bug which should not be silently ignored. Link: http://lkml.kernel.org/r/20170922143647.GA17222@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Woodard <woodard@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jim Foraker <foraker1@llnl.gov> Cc: <tdhooge@llnl.gov> Cc: Travis Gummels <tgummels@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to ↵Oleg Nesterov
bm_evict_inode() To ensure that load_misc_binary() can't use the partially destroyed Node, see also the next patch. The current logic looks wrong in any case, once we close interp_file it doesn't make any sense to delay kfree(inode->i_private), this Node is no longer valid. Even if the MISC_FMT_OPEN_FILE/interp_file checks were not racy (they are), load_misc_binary() should not try to reopen ->interpreter if MISC_FMT_OPEN_FILE is set but ->interp_file is NULL. And I can't understand why do we use filp_close(), not fput(). Link: http://lkml.kernel.org/r/20170922143644.GA17216@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Woodard <woodard@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jim Foraker <foraker1@llnl.gov> Cc: <tdhooge@llnl.gov> Cc: Travis Gummels <tgummels@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03exec: binfmt_misc: don't nullify Node->dentry in kill_node()Oleg Nesterov
kill_node() nullifies/checks Node->dentry to avoid double free. This complicates the next changes and this is very confusing: - we do not need to check dentry != NULL under entries_lock, kill_node() is always called under inode_lock(d_inode(root)) and we rely on this inode_lock() anyway, without this lock the MISC_FMT_OPEN_FILE cleanup could race with itself. - if kill_inode() was already called and ->dentry == NULL we should not even try to close e->interp_file. We can change bm_entry_write() to simply check !list_empty(list) before kill_node. Again, we rely on inode_lock(), in particular it saves us from the race with bm_status_write(), another caller of kill_node(). Link: http://lkml.kernel.org/r/20170922143641.GA17210@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Woodard <woodard@redhat.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jim Foraker <foraker1@llnl.gov> Cc: <tdhooge@llnl.gov> Cc: Travis Gummels <tgummels@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03exec: load_script: kill the onstack interp[BINPRM_BUF_SIZE] arrayOleg Nesterov
Patch series "exec: binfmt_misc: fix use-after-free, kill iname[BINPRM_BUF_SIZE]". It looks like this code was always wrong, then commit 948b701a607f ("binfmt_misc: add persistent opened binary handler for containers") added more problems. This patch (of 6): load_script() can simply use i_name instead, it points into bprm->buf[] and nobody can change this memory until we call prepare_binprm(). The only complication is that we need to also change the signature of bprm_change_interp() but this change looks good too. While at it, do whitespace/style cleanups. NOTE: the real motivation for this change is that people want to increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but this looks more complicated because afaics it is very buggy. Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Travis Gummels <tgummels@redhat.com> Cc: Ben Woodard <woodard@redhat.com> Cc: Jim Foraker <foraker1@llnl.gov> Cc: <tdhooge@llnl.gov> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03userfaultfd: non-cooperative: fix fork use after freeAndrea Arcangeli
When reading the event from the uffd, we put it on a temporary fork_event list to detect if we can still access it after releasing and retaking the event_wqh.lock. If fork aborts and removes the event from the fork_event all is fine as long as we're still in the userfault read context and fork_event head is still alive. We've to put the event allocated in the fork kernel stack, back from fork_event list-head to the event_wqh head, before returning from userfaultfd_ctx_read, because the fork_event head lifetime is limited to the userfaultfd_ctx_read stack lifetime. Forgetting to move the event back to its event_wqh place then results in __remove_wait_queue(&ctx->event_wqh, &ewq->wq); in userfaultfd_event_wait_completion to remove it from a head that has been already freed from the reader stack. This could only happen if resolve_userfault_fork failed (for example if there are no file descriptors available to allocate the fork uffd). If it succeeded it was put back correctly. Furthermore, after find_userfault_evt receives a fork event, the forked userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can be released by the fork thread as soon as the event_wqh.lock is released. Taking a reference on the fork_nctx before dropping the lock prevents an use after free in resolve_userfault_fork(). If the fork side aborted and it already released everything, we still try to succeed resolve_userfault_fork(), if possible. Fixes: 893e26e61d04eac9 ("userfaultfd: non-cooperative: Add fork() event") Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03f2fs: fix potential panic during fstrimChao Yu
As Ju Hyung Park reported: "When 'fstrim' is called for manual trim, a BUG() can be triggered randomly with this patch. I'm seeing this issue on both x86 Desktop and arm64 Android phone. On x86 Desktop, this was caused during Ubuntu boot-up. I have a cronjob installed which calls 'fstrim -v /' during boot. On arm64 Android, this was caused during GC looping with 1ms gc_min_sleep_time & gc_max_sleep_time." Root cause of this issue is that f2fs_wait_discard_bios can only be used by f2fs_put_super, because during put_super there must be no other referrers, so it can ignore discard entry's reference count when removing the entry, otherwise in other caller we will hit bug_on in __remove_discard_cmd as there may be other issuer added reference count in discard entry. Thread A Thread B - issue_discard_thread - f2fs_ioc_fitrim - f2fs_trim_fs - f2fs_wait_discard_bios - __issue_discard_cmd - __submit_discard_cmd - __wait_discard_cmd - dc->ref++ - __wait_one_discard_bio - __wait_discard_cmd - __remove_discard_cmd - f2fs_bug_on(sbi, dc->ref) Fixes: 969d1b180d987c2be02de890d0fff0f66a0e80de Reported-by: Ju Hyung Park <qkrwngud825@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-10-02ceph: fix __choose_mds() for LSSNAP requestYan, Zheng
previous commit 5d37ca14 "ceph: send LSSNAP request to auth mds of directory inode" is buggy. It makes __choose_mds() choose mds base on hash of '.snap' dentry. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-10-02ceph: properly queue cap snap for newly created snap realmYan, Zheng
commit 3ae0bebc "ceph: queue cap snap only when snap realm's context changes" introduced a regression: we may not call queue_realm_cap_snaps() for newly created snap realm. This regression allows unflushed snapshot data to be overwritten. Link: http://tracker.ceph.com/issues/21483 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-10-01nfs/filelayout: fix oops when freeing filelayout segmentScott Mayhew
Check for a NULL dsaddr in filelayout_free_lseg() before calling nfs4_fl_put_deviceid(). This fixes the following oops: [ 1967.645207] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 [ 1967.646010] IP: [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4] [ 1967.646010] PGD c08bc067 PUD 915d3067 PMD 0 [ 1967.753036] Oops: 0000 [#1] SMP [ 1967.753036] Modules linked in: nfs_layout_nfsv41_files ext4 mbcache jbd2 loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache amd64_edac_mod ipmi_ssif edac_mce_amd edac_core kvm_amd sg kvm ipmi_si ipmi_devintf irqbypass pcspkr k8temp ipmi_msghandler i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mptsas ttm scsi_transport_sas mptscsih drm mptbase serio_raw i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod [ 1967.790031] CPU: 2 PID: 1370 Comm: ls Not tainted 3.10.0-709.el7.test.bz1463784.x86_64 #1 [ 1967.790031] Hardware name: IBM BladeCenter LS21 -[7971AC1]-/Server Blade, BIOS -[BAE155AUS-1.10]- 06/03/2009 [ 1967.790031] task: ffff8800c42a3f40 ti: ffff8800c4064000 task.ti: ffff8800c4064000 [ 1967.790031] RIP: 0010:[<ffffffffc06d6aea>] [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4] [ 1967.790031] RSP: 0000:ffff8800c4067978 EFLAGS: 00010246 [ 1967.790031] RAX: ffffffffc062f000 RBX: ffff8801d468a540 RCX: dead000000000200 [ 1967.790031] RDX: ffff8800c40679f8 RSI: ffff8800c4067a0c RDI: 0000000000000000 [ 1967.790031] RBP: ffff8800c4067980 R08: ffff8801d468a540 R09: 0000000000000000 [ 1967.790031] R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8801d468a540 [ 1967.790031] R13: ffff8800c40679f8 R14: ffff8801d5645300 R15: ffff880126f15ff0 [ 1967.790031] FS: 00007f11053c9800(0000) GS:ffff88012bd00000(0000) knlGS:0000000000000000 [ 1967.790031] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1967.790031] CR2: 0000000000000030 CR3: 0000000094b55000 CR4: 00000000000007e0 [ 1967.790031] Stack: [ 1967.790031] ffff8801d468a540 ffff8800c4067990 ffffffffc062d2fe ffff8800c40679b0 [ 1967.790031] ffffffffc062b5b4 ffff8800c40679f8 ffff8801d468a540 ffff8800c40679d8 [ 1967.790031] ffffffffc06d39af ffff8800c40679f8 ffff880126f16078 0000000000000001 [ 1967.790031] Call Trace: [ 1967.790031] [<ffffffffc062d2fe>] nfs4_fl_put_deviceid+0xe/0x10 [nfs_layout_nfsv41_files] [ 1967.790031] [<ffffffffc062b5b4>] filelayout_free_lseg+0x24/0x90 [nfs_layout_nfsv41_files] [ 1967.790031] [<ffffffffc06d39af>] pnfs_free_lseg_list+0x5f/0x80 [nfsv4] [ 1967.790031] [<ffffffffc06d5a67>] _pnfs_return_layout+0x157/0x270 [nfsv4] [ 1967.790031] [<ffffffffc06c17dd>] nfs4_evict_inode+0x4d/0x70 [nfsv4] [ 1967.790031] [<ffffffff8121de19>] evict+0xa9/0x180 [ 1967.790031] [<ffffffff8121e729>] iput+0xf9/0x190 [ 1967.790031] [<ffffffffc0652cea>] nfs_dentry_iput+0x3a/0x50 [nfs] [ 1967.790031] [<ffffffff8121ab4f>] shrink_dentry_list+0x20f/0x490 [ 1967.790031] [<ffffffff8121b018>] d_invalidate+0xd8/0x150 [ 1967.790031] [<ffffffffc065446b>] nfs_readdir_page_filler+0x40b/0x600 [nfs] [ 1967.790031] [<ffffffffc0654bbd>] nfs_readdir_xdr_to_array+0x20d/0x3b0 [nfs] [ 1967.790031] [<ffffffff811f3482>] ? __mem_cgroup_commit_charge+0xe2/0x2f0 [ 1967.790031] [<ffffffff81183208>] ? __add_to_page_cache_locked+0x48/0x170 [ 1967.790031] [<ffffffffc0654d60>] ? nfs_readdir_xdr_to_array+0x3b0/0x3b0 [nfs] [ 1967.790031] [<ffffffffc0654d82>] nfs_readdir_filler+0x22/0x90 [nfs] [ 1967.790031] [<ffffffff8118351f>] do_read_cache_page+0x7f/0x190 [ 1967.790031] [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0 [ 1967.790031] [<ffffffff8118366c>] read_cache_page+0x1c/0x30 [ 1967.790031] [<ffffffffc0654f9b>] nfs_readdir+0x1ab/0x6b0 [nfs] [ 1967.790031] [<ffffffffc06bd1c0>] ? nfs4_xdr_dec_layoutget+0x270/0x270 [nfsv4] [ 1967.790031] [<ffffffff81215d30>] ? fillonedir+0xe0/0xe0 [ 1967.790031] [<ffffffff81215c20>] vfs_readdir+0xb0/0xe0 [ 1967.790031] [<ffffffff81216045>] SyS_getdents+0x95/0x120 [ 1967.790031] [<ffffffff816b9449>] system_call_fastpath+0x16/0x1b [ 1967.790031] Code: 90 31 d2 48 89 d0 5d c3 85 f6 74 f5 8d 4e 01 89 f0 f0 0f b1 0f 39 f0 74 e2 89 c6 eb eb 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 53 <48> 8b 47 30 48 89 fb a8 04 74 3b 8b 57 60 83 fa 02 74 19 8d 4a [ 1967.790031] RIP [<ffffffffc06d6aea>] nfs4_put_deviceid_node+0xa/0x90 [nfsv4] [ 1967.790031] RSP <ffff8800c4067978> [ 1967.790031] CR2: 0000000000000030 Signed-off-by: Scott Mayhew <smayhew@redhat.com> Fixes: 1ebf98012792 ("NFS/filelayout: Fix racy setting of fl->dsaddr...") Cc: stable@vger.kernel.org # v4.13+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01NFS: Fix uninitialized rpc_wait_queueBenjamin Coddington
Michael Sterrett reports a NULL pointer dereference on NFSv3 mounts when CONFIG_NFS_V4 is not set because the NFS UOC rpc_wait_queue has not been initialized. Move the initialization of the queue out of the CONFIG_NFS_V4 conditional setion. Fixes: 7d6ddf88c4db ("NFS: Add an iocounter wait function for async RPC tasks") Cc: stable@vger.kernel.org # 4.11+ Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01NFS: Cleanup error handling in nfs_idmap_request_key()Dan Carpenter
nfs_idmap_get_desc() can't actually return zero. But if it did then we would return ERR_PTR(0) which is NULL and the caller, nfs_idmap_get_key(), doesn't expect that so it leads to a NULL pointer dereference. I've cleaned this up by changing the "<=" to "<" so it's more clear that we don't return ERR_PTR(0). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01nfs: RPC_MAX_AUTH_SIZE is in bytesJ. Bruce Fields
The units of RPC_MAX_AUTH_SIZE is bytes, not 4-byte words. This causes the client to request a larger-than-necessary session replay slot size. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "The scheduler pull request comes with the following updates: - Prevent a divide by zero issue by validating the input value of sysctl_sched_time_avg - Make task state printing consistent all over the place and have explicit state characters for IDLE and PARKED so they wont be displayed as 'D' state which confuses tools" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/sysctl: Check user input value of sysctl_sched_time_avg sched/debug: Add explicit TASK_PARKED printing sched/debug: Ignore TASK_IDLE for SysRq-W sched/debug: Add explicit TASK_IDLE printing sched/tracing: Use common task-state helpers sched/tracing: Fix trace_sched_switch task-state printing sched/debug: Remove unused variable sched/debug: Convert TASK_state to hex sched/debug: Implement consistent task-state printing