path: root/fs
Age | Commit message | Author
2018-01-17 | xfs: add scrub cross-referencing helpers for the free space btrees | Darrick J. Wong
Add a couple of functions to the free space btrees that will be used to cross-reference metadata against the bnobt/cntbt, and a generic btree function that provides the real implementation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
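A minimal sketch of the cross-referencing pattern this commit describes: ask the free space btree whether any record overlaps an extent, and flag a scrub problem if allocated metadata shows up as free space. The helper and hook names below (xfs_alloc_has_record, xfs_scrub_btree_xref_set_corrupt) are assumptions for illustration, not a quote of the actual scrub API.

    /* Illustrative sketch only; names are hypothetical. */
    static int xref_check_not_free(struct xfs_scrub_context *sc,
                                   struct xfs_btree_cur *bno_cur,
                                   xfs_agblock_t bno, xfs_extlen_t len)
    {
            bool is_free;
            int error;

            /* Ask the bnobt whether any free-space record overlaps the extent. */
            error = xfs_alloc_has_record(bno_cur, bno, len, &is_free);
            if (error)
                    return error;

            /* Allocated metadata appearing as free space is a corruption. */
            if (is_free)
                    xfs_scrub_btree_xref_set_corrupt(sc, bno_cur, 0);
            return 0;
    }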
2018-01-18 | ubifs: remove error message in ubifs_xattr_get | Rock Lee
Other modules, such as overlayfs, may try to get an xattr value with a small buffer; if they get -ERANGE, they will try again with a properly sized buffer. There is no need to report an error in that case. Signed-off-by: Rock Lee <rli@sierrawireless.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | ubifs: switch to fscrypt_prepare_setattr() | Eric Biggers
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | ubifs: switch to fscrypt_prepare_lookup() | Eric Biggers
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | ubifs: switch to fscrypt_prepare_rename() | Eric Biggers
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | ubifs: switch to fscrypt_prepare_link() | Eric Biggers
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | ubifs: switch to fscrypt_file_open() | Eric Biggers
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | ubifs: Fix uninitialized variable in search_dh_cookie() | Geert Uytterhoeven
fs/ubifs/tnc.c: In function ‘search_dh_cookie’:
fs/ubifs/tnc.c:1893: warning: ‘err’ is used uninitialized in this function

Indeed, err is always used uninitialized. According to an original review comment from Hyunchul, acknowledged by Richard, err should be initialized to -ENOENT to avoid the first call to tnc_next(). But we can achieve the same by reordering the code. Fixes: 781f675e2d7e ("ubifs: Fix unlink code wrt. double hash lookups") Reported-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 | Turn gfs2_block_truncate_page into gfs2_block_zero_range | Andreas Gruenbacher
Turn gfs2_block_truncate_page into a function that zeroes a range within a block rather than only the end of a block. This will be used for cleaning the end of the first partial block and the start of the last partial block when punching a hole in a file. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
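A sketch of how such a generalized zero-range helper gets used when punching a hole, as the commit anticipates; the exact gfs2_block_zero_range signature and the wrapper below are assumptions, and the case where the whole hole fits in one block is elided for brevity.

    /* Sketch, assuming gfs2_block_zero_range(inode, from, length). */
    static int punch_hole_edges(struct inode *inode, loff_t offset, loff_t length)
    {
            u64 blocksize = i_blocksize(inode);
            u64 start_off = offset & (blocksize - 1);
            u64 end_off = (offset + length) & (blocksize - 1);
            int error = 0;

            /* Zero the tail of the first partial block ... */
            if (start_off)
                    error = gfs2_block_zero_range(inode, offset,
                                                  blocksize - start_off);

            /* ... and the head of the last partial block. */
            if (!error && end_off)
                    error = gfs2_block_zero_range(inode,
                                                  offset + length - end_off,
                                                  end_off);
            return error;
    }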
2018-01-17 | gfs2: Improve non-recursive delete algorithm | Andreas Gruenbacher
In rare cases, the current non-recursive delete algorithm doesn't deallocate empty intermediary indirect blocks. This should have very little practical effect, but deallocating all blocks correctly should still be preferable as it is cleaner and easier to validate. The fix consists of using the first block to deallocate to compute the start marker of the truncate point instead of the last block that needs to be kept. With that change, computing which indirect blocks are still needed becomes relatively easy. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17 | gfs2: Fix metadata read-ahead during truncate | Andreas Gruenbacher
The metadata read-ahead algorithm broke when switching from recursive to non-recursive delete: the current algorithm reads ahead blocks at height N - 1 while deallocating the blocks at height N. However, deallocating the blocks at height N requires a complete walk of the metadata tree, not only down to height N - 1. Consequently, all blocks below height N - 1 will be accessed without read-ahead. Fix this by issuing read-aheads as early as possible, after each metapath lookup. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17 | gfs2: Clean up {lookup,fillup}_metapath | Andreas Gruenbacher
Split out the entire lookup loop from lookup_metapath and fillup_metapath. Make both functions return the actual height in mp->mp_aheight, and return 0 on success. Handle lookup errors properly in trunc_dealloc. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
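A caller-side sketch of the convention the commit describes: 0 means the walk itself succeeded, and mp->mp_aheight reports how far down the tree the lookup actually got. The wrapper function is illustrative, not gfs2 source.

    /* Illustrative caller, assuming gfs2's struct metapath fields. */
    static int walk_then_dealloc(struct gfs2_inode *ip, struct metapath *mp)
    {
            int ret = lookup_metapath(ip, mp);

            if (ret)
                    return ret;             /* a real lookup error */

            if (mp->mp_aheight < ip->i_height) {
                    /* The path ended early: this range is a hole. */
                    return 0;
            }
            /* Full-height path found; safe to deallocate from here. */
            return 0;
    }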
2018-01-17 | gfs2: Remove minor gfs2_journaled_truncate inefficiencies | Andreas Gruenbacher
First, this function truncates the file in chunks. When the original file size isn't block aligned, each truncated chunk will remain misaligned, which is inefficient. Second, this function doesn't recognize holes, so it loops through them, creating a new transaction for each chunk of a hole. At least avoid creating another transaction when the current one is still empty. (A better fix would be to skip large holes, of course.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17 | gfs2: truncate: Remove unnecessary oldsize parameters | Andreas Gruenbacher
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17 | gfs2: Clean up trunc_start error path | Andreas Gruenbacher
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-17 | gfs2: Remove pointless BUG_ON | Andreas Gruenbacher
The current transaction is being dereferenced before asserting that it is not NULL; that isn't going to help. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
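A schematic example of the anti-pattern being removed; the field name is illustrative. If the pointer were NULL, the kernel would already have oopsed on the dereference, so the assertion can never fire usefully.

    /* Sketch of the removed pattern, not the exact gfs2 code. */
    struct gfs2_trans *tr = current->journal_info;

    tr->tr_touched = 1;     /* oops here if tr == NULL ... */
    BUG_ON(tr == NULL);     /* ... so this check is dead code */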
2018-01-17 | gfs2: Add gfs2_blk2rgrpd comment and fix incorrect use | Steven Whitehouse
Document when to use gfs2_blk2rgrpd for "inexact" resource group matching. Based on that, fix an incorrect use of gfs2_blk2rgrpd in sweep_bh_for_rgrps. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-01-16 | f2fs: add resgid and resuid to reserve root blocks | Jaegeuk Kim
This patch adds the resgid=%u and resuid=%u mount options for reserving blocks. They only take effect together with reserve_root=%u. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
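A hypothetical invocation combining these options with reserve_root (device and mount point are placeholders, and the assumption here is that resuid/resgid designate who may use the reserved pool, analogous to ext4's resuid/resgid):

    # reserve 10000 blocks; uid 1000 / gid 1000 may also allocate from them
    mount -t f2fs -o reserve_root=10000,resuid=1000,resgid=1000 /dev/vdb /mnt/f2fs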
2018-01-16 | f2fs: implement cgroup writeback support | Yufen Yu
Cgroup writeback requires explicit support from the filesystem. f2fs's data and node writeback IOs go through __write_data_page, which sets up an fio for submitting IOs. So we add io_wbc to the fio, associate bios with a blkcg by invoking wbc_init_bio(), and account issued IOs with wbc_account_io(). In addition, f2fs_fill_super() is updated to set SB_I_CGROUPWB. Meta writeback IOs are left alone by this patch and will always be attributed to the root cgroup. The results show that f2fs can throttle writeback nicely for data writing and file creation. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
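A sketch of the wiring the commit describes. f2fs's real fio/bio plumbing is more involved; the wrapper function and call sites shown here are schematic, while wbc_init_bio(), wbc_account_io(), and SB_I_CGROUPWB are the kernel interfaces named in the commit.

    /* Schematic: associate writeback bios with the dirtying cgroup. */
    static void submit_one_page(struct writeback_control *wbc,
                                struct bio *bio, struct page *page)
    {
            if (wbc) {
                    wbc_init_bio(wbc, bio);               /* tie bio to the blkcg */
                    wbc_account_io(wbc, page, PAGE_SIZE); /* charge IO to the cgroup */
            }
            submit_bio(bio);
    }

    /* And in f2fs_fill_super(), opt in to cgroup writeback: */
    /*     sb->s_iflags |= SB_I_CGROUPWB; */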
2018-01-16 | f2fs: remove unused pend_list_tag | Chao Yu
In commit 78997b569f56 ("f2fs: split discard policy"), we got rid of all uses of the pend_list_tag field in struct discard_cmd_control, but forgot to remove the field itself; do that now. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 | f2fs: avoid high cpu usage in discard thread | Chao Yu
generic/476 takes a very long time to finish because we check the consistency of all discard entries in the global rb tree while traversing the pending lists of every granularity, even when a list is empty. To avoid that unneeded overhead, skip the check when coming upon an empty list.

generic/476 time consumption:
    Before patch, w/o consistency check:   57s
    Before patch, w/ consistency check:  1426s
    After patch:                           78s

Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
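The schematic shape of the fix: bail out of the per-granularity loop body before running the rb-tree consistency check when there is nothing to verify. Structure and the check-function name are assumptions based on the commit text.

    /* Sketch; dcc/pend_list layout follows f2fs, the check is hypothetical. */
    for (i = 0; i < MAX_PLIST_NUM; i++) {
            struct list_head *pend_list = &dcc->pend_list[i];

            if (list_empty(pend_list))
                    continue;       /* nothing to verify at this granularity */

            verify_discard_entries(dcc, pend_list); /* hypothetical rb-tree check */
    }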
2018-01-16 | f2fs: make local functions static | Wei Yongjun
Fixes the following sparse warnings:

    fs/f2fs/segment.c:887:6: warning: symbol '__check_sit_bitmap' was not declared. Should it be static?
    fs/f2fs/segment.c:1327:6: warning: symbol 'f2fs_wait_discard_bio' was not declared. Should it be static?
    fs/f2fs/super.c:1661:5: warning: symbol 'f2fs_get_projid' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 | f2fs: add reserved blocks for root user | Jaegeuk Kim
This patch allows root to reserve some blocks via a mount option: "-o reserve_root=N" reserves N 4KB blocks for root only. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 | f2fs: check segment type in __f2fs_replace_block | Yunlong Song
In some cases, node blocks have a wrong blkaddr whose segment type is NODE; e.g., a recovered inode is missing its xattr flag while its blkaddr lies in the xattr range. Since fsck.f2fs does not check the recovery nodes, this causes __f2fs_replace_block to change the node curseg and call update_sit_entry(sbi, new_blkaddr, 1) without refreshing next_blkoff. As a result, when the recovery process writes a checkpoint and syncs nodes, the curseg's next_blkoff is already in use in the segment bitmap, which triggers f2fs_bug_on. So let's check the segment type in __f2fs_replace_block. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 | f2fs: update inode info to inode page for new file | Yunlei He
After a checkpoint:
1. Create a new file A (with a dirty inode, a dirty inode page, and xattr info).
2. Background writeback writes back file A's inode page (without updating it from the inode cache).
3. fsync file A, writing back file A's inode page with the inode cache info.
4. Sudden power-off before a new checkpoint.

In this case, the recovery process will try to recover a zero inode page. The inline xattr flag of file A will be lost and the xattr info will be taken as a blkaddr index. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-01-16 | f2fs: show precise # of blocks that user/root can use | Jaegeuk Kim
Let's show the precise number of blocks that users and root can use, through bavail and bfree respectively. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
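The statfs arithmetic this implies, in schematic form: bfree is what root can still allocate, and bavail additionally subtracts the root-reserved blocks that ordinary users cannot touch. Field names here are illustrative, not the actual f2fs_statfs() variables.

    /* Schematic f2fs_statfs() fragment; names are illustrative. */
    buf->f_bfree  = total_blocks - valid_blocks - overprovision_blocks;
    buf->f_bavail = buf->f_bfree - root_reserved_blocks;  /* what a user sees */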
2018-01-16 | xfs: cancel tx on xfs_defer_finish() error during xattr set/remove | Brian Foster
Chris Dunlop reports a problem where an xattr operation fails, reports the following error to syslog and hangs during unmount:

    ================================================
    [ BUG: lock held when returning to user space! ]
    ...
    ------------------------------------------------
    <PID> is leaving the kernel with locks still held!
    1 lock held by <PID>:
     #0:  (sb_internal){......}, at: [<ffffffffa07692a3>] xfs_trans_alloc+0xe3/0x130 [xfs]

The failure/shutdown occurs during deferred ops processing which leads to an error return from xfs_defer_finish() via xfs_attr_leaf_addname(). While the root cause of the failure is unknown corruption, the cause of the subsequent BUG above and unmount hang is failure to cancel the transaction before returning to userspace. The transaction is not cancelled because the out_defer_cancel error handling paths in the xfs_attr_[leaf|node]_[add|remove]name() functions clear args.trans without releasing the transaction. The callers therefore lose the reference to the transaction and fail to cancel it. Since xfs_attr_[set|remove]() always cancel args.trans when != NULL and xfs_defer_finish()->...->xfs_trans_roll() should always return with a valid transaction, update the leaf/node xattr functions to not reset args.trans in the error path responsible for cancelling deferred ops. Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
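The shape of the bug and the fix, condensed from the commit text; the label and surrounding structure are schematic, not the exact XFS source.

    /* Schematic error path in xfs_attr_[leaf|node]_[add|remove]name(). */
    out_defer_cancel:
            xfs_defer_cancel(args->dfops);
            /*
             * Buggy version also did:  args->trans = NULL;
             * which turned the caller's "if (args.trans) cancel it" into a
             * no-op, leaking a dirty transaction back to userspace.  The fix
             * leaves args->trans intact so xfs_attr_[set|remove]() cancels it.
             */
            return error;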
2018-01-16 | NFS: commit direct writes even if they fail partially | J. Bruce Fields
If some of the WRITE calls making up an O_DIRECT write syscall fail, we neglect to commit, even if some of the WRITEs succeed. We also depend on the commit code to free the reference count on the nfs_page taken in the "if (request_commit)" case at the end of nfs_direct_write_completion(). The problem was originally noticed because ENOSPC's encountered partway through a write would result in a closed file being sillyrenamed when it should have been unlinked. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-16 | nfs: remove unused label in nfs_encode_fh() | Arnd Bergmann
The only reference to the label got removed, so we now get a harmless compiler warning:

    fs/nfs/export.c: In function 'nfs_encode_fh':
    fs/nfs/export.c:58:1: error: label 'out' defined but not used [-Werror=unused-label]

Fixes: aaa150089465 ("nfs: remove dead code from nfs_encode_fh()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-15 | cifs: Define usercopy region in cifs_request slab cache | David Windsor
CIFS request buffers, stored in the cifs_request slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/cifs/cifsfs.c:
        cifs_init_request_bufs():
            ...
            cifs_req_poolp = mempool_create_slab_pool(cifs_min_rcv, cifs_req_cachep);

    fs/cifs/misc.c:
        cifs_buf_get():
            ...
            ret_buf = mempool_alloc(cifs_req_poolp, GFP_NOFS);
            ...
            return ret_buf;

In support of usercopy hardening, this patch defines a region in the cifs_request slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 | vxfs: Define usercopy region in vxfs_inode slab cache | David Windsor
vxfs symlink pathnames, stored in struct vxfs_inode_info field vii_immed.vi_immed and therefore contained in the vxfs_inode slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/freevxfs/vxfs_super.c:
        vxfs_alloc_inode(...):
            ...
            vi = kmem_cache_alloc(vxfs_inode_cachep, GFP_KERNEL);
            ...
            return &vi->vfs_inode;

    fs/freevxfs/vxfs_inode.c:
        vxfs_iget(...):
            ...
            inode->i_link = vip->vii_immed.vi_immed;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined in vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the vxfs_inode slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 | ufs: Define usercopy region in ufs_inode_cache slab cache | David Windsor
The ufs symlink pathnames, stored in struct ufs_inode_info.i_u1.i_symlink and therefore contained in the ufs_inode_cache slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/ufs/super.c:
        ufs_alloc_inode(...):
            ...
            ei = kmem_cache_alloc(ufs_inode_cachep, GFP_NOFS);
            ...
            return &ei->vfs_inode;

    fs/ufs/ufs.h:
        UFS_I(struct inode *inode):
            return container_of(inode, struct ufs_inode_info, vfs_inode);

    fs/ufs/namei.c:
        ufs_symlink(...):
            ...
            inode->i_link = (char *)UFS_I(inode)->i_u1.i_symlink;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined in vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the ufs_inode_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 | orangefs: Define usercopy region in orangefs_inode_cache slab cache | David Windsor
orangefs symlink pathnames, stored in struct orangefs_inode_s.link_target and therefore contained in the orangefs_inode_cache, need to be copied to/from userspace.

cache object allocation:
    fs/orangefs/super.c:
        orangefs_alloc_inode(...):
            ...
            orangefs_inode = kmem_cache_alloc(orangefs_inode_cache, ...);
            ...
            return &orangefs_inode->vfs_inode;

    fs/orangefs/orangefs-utils.c:
        exofs_symlink(...):
            ...
            inode->i_link = orangefs_inode->link_target;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined in vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the orangefs_inode_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 | exofs: Define usercopy region in exofs_inode_cache slab cache | David Windsor
The exofs short symlink names, stored in struct exofs_i_info.i_data and therefore contained in the exofs_inode_cache slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/exofs/super.c:
        exofs_alloc_inode(...):
            ...
            oi = kmem_cache_alloc(exofs_inode_cachep, GFP_KERNEL);
            ...
            return &oi->vfs_inode;

    fs/exofs/namei.c:
        exofs_symlink(...):
            ...
            inode->i_link = (char *)oi->i_data;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined in vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the exofs_inode_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Boaz Harrosh <ooo@electrozaur.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 | befs: Define usercopy region in befs_inode_cache slab cache | David Windsor
befs symlink pathnames, stored in struct befs_inode_info.i_data.symlink and therefore contained in the befs_inode_cache slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/befs/linuxvfs.c:
        befs_alloc_inode(...):
            ...
            bi = kmem_cache_alloc(befs_inode_cachep, GFP_KERNEL);
            ...
            return &bi->vfs_inode;

        befs_iget(...):
            ...
            strlcpy(befs_ino->i_data.symlink, raw_inode->data.symlink, BEFS_SYMLINK_LEN);
            ...
            inode->i_link = befs_ino->i_data.symlink;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined in vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the befs_inode_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Luis de Bethencourt <luisbg@kernel.org> Cc: Salah Triki <salah.triki@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Luis de Bethencourt <luisbg@kernel.org>
2018-01-15 | jfs: Define usercopy region in jfs_ip slab cache | David Windsor
The jfs symlink pathnames, stored in struct jfs_inode_info.i_inline and therefore contained in the jfs_ip slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/jfs/super.c:
        jfs_alloc_inode(...):
            ...
            jfs_inode = kmem_cache_alloc(jfs_inode_cachep, GFP_NOFS);
            ...
            return &jfs_inode->vfs_inode;

    fs/jfs/jfs_incore.h:
        JFS_IP(struct inode *inode):
            return container_of(inode, struct jfs_inode_info, vfs_inode);

    fs/jfs/inode.c:
        jfs_iget(...):
            ...
            inode->i_link = JFS_IP(inode)->i_inline;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined in vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the jfs_ip slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Dave Kleikamp <shaggy@kernel.org> Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-01-15 | ext2: Define usercopy region in ext2_inode_cache slab cache | David Windsor
The ext2 symlink pathnames, stored in struct ext2_inode_info.i_data and therefore contained in the ext2_inode_cache slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/ext2/super.c:
        ext2_alloc_inode(...):
            struct ext2_inode_info *ei;
            ...
            ei = kmem_cache_alloc(ext2_inode_cachep, GFP_NOFS);
            ...
            return &ei->vfs_inode;

    fs/ext2/ext2.h:
        EXT2_I(struct inode *inode):
            return container_of(inode, struct ext2_inode_info, vfs_inode);

    fs/ext2/namei.c:
        ext2_symlink(...):
            ...
            inode->i_link = (char *)&EXT2_I(inode)->i_data;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        (inlined into vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the ext2_inode_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: Jan Kara <jack@suse.com> Cc: linux-ext4@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Jan Kara <jack@suse.cz>
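The mechanism behind this whole patch series is kmem_cache_create_usercopy(), which takes a user-copyable offset/size pair in addition to the usual arguments. A sketch of the ext2 cache creation it implies, whitelisting only the i_data field; the slab flags shown are representative, not a quote of the actual patch.

    /* Sketch: whitelist only i_data within each ext2_inode_info object. */
    ext2_inode_cachep = kmem_cache_create_usercopy("ext2_inode_cache",
                            sizeof(struct ext2_inode_info), 0,
                            SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT,
                            offsetof(struct ext2_inode_info, i_data),
                            sizeof_field(struct ext2_inode_info, i_data),
                            init_once);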
2018-01-15 | ext4: Define usercopy region in ext4_inode_cache slab cache | David Windsor
The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data and therefore contained in the ext4_inode_cache slab cache, need to be copied to/from userspace.

cache object allocation:
    fs/ext4/super.c:
        ext4_alloc_inode(...):
            struct ext4_inode_info *ei;
            ...
            ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
            ...
            return &ei->vfs_inode;

    include/trace/events/ext4.h:
        #define EXT4_I(inode) \
            (container_of(inode, struct ext4_inode_info, vfs_inode))

    fs/ext4/namei.c:
        ext4_symlink(...):
            ...
            inode->i_link = (char *)&EXT4_I(inode)->i_data;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len)

        (inlined into vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the ext4_inode_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, provide usage trace] Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 | vfs: Copy struct mount.mnt_id to userspace using put_user() | David Windsor
The mnt_id field can be copied with put_user(), so there is no need to use copy_to_user(). In both cases, hardened usercopy is being bypassed since the size is constant, and not open to runtime manipulation. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
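A before/after sketch of the substitution, assuming the call site in question is the mount-id copy done for name_to_handle_at(); treat the exact surrounding code as an assumption.

    /* Before (schematic): a fixed-size copy_to_user of a single int. */
    if (copy_to_user(mnt_id, &real_mount(path.mnt)->mnt_id, sizeof(*mnt_id)))
            return -EFAULT;

    /* After: put_user() does the same job for a scalar, and since the
     * size is a compile-time constant, hardened usercopy was never
     * involved anyway. */
    if (put_user(real_mount(path.mnt)->mnt_id, mnt_id))
            return -EFAULT;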
2018-01-15 | vfs: Define usercopy region in names_cache slab caches | David Windsor
VFS pathnames are stored in the names_cache slab cache, either inline or across an entire allocation entry (when approaching PATH_MAX). These are copied to/from userspace, so they must be entirely whitelisted.

cache object allocation:
    include/linux/fs.h:
        #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL)

example usage trace:
    strncpy_from_user+0x4d/0x170
    getname_flags+0x6f/0x1f0
    user_path_at_empty+0x23/0x40
    do_mount+0x69/0xda0
    SyS_mount+0x83/0xd0

    fs/namei.c:
        getname_flags(...):
            ...
            result = __getname();
            ...
            kname = (char *)result->iname;
            result->name = kname;
            len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
            ...
            if (unlikely(len == EMBEDDED_NAME_MAX)) {
                const size_t size = offsetof(struct filename, iname[1]);
                kname = (char *)result;
                result = kzalloc(size, GFP_KERNEL);
                ...
                result->name = kname;
                len = strncpy_from_user(kname, filename, PATH_MAX);

In support of usercopy hardening, this patch defines the entire cache object in the names_cache slab cache as whitelisted, since it may entirely hold name strings to be copied to/from userspace. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, add usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
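In contrast with the per-field whitelists elsewhere in this series, whole-object whitelisting is expressed by passing useroffset 0 and usersize equal to the object size. A sketch of the names_cache creation (flags representative, not a quote of the patch):

    /* Sketch: the entire PATH_MAX object is a legal usercopy target. */
    names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
                            SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                            0, PATH_MAX, NULL);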
2018-01-15 | dcache: Define usercopy region in dentry_cache slab cache | David Windsor
When a dentry name is short enough, it can be stored directly in the dentry itself (instead of in a separate kmalloc allocation). These dentry short names, stored in struct dentry.d_iname and therefore contained in the dentry_cache slab cache, need to be copied to userspace.

cache object allocation:
    fs/dcache.c:
        __d_alloc(...):
            ...
            dentry = kmem_cache_alloc(dentry_cache, ...);
            ...
            dentry->d_name.name = dentry->d_iname;

example usage trace:
    filldir+0xb0/0x140
    dcache_readdir+0x82/0x170
    iterate_dir+0x142/0x1b0
    SyS_getdents+0xb5/0x160

    fs/readdir.c:
        (called via ctx.actor by dir_emit)
        filldir(..., const char *name, ...):
            ...
            copy_to_user(..., name, namlen)

    fs/libfs.c:
        dcache_readdir(...):
            ...
            next = next_positive(dentry, p, 1)
            ...
            dir_emit(..., next->d_name.name, ...)

In support of usercopy hardening, this patch defines a region in the dentry_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamic copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust hunks for kmalloc-specific things moved later] [kees: adjust commit log, provide usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-14 | nfs: Update server port after referral or migration | Chuck Lever
After traversing a referral or recovering from a migration event, ensure that the server port reported in /proc/mounts is updated to the correct port setting for the new submount. Reported-by: Helen Chao <helen.chao@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 | nfs: Referrals should use the same proto setting as their parent | Chuck Lever
Helen Chao <helen.chao@oracle.com> noticed that when a user traverses a referral on an NFS/RDMA mount, the resulting submount always uses TCP. This behavior does not match the vers= setting when traversing a referral (vers=4.1 is preserved). It also does not match the behavior of crossing from the pseudofs into a real filesystem (proto=rdma is preserved in that case). The Linux NFS client does not currently support the fs_locations_info attribute. The situation is similar for all NFSv4 servers I know of. Therefore, until the community has broad support for fs_locations_info, when following a referral:

- First try to connect with RPC-over-RDMA. This will fail quickly if the client has no RDMA-capable interfaces.
- If connecting with RPC-over-RDMA fails, or the RPC-over-RDMA transport is not available, use TCP.

Reported-by: Helen Chao <helen.chao@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 | lockd: convert nlm_rqst.a_count from atomic_t to refcount_t | Elena Reshetova
atomic_t variables are currently used to implement reference counters with the following properties:

- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further increments aren't allowed
- counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nlm_rqst.a_count is used as pure reference counter. Convert it to refcount_t and fix up the operations.

**Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage.

For the nlm_rqst.a_count it might make a difference in following places:
- nlmclnt_release_call() and nlmsvc_release_call(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
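The mechanical shape of these conversions, shown once here since the three lockd commits that follow apply the same pattern to different fields; the free helper name below is illustrative.

    /* before: plain atomic_t used as a refcount */
    atomic_set(&req->a_count, 1);
    atomic_inc(&req->a_count);
    if (atomic_dec_and_test(&req->a_count))
            free_rqst(req);                 /* illustrative helper */

    /* after: refcount_t traps overflow/underflow instead of wrapping */
    refcount_set(&req->a_count, 1);
    refcount_inc(&req->a_count);
    if (refcount_dec_and_test(&req->a_count))
            free_rqst(req);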
2018-01-14 | lockd: convert nlm_lockowner.count from atomic_t to refcount_t | Elena Reshetova
atomic_t variables are currently used to implement reference counters with the following properties:

- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further increments aren't allowed
- counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nlm_lockowner.count is used as pure reference counter. Convert it to refcount_t and fix up the operations.

**Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage.

For the nlm_lockowner.count it might make a difference in following places:
- nlm_put_lockowner(): decrement in refcount_dec_and_lock() only provides RELEASE ordering, control dependency on success and holds a spin lock on success vs. fully ordered atomic counterpart. No changes in spin lock guarantees.

Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 | lockd: convert nsm_handle.sm_count from atomic_t to refcount_t | Elena Reshetova
atomic_t variables are currently used to implement reference counters with the following properties:

- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further increments aren't allowed
- counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nsm_handle.sm_count is used as pure reference counter. Convert it to refcount_t and fix up the operations.

**Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage.

For the nsm_handle.sm_count it might make a difference in following places:
- nsm_release(): decrement in refcount_dec_and_lock() only provides RELEASE ordering, control dependency on success and holds a spin lock on success vs. fully ordered atomic counterpart. No change for the spin lock guarantees.

Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 | lockd: convert nlm_host.h_count from atomic_t to refcount_t | Elena Reshetova
atomic_t variables are currently used to implement reference counters with the following properties:

- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further increments aren't allowed
- counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nlm_host.h_count is used as pure reference counter. Convert it to refcount_t and fix up the operations.

**Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage.

For the nlm_host.h_count it might make a difference in following places:
- nlmsvc_release_host(): decrement in refcount_dec() provides RELEASE ordering, while original atomic_dec() was fully unordered. Since the change is for better, it should not matter.
- nlmclnt_release_host(): decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart. It doesn't seem to matter in this case since object freeing happens under mutex lock anyway.

Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 | nfs/pnfs: fix nfs_direct_req ref leak when i/o falls back to the mds | Scott Mayhew
Currently when falling back to doing I/O through the MDS (via pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header without releasing the reference taken on the dreq via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init -> nfs_direct_pgio_init. It then takes another reference on the dreq via nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and as a result the requester will become stuck in inode_dio_wait. Once that happens, other processes accessing the inode will become stuck as well. Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean up correctly by calling hdr->completion_ops->completion() instead of calling hdr->release() directly. This can be reproduced (sometimes) by performing "storage failover takeover" commands on NetApp filer while doing direct I/O from a client. This can also be reproduced using SystemTap to simulate a failure while doing direct I/O from a client (from Dave Wysochanski <dwysocha@redhat.com>):

    stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'

Suggested-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Scott Mayhew <smayhew@redhat.com> Fixes: 1ca018d28d ("pNFS: Fix a memory leak when attempted pnfs fails") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
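The fix in schematic form, taken straight from the commit text: complete the header, which runs the direct-I/O completion path and drops the dreq reference, instead of freeing it directly. Surrounding code elided.

    /* before: frees the header but leaks the dreq reference */
    hdr->release(hdr);

    /* after: the completion path releases the dreq reference too */
    hdr->completion_ops->completion(hdr);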
2018-01-14 | pnfs/blocklayout: handle transient devices | Benjamin Coddington
PNFS block/SCSI layouts should gracefully handle cases where block devices are not available when a layout is retrieved, or the block devices are removed while the client holds a layout. While setting up a layout segment, keep a record of an unavailable or un-parsable block device in cache with a flag so that subsequent layouts do not spam the server with GETDEVINFO. We can reuse the current NFS_DEVICEID_UNAVAILABLE handling with one variation: instead of reusing the device, we will discard it and send a fresh GETDEVINFO after the timeout, since the lookup and validation of the device occurs within the GETDEVINFO response handling. A lookup of a layout segment that references an unavailable device will return a segment with the NFS_LSEG_UNAVAILABLE flag set. This will allow the pgio layer to mark the layout with the appropriate fail bit, which forces subsequent IO to the MDS, and prevents spamming the server with LAYOUTGET, LAYOUTRETURN. Finally, when IO to a block device fails, look up the block device(s) referenced by the pgio header, and mark them as unavailable. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 | pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR | Benjamin Coddington
If there's an error doing I/O to a block device, and the client resends the I/O to the MDS, the MDS must recall the layout from the client before processing the I/O. Let's preempt that exchange by returning the layout before falling back to the MDS when there's an error. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>