path: root/fs
Age | Commit message | Author
2021-11-25 | io_uring: better to use REQ_F_IO_DRAIN for req->flags | Hao Xu
It's better to use REQ_F_IO_DRAIN for req->flags rather than IOSQE_IO_DRAIN, even though they have the same value. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211125092103.224502-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-25 | io_uring: fix no lock protection for ctx->cq_extra | Hao Xu
ctx->cq_extra should be protected by the completion lock so that req_need_defer() does the right check. Cc: stable@vger.kernel.org Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211125092103.224502-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-25 | fuse: send security context of inode on file | Vivek Goyal
When a new inode is created, send its security context to the server along with the creation request (FUSE_CREATE, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives the server an opportunity to create the new file and set its security context (possibly atomically). Setting the context atomically might not be possible in every configuration.

Like nfs and ceph, use security_dentry_init_security() to determine the security context of the inode and send it with create, mkdir, mknod, and symlink requests.

The following information is sent to the server:

fuse_secctx_header, fuse_secctx, xattr_name, security_context

- struct fuse_secctx_header
  This contains the total number of security contexts being sent and the total size of all the security contexts (including the size of fuse_secctx_header).
- struct fuse_secctx
  This contains the size of the security context which follows this structure. There is one fuse_secctx instance per security context.
- xattr name string
  This string represents the name of the xattr which should be used while setting the security context.
- security context
  This is the actual security context whose size is specified in the fuse_secctx struct.

Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set, the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferably) with the requested security context.

For example, if the server is using SELinux and is backed by a "real" linux file system that supports extended attributes, it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode.

This patch is based on a patch from Chirantan Ekbote <chirantan@chromium.org>

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
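The layout described in this message (header, per-context struct, xattr name, context value) can be illustrated with a small userspace sketch. The struct and helper names below are hypothetical stand-ins, the field order is an assumption, and any alignment padding the real protocol may require is deliberately ignored; this is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the header: field order is an assumption. */
struct secctx_header {
    uint32_t size;      /* total size of everything sent, header included */
    uint32_t nr_secctx; /* number of security contexts that follow */
};

/* Stand-in for the per-context struct. */
struct secctx {
    uint32_t size;      /* size of the context value after the xattr name */
    uint32_t padding;
};

/* Hypothetical helper: pack a single context. Returns bytes used, 0 if buf is too small. */
static size_t pack_one_secctx(uint8_t *buf, size_t cap, const char *name,
                              const void *ctx, uint32_t ctx_len)
{
    size_t name_len = strlen(name) + 1; /* keep the terminating NUL */
    size_t total = sizeof(struct secctx_header) + sizeof(struct secctx)
                   + name_len + ctx_len;
    if (total > cap)
        return 0;

    struct secctx_header h = { (uint32_t)total, 1 };
    struct secctx s = { ctx_len, 0 };
    uint8_t *p = buf;

    memcpy(p, &h, sizeof h);    p += sizeof h;
    memcpy(p, &s, sizeof s);    p += sizeof s;
    memcpy(p, name, name_len);  p += name_len;
    memcpy(p, ctx, ctx_len);
    return total;
}

/* Hypothetical helper: locate the first context. Returns 0 on success, -1 if malformed. */
static int parse_first_secctx(const uint8_t *buf, size_t len, const char **name,
                              const uint8_t **ctx, uint32_t *ctx_len)
{
    struct secctx_header h;
    struct secctx s;

    if (len < sizeof h + sizeof s)
        return -1;
    memcpy(&h, buf, sizeof h);
    if (h.size > len || h.size < sizeof h + sizeof s || h.nr_secctx < 1)
        return -1;
    memcpy(&s, buf + sizeof h, sizeof s);

    const uint8_t *p = buf + sizeof h + sizeof s;
    const uint8_t *end = buf + h.size;
    const uint8_t *nul = memchr(p, '\0', (size_t)(end - p));

    if (!nul || (size_t)(end - (nul + 1)) < s.size)
        return -1; /* name not terminated, or value runs past the buffer */

    *name = (const char *)p;
    *ctx = nul + 1;
    *ctx_len = s.size;
    return 0;
}
```

A server-side reader would call something like parse_first_secctx() on the bytes that trail the create request, then use the returned name and value when setting the xattr on the new inode.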
2021-11-25 | fuse: release pipe buf after last use | Miklos Szeredi
Checking buf->flags should be done before the pipe_buf_release() is called on the pipe buffer, since releasing the buffer might modify the flags. This is exactly what page_cache_pipe_buf_release() does, and which results in the same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was trying to fix. Reported-by: Justin Forbes <jmforbes@linuxtx.org> Fixes: 712a951025c0 ("fuse: fix page stealing") Cc: <stable@vger.kernel.org> # v2.6.35 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
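The ordering problem can be shown with a mock buffer in plain C. Everything here (struct mock_buf, BUF_FLAG_LRU, mock_release) is a hypothetical stand-in rather than the kernel's pipe buffer API; the only point is that a flag must be read before a release call that may modify it.

```c
#include <stdbool.h>

#define BUF_FLAG_LRU 0x01u /* hypothetical flag bit */

struct mock_buf {
    unsigned int flags;
    bool released;
};

/* Mock release: like the release described above, it may modify the flags. */
static void mock_release(struct mock_buf *buf)
{
    buf->flags &= ~BUF_FLAG_LRU;
    buf->released = true;
}

/* Buggy order: the flag is inspected after the release already cleared it. */
static bool consume_buggy(struct mock_buf *buf)
{
    mock_release(buf);
    return (buf->flags & BUF_FLAG_LRU) != 0;
}

/* Fixed order: remember the flag first, release after the last use. */
static bool consume_fixed(struct mock_buf *buf)
{
    bool was_lru = (buf->flags & BUF_FLAG_LRU) != 0;

    mock_release(buf);
    return was_lru;
}
```

With a buffer that starts with BUF_FLAG_LRU set, consume_buggy() reports false while consume_fixed() reports true.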
2021-11-25 | fuse: extend init flags | Miklos Szeredi
FUSE_INIT flags are close to running out, so add another 32bits worth of space. Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in. If this flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi field is allocated to contain the high 32bits of the flags. A flags_hi field is also added to fuse_init_out, allocated out of the remaining unused fields. Known userspace implementations of the fuse protocol have been checked to accept the extended FUSE_INIT request, but this might cause problems with other implementations. If that happens to be the case, the protocol negotiation will have to be extended with an extra initialization request roundtrip. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
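A minimal sketch of how a 64-bit flag set could be split across the old 32-bit flags field and the new flags_hi field. The helper names are hypothetical, and the bit position chosen for the extension flag (bit 30) is an assumption made for illustration.

```c
#include <stdint.h>

#define INIT_EXT_BIT (1u << 30) /* assumed position of the FUSE_INIT_EXT flag */

/* Split a 64-bit flag set into the low field and the flags_hi field. */
static void split_init_flags(uint64_t all, uint32_t *flags, uint32_t *flags_hi)
{
    *flags = (uint32_t)all | INIT_EXT_BIT; /* announce that flags_hi is present */
    *flags_hi = (uint32_t)(all >> 32);
}

/* Recombine; flags_hi is only trusted when the extension bit is set. */
static uint64_t combine_init_flags(uint32_t flags, uint32_t flags_hi)
{
    uint64_t all = flags;

    if (flags & INIT_EXT_BIT)
        all |= (uint64_t)flags_hi << 32;
    return all;
}
```

A peer that does not know about the extension never sets the extension bit, so combine_init_flags() ignores whatever happens to be in the bytes where flags_hi would live.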
2021-11-25 | ksmbd: fix memleak in get_file_stream_info() | Namjae Jeon
Fix memleak in get_file_stream_info() Fixes: 34061d6b76a4 ("ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests") Cc: stable@vger.kernel.org # v5.15 Reported-by: Coverity Scan <scan-admin@coverity.com> Acked-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-25 | ksmbd: contain default data stream even if xattr is empty | Namjae Jeon
If xattr is not supported, as on exfat or fat, the ksmbd server doesn't include the default data stream in the FILE_STREAM_INFORMATION response. This causes update issues for ppt or doc files when the local filesystem is one of those. This patch moves the goto statement so that the default data stream is included. Fixes: 9f6323311c70 ("ksmbd: add default data stream name in FILE_STREAM_INFORMATION") Cc: stable@vger.kernel.org # v5.15 Acked-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-25 | ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec() | Namjae Jeon
During file transfer from a Windows client, this error message floods the log. The flood causes performance degradation and gives the false impression that the server has a problem. Fixes: e294f78d3478 ("ksmbd: allow PROTECTED_DACL_SECINFO and UNPROTECTED_DACL_SECINFO addition information in smb2 set info security") Cc: stable@vger.kernel.org # v5.15 Acked-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-25 | ksmbd: Fix an error handling path in 'smb2_sess_setup()' | Christophe JAILLET
All the error handling paths of 'smb2_sess_setup()' end to 'out_err'. All but the new error handling path added by the commit given in the Fixes tag below. Fix this error handling path and branch to 'out_err' as well. Fixes: 0d994cd482ee ("ksmbd: add buffer validation in session setup") Cc: stable@vger.kernel.org # v5.15 Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-24 | io_uring: disable drain with cqe skip | Pavel Begunkov
The current IOSQE_IO_DRAIN implementation doesn't work well with CQE skipping, so combining them is not allowed; otherwise some requests might not be executed until the ring is destroyed, and userspace would hang. Let's fail all drain requests after seeing IOSQE_CQE_SKIP_SUCCESS at least once. All drained requests prior to that will get run normally, so there should be no stalls. However, even though such mixing wouldn't lead to issues at the moment, it's still not allowed as the behaviour may change. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bcf7164f8bf3eb54b7bb7b4fd119907fa4d4d43b.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-24 | io_uring: don't spinlock when not posting CQEs | Pavel Begunkov
When none of the requests queued for batch completion need to post a CQE (see IOSQE_CQE_SKIP_SUCCESS), avoid grabbing ->completion_lock and the rest of the commit/post work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8d4b4a08bca022cbe19af00266407116775b3e4d.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-24 | io_uring: add option to skip CQE posting | Pavel Begunkov
Emitting a CQE is expensive from the kernel's perspective. Often it's also inconvenient for userspace, which spends cycles on processing it and ends up with more complicated logic. A similar problem applies to linked requests, where we post a CQE for each request in the link.

Introduce a new flag, IOSQE_CQE_SKIP_SUCCESS, to help with this. When it is set and a request completes successfully, no CQE is generated. When the request fails, it produces a CQE, but all following linked requests will be CQE-less, regardless of whether they have IOSQE_CQE_SKIP_SUCCESS set or not. The notion of "fail" is the same as for link failing-cancellation, where it's opcode dependent, and _usually_ result >= 0 is a success, but not always.

Linked timeouts are a bit special. When the request it's linked to was not attempted to be executed, e.g. failing linked requests, it follows the description above. Otherwise, whether a linked timeout will post a completion or not solely depends on IOSQE_CQE_SKIP_SUCCESS of that linked timeout request. Linked timeouts never "fail" during execution, so for them it's unconditional. Users are expected not to care about the result of the timeout but to rely solely on the result of the master request. Another reason for such treatment is that it's racy, and the timeout callback may be running while the master request posts its completion.

use case 1: If one doesn't care about the results of some requests, e.g. normal timeouts, just set IOSQE_CQE_SKIP_SUCCESS. Error results will still be posted and need to be handled.

use case 2: Set IOSQE_CQE_SKIP_SUCCESS for all requests of a link but the last, and it'll post a completion only for the last one if everything goes right; otherwise there will be only one CQE, for the first failed request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0220fbe06f7cf99e6fc71b4297bb1cb6c0e89c2c.1636559119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
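The posting rules in this message can be modelled in a few lines of plain C. This is a toy model, not io_uring code: struct link_req and count_posted_cqes() are hypothetical, failure is simplified to res < 0 (the message notes it is really opcode dependent), linked timeouts are left out, and the behaviour after a failing request that does not carry the flag (the cancelled remainder still posts CQEs) is an assumption, since the message does not spell that case out.

```c
#include <stdbool.h>
#include <stddef.h>

struct link_req {
    int res;           /* result; this model treats res < 0 as a failure */
    bool skip_success; /* stands in for IOSQE_CQE_SKIP_SUCCESS */
};

/*
 * Walk one link in order. posted[i] is set to true when request i posts a CQE.
 * Returns the number of CQEs posted for the whole link.
 */
static size_t count_posted_cqes(const struct link_req *reqs, size_t n, bool *posted)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        bool failed = reqs[i].res < 0;

        /* A failure always posts; a success posts only without the flag. */
        posted[i] = failed || !reqs[i].skip_success;
        if (posted[i])
            count++;

        if (failed) {
            /* The rest of the link is cancelled; it is CQE-less when the
             * failed request carried the flag (assumed to post otherwise). */
            for (size_t j = i + 1; j < n; j++) {
                posted[j] = !reqs[i].skip_success;
                if (posted[j])
                    count++;
            }
            break;
        }
    }
    return count;
}
```

For use case 2, a three-request link with the flag on the first two requests yields one CQE whether all requests succeed (the last one posts) or the second one fails (only the failed one posts).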
2021-11-24 | io_uring: clean cqe filling functions | Pavel Begunkov
Split io_cqring_fill_event() into a couple of more targeted functions. The first one is io_fill_cqe_aux(), for completions that are not associated with request completions; it does the ->cq_extra accounting. Examples are additional CQEs from multishot poll and rsrc notifications. The second is io_fill_cqe_req(), which should be called for a normal request completion. There is nothing more to it at the moment; it will be used in later patches. The last one is the inlined __io_fill_cqe() for finer grained control; it should be used with caution and only in the hottest places. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/59a9117a4a44fc9efcf04b3afa51e0d080f5943c.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-24 | iomap: iomap_read_inline_data cleanup | Andreas Gruenbacher
Change iomap_read_inline_data to return 0 or an error code; this simplifies the callers. Add a description. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [djwong: document the return value of iomap_read_inline_data explicitly] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24 | xfs: remove xfs_inew_wait | Christoph Hellwig
With the removal of xfs_dqrele_all_inodes, xfs_inew_wait and all the infrastructure used to wake the XFS_INEW bit waitqueue are unused. Reported-by: kernel test robot <lkp@intel.com> Fixes: 777eb1fa857e ("xfs: remove xfs_dqrele_all_inodes") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24 | xfs: Fix the free logic of state in xfs_attr_node_hasname | Yang Xu
When testing xfstests xfs/126 on the latest upstream kernel, it hangs on some machines. By adding a getxattr operation after the xattr is corrupted, I can reproduce it 100% of the time.

The deadlock is as below:

  [ 983.923403] task:setfattr state:D stack: 0 pid:17639 ppid: 14687 flags:0x00000080
  [ 983.923405] Call Trace:
  [ 983.923410] __schedule+0x2c4/0x700
  [ 983.923412] schedule+0x37/0xa0
  [ 983.923414] schedule_timeout+0x274/0x300
  [ 983.923416] __down+0x9b/0xf0
  [ 983.923451] ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
  [ 983.923453] down+0x3b/0x50
  [ 983.923471] xfs_buf_lock+0x33/0xf0 [xfs]
  [ 983.923490] xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
  [ 983.923508] xfs_buf_get_map+0x4c/0x320 [xfs]
  [ 983.923525] xfs_buf_read_map+0x53/0x310 [xfs]
  [ 983.923541] ? xfs_da_read_buf+0xcf/0x120 [xfs]
  [ 983.923560] xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
  [ 983.923575] ? xfs_da_read_buf+0xcf/0x120 [xfs]
  [ 983.923590] xfs_da_read_buf+0xcf/0x120 [xfs]
  [ 983.923606] xfs_da3_node_read+0x1f/0x40 [xfs]
  [ 983.923621] xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
  [ 983.923624] ? kmem_cache_alloc+0x12e/0x270
  [ 983.923637] xfs_attr_node_hasname+0x6e/0xa0 [xfs]
  [ 983.923651] xfs_has_attr+0x6e/0xd0 [xfs]
  [ 983.923664] xfs_attr_set+0x273/0x320 [xfs]
  [ 983.923683] xfs_xattr_set+0x87/0xd0 [xfs]
  [ 983.923686] __vfs_removexattr+0x4d/0x60
  [ 983.923688] __vfs_removexattr_locked+0xac/0x130
  [ 983.923689] vfs_removexattr+0x4e/0xf0
  [ 983.923690] removexattr+0x4d/0x80
  [ 983.923693] ? __check_object_size+0xa8/0x16b
  [ 983.923695] ? strncpy_from_user+0x47/0x1a0
  [ 983.923696] ? getname_flags+0x6a/0x1e0
  [ 983.923697] ? _cond_resched+0x15/0x30
  [ 983.923699] ? __sb_start_write+0x1e/0x70
  [ 983.923700] ? mnt_want_write+0x28/0x50
  [ 983.923701] path_removexattr+0x9b/0xb0
  [ 983.923702] __x64_sys_removexattr+0x17/0x20
  [ 983.923704] do_syscall_64+0x5b/0x1a0
  [ 983.923705] entry_SYSCALL_64_after_hwframe+0x65/0xca
  [ 983.923707] RIP: 0033:0x7f080f10ee1b

When getxattr calls the xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in xfs_attr_node_hasname because we have used blocktrash to randomize it in xfs/126. So it frees the state internally and xfs_attr_node_get doesn't do the xfs_buf_trans release job. A subsequent removexattr then hangs because of it.

This bug was introduced by kernel commit 07120f1abdff ("xfs: Add xfs_has_attr and subroutines"). It added the xfs_attr_node_hasname helper and said the caller would be responsible for freeing the state in this case. But xfs_attr_node_hasname frees the state itself, instead of leaving it to the caller, if xfs_da3_node_lookup_int fails.

Fix this bug by moving the step of freeing the state into the caller. Also, use "goto error/out" instead of returning the error directly in the xfs_attr_node_addname_find_attr and xfs_attr_node_removename_setup functions, because we should free the state ourselves.

Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24 | kernfs: switch global kernfs_rwsem lock to per-fs lock | Minchan Kim
The kernfs implementation has coarse lock granularity (kernfs_rwsem), so every kernfs-based filesystem (e.g., sysfs, cgroup) competes for the same lock. In some cases this means waiting on the global lock for a long time even though the contexts are completely independent of each other. A typical example: process A goes into direct reclaim while holding the lock after accessing a file in sysfs, process B waits for the lock in exclusive mode, and process C then waits for the lock until process B has acquired it from process A and finished its job. This patch switches the global kernfs_rwsem to a per-fs lock by putting the rwsem into kernfs_root. Suggested-by: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-23 | io_uring: improve argument types of kiocb_done() | Pavel Begunkov
kiocb_done() accepts a pointer to struct kiocb, pass struct io_kiocb (i.e. io_uring's request) instead so we can get rid of useless container_of(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/252016eed77806f58b48251a85cd8c645f900433.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-23 | io_uring: clean __io_import_iovec() | Pavel Begunkov
Apparently, implicit 0 to NULL conversion with ERR_PTR is not recommended and makes some tooling like Smatch complain. Handle it explicitly; compilers are perfectly capable of optimising it out. Link: https://lore.kernel.org/all/20211108134937.GA2863@kili/ Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5c6ed369ad95075dab345df679f8677b8fe66656.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
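The idea of handling the zero case explicitly can be sketched in userspace. err_ptr() and is_err() below are simplified stand-ins for the kernel's ERR_PTR() and IS_ERR() helpers, and status_to_ptr() is a hypothetical function, not the code from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline int is_err(const void *p)
{
    /* Error pointers occupy the top 4095 values of the address space. */
    return (uintptr_t)p >= (uintptr_t)-4095;
}

/* Hypothetical helper: turn a 0-or-negative-errno status into a pointer result. */
static void *status_to_ptr(long ret)
{
    if (ret < 0)
        return err_ptr(ret);
    return NULL; /* spelled out, rather than relying on err_ptr(0) being NULL */
}
```

Writing the success branch as a literal NULL keeps the intent visible to readers and static checkers, while an optimising compiler is free to fold both branches together.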
2021-11-23 | io_uring: improve send/recv error handling | Pavel Begunkov
Hide all error handling under common if block, removes two extra ifs on the success path and keeps the handling more condensed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5761545158a12968f3caf30f747eea65ed75dfc1.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-23 | io_uring: simplify reissue in kiocb_done | Pavel Begunkov
Simplify failed resubmission prep in kiocb_done(), it's a bit ugly with conditional logic and hand handling cflags / select buffers. Instead, punt to tw and use io_req_task_complete() already handling all the cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/667c33484b05b612e9420e1b1d5f4dc46d0ee9ce.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-23 | cifs: update internal version number | Steve French
To 2.34 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-23 | smb2: clarify rc initialization in smb2_reconnect | Steve French
It is clearer to initialize rc at the beginning of the function. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-23 | cifs: populate server_hostname for extra channels | Shyam Prasad N
Recently, a new field got added to the smb3_fs_context struct named server_hostname. While creating extra channels, pick up this field from primary channel. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-23 | cifs: nosharesock should be set on new server | Shyam Prasad N
Recent fix to maintain a nosharesock state on the server struct caused a regression. It updated this field in the old tcp session, and not the new one. This caused the multichannel scenario to misbehave. Fixes: c9f1c19cf7c5 (cifs: nosharesock should not share socket with future sessions) Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-23 | erofs: fix deadlock when shrink erofs slab | Huang Jianan
We observed the following deadlock in the stress test under low memory scenario:

Thread A                               Thread B
- erofs_shrink_scan
 - erofs_try_to_release_workgroup
  - erofs_workgroup_try_to_freeze -- A
                                       - z_erofs_do_read_page
                                        - z_erofs_collection_begin
                                         - z_erofs_register_collection
                                          - erofs_insert_workgroup
                                           - xa_lock(&sbi->managed_pslots) -- B
                                           - erofs_workgroup_get
                                            - erofs_wait_on_workgroup_freezed -- A
  - xa_erase
   - xa_lock(&sbi->managed_pslots) -- B

To fix this, it needs to hold xa_lock before freezing the workgroup since xarray will be touched then. So let's hold the lock before accessing each workgroup, just like what we did with the radix tree before.

[ Gao Xiang: Jianhua Hao also reports this issue at https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]

Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com
Fixes: 64094a04414f ("erofs: convert workstn to XArray")
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Reported-by: Jianhua Hao <haojianhua1@xiaomi.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-11-22 | io_uring: correct link-list traversal locking | Pavel Begunkov
As io_remove_next_linked() is now under ->timeout_lock (see io_link_timeout_fn), we should update locking around io_for_each_link() and io_match_task() to use the new lock. Cc: stable@kernel.org # 5.15+ Fixes: 89850fce16a1a ("io_uring: run timeouts from task_work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b54541cedf7de59cb5ae36109e58529ca16e66aa.1637631883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-21 | iomap: Fix inline extent handling in iomap_readpage | Andreas Gruenbacher
Before commit 740499c78408 ("iomap: fix the iomap_readpage_actor return value for inline data"), when hitting an IOMAP_INLINE extent, iomap_readpage_actor would report having read the entire page. Since then, it only reports having read the inline data (iomap->length). This will force iomap_readpage into another iteration, and the filesystem will report an unaligned hole after the IOMAP_INLINE extent. But iomap_readpage_actor (now iomap_readpage_iter) isn't prepared to deal with unaligned extents, it will get things wrong on filesystems with a block size smaller than the page size, and we'll eventually run into the following warning in iomap_iter_advance: WARN_ON_ONCE(iter->processed > iomap_length(iter)); Fix that by changing iomap_readpage_iter to return 0 when hitting an inline extent; this will cause iomap_iter to stop immediately. To fix readahead as well, change iomap_readahead_iter to pass on iomap_readpage_iter return values less than or equal to zero. Fixes: 740499c78408 ("iomap: fix the iomap_readpage_actor return value for inline data") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-21 | pstore/blk: Use "%lu" to format unsigned long | Geert Uytterhoeven
On 32-bit:

  fs/pstore/blk.c: In function ‘__best_effort_init’:
  include/linux/kern_levels.h:5:18: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘long unsigned int’ [-Wformat=]
      5 | #define KERN_SOH "\001" /* ASCII Start Of Header */
        |                  ^~~~~~
  include/linux/kern_levels.h:14:19: note: in expansion of macro ‘KERN_SOH’
     14 | #define KERN_INFO KERN_SOH "6" /* informational */
        |                   ^~~~~~~~
  include/linux/printk.h:373:9: note: in expansion of macro ‘KERN_INFO’
    373 | printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
        |        ^~~~~~~~~
  fs/pstore/blk.c:314:3: note: in expansion of macro ‘pr_info’
    314 | pr_info("attached %s (%zu) (no dedicated panic_write!)\n",
        | ^~~~~~~

Cc: stable@vger.kernel.org
Fixes: 7bb9557b48fcabaa ("pstore/blk: Use the normal block device I/O path")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210629103700.1935012-1-geert@linux-m68k.org
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
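The warning comes down to matching each conversion specifier to the argument's declared type: "%lu" for unsigned long and "%zu" for size_t. The two types can be distinct on 32-bit targets even when they have the same width. A small userspace illustration follows; format_sizes() is a hypothetical helper, not the pstore code.

```c
#include <stddef.h>
#include <stdio.h>

/* Each specifier matches the declared type of its argument. */
static int format_sizes(char *out, size_t cap, unsigned long total, size_t chunk)
{
    return snprintf(out, cap, "total=%lu chunk=%zu", total, chunk);
}
```

Swapping the two specifiers may appear to work on 64-bit builds, where both types are often the same underlying type, but it draws the -Wformat warning quoted in the message on targets where they differ.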
2021-11-20 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds
Merge misc fixes from Andrew Morton:
 "15 patches.

  Subsystems affected by this patch series: ipc, hexagon, mm (swap, slab-generic, kmemleak, hugetlb, kasan, damon, and highmem), and proc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  proc/vmcore: fix clearing user buffer by properly using clear_user()
  kmap_local: don't assume kmap PTEs are linear arrays in memory
  mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
  mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
  kasan: test: silence intentional read overflow warnings
  hugetlb, userfaultfd: fix reservation restore on userfaultfd error
  hugetlb: fix hugetlb cgroup refcounting during mremap
  mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
  hexagon: ignore vmlinux.lds
  hexagon: clean up timer-regs.h
  hexagon: export raw I/O routines for modules
  mm: emit the "free" trace report before freeing memory in kmem_cache_free()
  shm: extend forced shm destroy to support objects from several IPC nses
  ipc: WARN if trying to remove ipc object which is absent
  mm/swap.c:put_pages_list(): reinitialise the page list
2021-11-20 | Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block | Linus Torvalds
Pull block fixes from Jens Axboe:
 - Flip a cap check to avoid a selinux error (Alistair)
 - Fix for a regression this merge window where we can miss a queue ref put (me)
 - Un-mark pstore-blk as broken, as the condition that triggered that change has been rectified (Kees)
 - Queue quiesce and sync fixes (Ming)
 - FUA insertion fix (Ming)
 - blk-cgroup error path put fix (Yu)

* tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block:
  blk-mq: don't insert FUA request with data into scheduler queue
  blk-cgroup: fix missing put device in error path from blkg_conf_pref()
  block: avoid to quiesce queue in elevator_init_mq
  Revert "mark pstore-blk as broken"
  blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
  block: fix missing queue put in error path
  block: Check ADMIN before NICE for IOPRIO_CLASS_RT
2021-11-20 | Merge tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 | Linus Torvalds
Pull cifs fixes from Steve French:
 "Three small cifs/smb3 fixes: two to address minor coverity issues and one cleanup"

* tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: introduce cifs_ses_mark_for_reconnect() helper
  cifs: protect srv_count with cifs_tcp_ses_lock
  cifs: move debug print out of spinlock
2021-11-20 | proc/vmcore: fix clearing user buffer by properly using clear_user() | David Hildenbrand
To clear a user buffer we cannot simply use memset, we have to use clear_user(). With a virtio-mem device that registers a vmcore_cb and has some logically unplugged memory inside an added Linux memory block, I can easily trigger a BUG by copying the vmcore via "cp":

  systemd[1]: Starting Kdump Vmcore Save Service...
  kdump[420]: Kdump is using the default log level(3).
  kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
  kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
  kdump[465]: saving vmcore-dmesg.txt complete
  kdump[467]: saving vmcore
  BUG: unable to handle page fault for address: 00007f2374e01000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
  Oops: 0003 [#1] PREEMPT SMP NOPTI
  CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
  RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
  Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
  RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
  RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
  RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
  RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
  R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
  R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
  FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
  Call Trace:
   read_vmcore+0x236/0x2c0
   proc_reg_read+0x55/0xa0
   vfs_read+0x95/0x190
   ksys_read+0x4f/0xc0
   do_syscall_64+0x3b/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access Prevention (SMAP)", which is used to detect wrong access from the kernel to user buffers like this: SMAP triggers a permissions violation on wrong access. In the x86-64 variant of clear_user(), SMAP is properly handled via clac()+stac().

To fix, properly use clear_user() when we're dealing with a user buffer.

Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-18 | Merge tag 'for-5.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds
Pull btrfs fixes from David Sterba:
 "Several fixes and one old ioctl deprecation. Namely there's a fix for crashes/warnings with lzo compression that was suspected to be caused by the first pull merge resolution, but it was a different bug.

  Summary:
  - regression fix for a crash in lzo due to missing boundary checks of the page array
  - fix crashes on ARM64 due to missing barriers when synchronizing status bits between work queues
  - silence lockdep when reading chunk tree during mount
  - fix false positive warning in integrity checker on devices with disabled write caching
  - fix signedness of bitfields in scrub
  - start deprecation of balance v1 ioctl"

* tag 'for-5.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: deprecate BTRFS_IOC_BALANCE ioctl
  btrfs: make 1-bit bit-fields of scrub_page unsigned int
  btrfs: check-integrity: fix a warning on write caching disabled disk
  btrfs: silence lockdep when reading chunk tree during mount
  btrfs: fix memory ordering between normal and ordered work functions
  btrfs: fix a out-of-bound access in copy_compressed_data_to_page()
2021-11-18 | Merge tag 'fs_for_v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs | Linus Torvalds
Pull UDF fix from Jan Kara:
 "A fix for a long-standing UDF bug where we were not properly validating directory position inside readdir"

* tag 'fs_for_v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix crash after seekdir
2021-11-18 | Merge tag 'fs.idmapped.v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux | Linus Torvalds
Pull setattr idmapping fix from Christian Brauner:
 "This contains a simple fix for setattr. When determining the validity of the attributes the ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id. When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id matches the current {g,u}id attribute the attribute change is allowed.

  The value in ia_{g,u}id does already account for idmapped mounts and will have taken the relevant idmapping into account. So in order to verify that the {g,u}id attribute isn't changed we simply need to compare the ia_{g,u}id value against the inode's i_{g,u}id value.

  This only has any meaning for idmapped mounts as idmapping helpers are idempotent without them. And for idmapped mounts this really only has a meaning when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped to id 1001 and id 1001 is mapped to id 1000. Such circular mappings can e.g. be useful when sharing the same home directory between multiple users at the same time.

  Before this patch we could end up denying legitimate attribute changes and allowing invalid attribute changes when circular mappings are used. To even get into this situation the caller must've been privileged both to create that mapping and to create that idmapped mount.

  This hasn't been seen in the wild anywhere but came up when expanding the fstest suite during work on a series of hardening patches. All idmapped fstests pass without any regressions and we're adding new tests to verify the behavior of circular mappings.

  The new tests can be found at [1]"

Link: https://lore.kernel.org/linux-fsdevel/20211109145713.1868404-2-brauner@kernel.org [1]

* tag 'fs.idmapped.v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: handle circular mappings correctly
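A toy model may help show why a circular mapping trips up a check that maps the id a second time. The functions below are hypothetical; in particular, unchanged_before() is an assumed reconstruction of the pre-patch behaviour based only on the description above, not code taken from the kernel.

```c
#include <stdbool.h>

/*
 * Toy circular idmapping: 1000 <-> 1001, identity for everything else.
 * The mapping is its own inverse, so one function serves both directions.
 */
static unsigned int map_id(unsigned int id)
{
    if (id == 1000)
        return 1001;
    if (id == 1001)
        return 1000;
    return id;
}

/* Assumed "before" check: the inode's id is mapped again before comparing. */
static bool unchanged_before(unsigned int ia_uid, unsigned int i_uid)
{
    return ia_uid == map_id(i_uid);
}

/* Check as described: ia_uid is already mapped, so compare it directly. */
static bool unchanged_after(unsigned int ia_uid, unsigned int i_uid)
{
    return ia_uid == i_uid;
}
```

With an inode whose stored uid is 1000, a caller on the mount sees the owner as 1001. Requesting 1001 (no real change) yields an ia_uid of 1000, which the direct comparison accepts and the double-mapping comparison rejects; requesting 1000 (a real change) produces the opposite, wrong verdicts in the assumed "before" variant.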
2021-11-18 | pstore/ftrace: Allow immediate recording | Uwe Kleine-König
Without a module param knob there was no way to enable pstore ftrace recording early enough to debug hangs happening during the boot process before userspace is up enough to enable it via the regular debugfs knobs. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210610082134.20636-1-u.kleine-koenig@pengutronix.de
2021-11-18reiserfs: don't use congestion_wait()NeilBrown
Block devices do not, in general, report congestion any more, so this congestion_wait() is effectively just a sleep. It isn't entirely clear what is being waited for, but as we only wait when j_async_throttle is elevated, it seems reasonable to stop waiting when j_async_throttle becomes zero - or after the same timeout. So change to use wait_var_event_timeout() for waiting, and wake_up_var() to signal an end to waiting. Link: https://lore.kernel.org/r/163712368225.13692.3419908086400748349@noble.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jan Kara <jack@suse.cz>
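The wait-until-zero-or-timeout pattern described above can be illustrated in userspace. The following is only a rough Python analogy, not the kernel code: the class `AsyncThrottle` and its methods are hypothetical names, and `threading.Condition` stands in for the kernel's wait/wake helpers.

```python
import threading


class AsyncThrottle:
    """Hypothetical userspace stand-in for a counter that waiters sleep on."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def inc(self):
        # Mark one more piece of in-flight work.
        with self._cond:
            self._count += 1

    def dec(self):
        # Finish one piece of work; wake waiters once the counter hits zero.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_until_idle(self, timeout):
        # Sleep until the counter is zero or the timeout expires.
        # Returns True if the counter reached zero, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

Unlike a plain sleep, the waiter returns as soon as the counter drops to zero and only falls back to the timeout when nothing wakes it.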
2021-11-17Merge tag 'gfs2-v5.16-rc2-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: - The current iomap_file_buffered_write behavior of failing the entire write when part of the user buffer cannot be faulted in leads to an endless loop in gfs2. Work around that in gfs2 for now. - Various other bugs all over the place. * tag 'gfs2-v5.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Prevent endless loops in gfs2_file_buffered_write gfs2: Fix "Introduce flag for glock holder auto-demotion" gfs2: Fix length of holes reported at end-of-file gfs2: release iopen glock early in evict gfs2: Fix atomic bug in gfs2_instantiate gfs2: Only dereference i->iov when iter_is_iovec(i)
2021-11-17NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSIONOlga Kornievskaia
When the client receives NFS4ERR_NOSPC in reply to CREATE_SESSION, it hangs in nfs_wait_client_init_complete(). Instead, complete and fail the client initialization with an EIO error, which allows the mount command to fail instead of hanging. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-17f2fs: rework write preallocationsEric Biggers
f2fs_write_begin() assumes that all blocks were preallocated by default unless FI_NO_PREALLOC is explicitly set. This invites data corruption, as there are cases in which not all blocks are preallocated. Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing buffered_io") fixed one case, but there are others remaining. Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which only gets set if all blocks for the current write were preallocated. Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it handle some of the logic that was previously in write_iter() directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-11-17f2fs: compress: reduce one page array alloc and free when write compressed pageFengnan Chang
Don't allocate a new page pointer array to replace the old one; reuse the old array and introduce valid_nr_cpages to indicate the number of valid page pointers in it. This saves one page array allocation and free when writing a compressed page. Signed-off-by: Fengnan Chang <changfengnan@vivo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-11-17NFSv42: Fix pagecache invalidation after COPY/CLONEBenjamin Coddington
The mechanism in use to allow the client to see the results of COPY/CLONE is to drop those pages from the pagecache. This forces the client to read those pages once more from the server. However, truncate_pagecache_range() zeros out partial pages instead of dropping them. Let us instead use invalidate_inode_pages2_range() with full-page offsets to ensure the client properly sees the results of COPY/CLONE operations. Cc: <stable@vger.kernel.org> # v4.7+ Fixes: 2e72448b07dc ("NFS: Add COPY nfs operation") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
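To make "full-page offsets" concrete, here is a small arithmetic sketch of how a byte range maps to the inclusive range of page indices that cover it. It is illustrative only: the 4 KiB page size and the helper name `page_index_range` are assumptions, and the expression is not claimed to be the exact one used in the patch.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages; the real value is architecture dependent


def page_index_range(pos, count):
    """Return (first, last) page indices covering bytes [pos, pos + count)."""
    if pos < 0 or count <= 0:
        raise ValueError("pos must be >= 0 and count must be > 0")
    first = pos >> PAGE_SHIFT
    # The last byte touched is pos + count - 1, so partially covered
    # pages at either end are included in the range.
    last = (pos + count - 1) >> PAGE_SHIFT
    return first, last
```

Because partially covered pages fall inside the index range, they are dropped whole rather than zeroed in place, which is the behavioral difference the commit message describes.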
2021-11-17NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()Benjamin Coddington
This provides some insight into the client's invalidation behavior to show both when the client uses the helper, and the results of calling the helper which can vary depending on how the helper is called. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-17NFSv42: Don't fail clone() unless the OP_CLONE operation failedTrond Myklebust
The failure to retrieve post-op attributes has no bearing on whether or not the clone operation itself was successful. We must therefore ignore the return value of decode_getfattr() when looking at the success or failure of nfs4_xdr_dec_clone(). Fixes: 36022770de6c ("nfs42: add CLONE xdr functions") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-17signal: Requeue signals in the appropriate queueEric W. Biederman
In the event that a tracer changes which signal needs to be delivered and that signal is currently blocked then the signal needs to be requeued for later delivery. With the advent of CLONE_THREAD the kernel has 2 signal queues per task. The per process queue and the per task queue. Update the code so that if the signal is removed from the per process queue it is requeued on the per process queue. This is necessary to make it appear the signal was never dequeued. The rr debugger reasonably believes that the state of the process from the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated by simply letting a process run. If a SIGKILL interrupts a ptrace_stop this is not true today. So return signals to their original queue in ptrace_signal so that signals that are not delivered appear like they were never dequeued. Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6") History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
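The two-queue bookkeeping can be pictured with a toy model. This is a simplified sketch rather than the kernel's data structures: `Task`, `dequeue`, and `requeue` are hypothetical names, signals are plain strings, and blocking and signal masks are left out entirely.

```python
from collections import deque


class Task:
    """Toy model of a task with a per-task and a per-process signal queue."""

    def __init__(self):
        self.pending = deque()         # per-task queue
        self.shared_pending = deque()  # per-process queue

    def dequeue(self):
        # Try the per-task queue first, then the per-process queue, and
        # remember which queue the signal came from.
        if self.pending:
            return self.pending.popleft(), "task"
        if self.shared_pending:
            return self.shared_pending.popleft(), "process"
        return None, None

    def requeue(self, signal, source):
        # Put a (possibly tracer-modified) signal back on the queue it was
        # taken from, so it looks as if it had never been dequeued there.
        queue = self.pending if source == "task" else self.shared_pending
        queue.append(signal)
```

Remembering the source queue at dequeue time is the key point: a signal taken from the per-process queue goes back to the per-process queue, not to the per-task one.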
2021-11-17Merge tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd bugfix from Bruce Fields: "This is just one bugfix for a buffer overflow in knfsd's xdr decoding" * tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linux: NFSD: Fix exposure in nfsd4_decode_bitmap()
2021-11-17fs: Remove FS_THP_SUPPORTMatthew Wilcox (Oracle)
Instead of setting a bit in the fs_flags to set a bit in the address_space, set the bit in the address_space directly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-11-17fs: dlm: fix build with CONFIG_IPV6 disabledAlexander Aring
This patch surrounds the AF_INET6 case in sk_error_report() of dlm with #if IS_ENABLED(CONFIG_IPV6), because the field sk->sk_v6_daddr is not defined when CONFIG_IPV6 is disabled. If CONFIG_IPV6 is disabled, socket creation with AF_INET6 should already fail because of a runtime check for whether AF_INET6 is registered. However, if AF_INET6 is nevertheless set as sk_family, the sk_error_report() callback will then print an invalid family type error. Reported-by: kernel test robot <lkp@intel.com> Fixes: 4c3d90570bcc ("fs: dlm: don't call kernel_getpeername() in error_report()") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2021-11-17fs: handle circular mappings correctlyChristian Brauner
When calling setattr_prepare() to determine the validity of the attributes the ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id. When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id matches the current {g,u}id attribute the attribute change is allowed. The value in ia_{g,u}id does already account for idmapped mounts and will have taken the relevant idmapping into account. So in order to verify that the {g,u}id attribute isn't changed we simply need to compare the ia_{g,u}id value against the inode's i_{g,u}id value. This only has any meaning for idmapped mounts as idmapping helpers are idempotent without them. And for idmapped mounts this really only has a meaning when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped to id 1001 and id 1001 is mapped to id 1000. Such circular mappings can e.g. be useful when sharing the same home directory between multiple users at the same time. As an example consider a directory with two files: /source/file1 owned by {g,u}id 1000 and /source/file2 owned by {g,u}id 1001. Assume we create an idmapped mount at /target with an idmapping that maps files owned by {g,u}id 1000 to being owned by {g,u}id 1001 and files owned by {g,u}id 1001 to being owned by {g,u}id 1000. In effect, the idmapped mount at /target switches the ownership of /source/file1 and /source/file2, i.e. /target/file1 will be owned by {g,u}id 1001 and /target/file2 will be owned by {g,u}id 1000. This means that a user with fs{g,u}id 1000 must be allowed to setattr /target/file2 from {g,u}id 1000 to {g,u}id 1000. Similarly, a user with fs{g,u}id 1001 must be allowed to setattr /target/file1 from {g,u}id 1001 to {g,u}id 1001. Conversely, a user with fs{g,u}id 1000 must fail to setattr /target/file1 from {g,u}id 1001 to {g,u}id 1000. And a user with fs{g,u}id 1001 must fail to setattr /target/file2 from {g,u}id 1000 to {g,u}id 1000. Both cases must fail with EPERM for non-capable callers.
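The /source and /target example can be walked through with a small model. This is a minimal sketch under stated assumptions, not the kernel implementation: the helper names (`map_into_mnt`, `map_from_mnt`, `unchanged_owner_ok`) are hypothetical, only the "owner is not changed" rule for non-capable callers is modeled, and uids stand in for both uids and gids.

```python
# Circular idmapping of the mount at /target: on-disk id -> id seen via the mount.
MOUNT_IDMAP = {1000: 1001, 1001: 1000}
INVERSE_IDMAP = {seen: disk for disk, seen in MOUNT_IDMAP.items()}


def map_into_mnt(disk_id):
    """Id as it appears through the idmapped mount (hypothetical helper)."""
    return MOUNT_IDMAP.get(disk_id, disk_id)


def map_from_mnt(mnt_id):
    """Id that will be written to the inode for a mount-side id."""
    return INVERSE_IDMAP.get(mnt_id, mnt_id)


def unchanged_owner_ok(fsuid, i_uid, requested_uid):
    """Allow setattr only if the caller owns the file and the owner stays the same.

    ia_uid already has the idmapping applied, so it is compared against the
    inode's i_uid directly, not against the mount-side view of i_uid.
    """
    ia_uid = map_from_mnt(requested_uid)
    return fsuid == map_into_mnt(i_uid) and ia_uid == i_uid


FILE1_I_UID = 1000  # /source/file1, seen as 1001 at /target/file1
FILE2_I_UID = 1001  # /source/file2, seen as 1000 at /target/file2
```

With these definitions, `unchanged_owner_ok(1000, FILE2_I_UID, 1000)` and `unchanged_owner_ok(1001, FILE1_I_UID, 1001)` are allowed, while `unchanged_owner_ok(1000, FILE1_I_UID, 1000)` is refused. Comparing `ia_uid` against the mount-side view of the owner instead would get the two allowed cases wrong, since the circular mapping swaps the ids.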
Before this patch we could end up denying legitimate attribute changes and allowing invalid attribute changes when circular mappings are used. To even get into this situation the caller must've been privileged both to create that mapping and to create that idmapped mount. This hasn't been seen in the wild anywhere but came up when expanding the testsuite during work on a series of hardening patches. All idmapped fstests pass without any regressions and we add new tests to verify the behavior of circular mappings. Link: https://lore.kernel.org/r/20211109145713.1868404-1-brauner@kernel.org Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts") Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org CC: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Seth Forshee <sforshee@digitalocean.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>