summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-04-25f2fs: Remove usage of list iterator pas the loop for list_move_tail()Jakob Koschel
In preparation to limit the scope of a list iterator to the list traversal loop, the usage of the list iterator variable 'next' should be avoided past the loop body [1]. Instead of calling list_move_tail() on 'next' after the loop, it is called within the loop if the correct location was found. After the loop it covers the case if no location was found and it should be inserted based on the 'head' of the list. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25f2fs: fix dereference of stale list iterator after loop bodyJakob Koschel
The list iterator variable will be a bogus pointer if no break was hit. Dereferencing it (cur->page in this case) could load an out-of-bounds/undefined value making it unsafe to use that in the comparision to determine if the specific element was found. Since 'cur->page' *can* be out-ouf-bounds it cannot be guaranteed that by chance (or intention of an attacker) it matches the value of 'page' even though the correct element was not found. This is fixed by using a separate list iterator variable for the loop and only setting the original variable if a suitable element was found. Then determing if the element was found is simply checking if the variable is set. Fixes: 8c242db9b8c0 ("f2fs: fix stale ATOMIC_WRITTEN_PAGE private pointer") Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25f2fs: fix to do sanity check on inline_dots inodeChao Yu
As Wenqing reported in bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215765 It will cause a kernel panic with steps: - mkdir mnt - mount tmp40.img mnt - ls mnt folio_mark_dirty+0x33/0x50 f2fs_add_regular_entry+0x541/0xad0 [f2fs] f2fs_add_dentry+0x6c/0xb0 [f2fs] f2fs_do_add_link+0x182/0x230 [f2fs] __recover_dot_dentries+0x2d6/0x470 [f2fs] f2fs_lookup+0x5af/0x6a0 [f2fs] __lookup_slow+0xac/0x200 lookup_slow+0x45/0x70 walk_component+0x16c/0x250 path_lookupat+0x8b/0x1f0 filename_lookup+0xef/0x250 user_path_at_empty+0x46/0x70 vfs_statx+0x98/0x190 __do_sys_newlstat+0x41/0x90 __x64_sys_newlstat+0x1a/0x30 do_syscall_64+0x37/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae The root cause is for special file: e.g. character, block, fifo or socket file, f2fs doesn't assign address space operations pointer array for mapping->a_ops field, so, in a fuzzed image, if inline_dots flag was tagged in special file, during lookup(), when f2fs runs into __recover_dot_dentries(), it will cause NULL pointer access once f2fs_add_regular_entry() calls a_ops->set_dirty_page(). Fixes: 510022a85839 ("f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries") Reported-by: Wenqing Liu <wenqingliu0120@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25f2fs: introduce data read/write showing path infoJaegeuk Kim
This was used in Android for a long time. Let's upstream it. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25f2fs: remove unnecessary f2fs_lock_op in f2fs_new_inodeJaegeuk Kim
This can be removed, since f2fs_alloc_nid() actually doesn't require to block checkpoint and __f2fs_build_free_nids() is covered by nm_i->nat_tree_lock. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25f2fs: don't set GC_FAILURE_PIN for background GCChao Yu
So that it can reduce the possibility that file be unpinned forcely by foreground GC due to .i_gc_failures[GC_FAILURE_PIN] exceeds threshold. Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25f2fs: check pinfile in gc_data_segment() in advanceChao Yu
In order to skip migrating section which contains data of pinned file in advance. Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25sysctl: minor cleanup in new_dir()Vasily Averin
Byte zeroing is not required here, since memory was allocated by kzalloc() Signed-off-by: Vasily Averin <vvs@openvz.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-25fs/jfs: Remove dead codeDave Kleikamp
Since the JFS code was first added to Linux, there has been code hidden in ifdefs for some potential future features such as defragmentation and supporting block sizes other than 4KB. There has been no ongoing development on JFS for many years, so it's past time to remove this dead code from the source. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2022-04-25Merge tag 'f2fs-fix-5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: "This includes major bug fixes introduced in 5.18-rc1 and 5.17+: - Remove obsolete whint_mode (5.18-rc1) - Fix IO split issue caused by op_flags change in f2fs (5.18-rc1) - Fix a wrong condition check to detect IO failure loop (5.18-rc1) - Fix wrong data truncation during roll-forward (5.17+)" * tag 'f2fs-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: should not truncate blocks during roll-forward recovery f2fs: fix wrong condition check when failing metapage read f2fs: keep io_flags to avoid IO split due to different op_flags in two fio holders f2fs: remove obsolete whint_mode
2022-04-25fanotify: enable "evictable" inode marksAmir Goldstein
Now that the direct reclaim path is handled we can enable evictable inode marks. Link: https://lore.kernel.org/r/20220422120327.3459282-17-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fanotify: use fsnotify group lock helpersAmir Goldstein
Direct reclaim from fanotify mark allocation context may try to evict inodes with evictable marks of the same group and hit this deadlock: [<0>] fsnotify_destroy_mark+0x1f/0x3a [<0>] fsnotify_destroy_marks+0x71/0xd9 [<0>] __destroy_inode+0x24/0x7e [<0>] destroy_inode+0x2c/0x67 [<0>] dispose_list+0x49/0x68 [<0>] prune_icache_sb+0x5b/0x79 [<0>] super_cache_scan+0x11c/0x16f [<0>] shrink_slab.constprop.0+0x23e/0x40f [<0>] shrink_node+0x218/0x3e7 [<0>] do_try_to_free_pages+0x12a/0x2d2 [<0>] try_to_free_pages+0x166/0x242 [<0>] __alloc_pages_slowpath.constprop.0+0x30c/0x903 [<0>] __alloc_pages+0xeb/0x1c7 [<0>] cache_grow_begin+0x6f/0x31e [<0>] fallback_alloc+0xe0/0x12d [<0>] ____cache_alloc_node+0x15a/0x17e [<0>] kmem_cache_alloc_trace+0xa1/0x143 [<0>] fanotify_add_mark+0xd5/0x2b2 [<0>] do_fanotify_mark+0x566/0x5eb [<0>] __x64_sys_fanotify_mark+0x21/0x24 [<0>] do_syscall_64+0x6d/0x80 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae Set the FSNOTIFY_GROUP_NOFS flag to prevent going into direct reclaim from allocations under fanotify group lock and use the safe group lock helpers. Link: https://lore.kernel.org/r/20220422120327.3459282-16-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fanotify: implement "evictable" inode marksAmir Goldstein
When an inode mark is created with flag FAN_MARK_EVICTABLE, it will not pin the marked inode to inode cache, so when inode is evicted from cache due to memory pressure, the mark will be lost. When an inode mark with flag FAN_MARK_EVICATBLE is updated without using this flag, the marked inode is pinned to inode cache. When an inode mark is updated with flag FAN_MARK_EVICTABLE but an existing mark already has the inode pinned, the mark update fails with error EEXIST. Evictable inode marks can be used to setup inode marks with ignored mask to suppress events from uninteresting files or directories in a lazy manner, upon receiving the first event, without having to iterate all the uninteresting files or directories before hand. The evictbale inode mark feature allows performing this lazy marks setup without exhausting the system memory with pinned inodes. This change does not enable the feature yet. Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiRDpuS=2uA6+ZUM7yG9vVU-u212tkunBmSnP_u=mkv=Q@mail.gmail.com/ Link: https://lore.kernel.org/r/20220422120327.3459282-15-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fanotify: factor out helper fanotify_mark_update_flags()Amir Goldstein
Handle FAN_MARK_IGNORED_SURV_MODIFY flag change in a helper that is called after updating the mark mask. Replace the added and removed return values and help variables with bool recalc return values and help variable, which makes the code a bit easier to follow. Rename flags argument to fan_flags to emphasize the difference from mark->flags. Link: https://lore.kernel.org/r/20220422120327.3459282-14-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fanotify: create helper fanotify_mark_user_flags()Amir Goldstein
To translate from fsnotify mark flags to user visible flags. Link: https://lore.kernel.org/r/20220422120327.3459282-13-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fsnotify: allow adding an inode mark without pinning inodeAmir Goldstein
fsnotify_add_mark() and variants implicitly take a reference on inode when attaching a mark to an inode. Make that behavior opt-out with the mark flag FSNOTIFY_MARK_FLAG_NO_IREF. Instead of taking the inode reference when attaching connector to inode and dropping the inode reference when detaching connector from inode, take the inode reference on attach of the first mark that wants to hold an inode reference and drop the inode reference on detach of the last mark that wants to hold an inode reference. Backends can "upgrade" an existing mark to take an inode reference, but cannot "downgrade" a mark with inode reference to release the refernce. This leaves the choice to the backend whether or not to pin the inode when adding an inode mark. This is intended to be used when adding a mark with ignored mask that is used for optimization in cases where group can afford getting unneeded events and reinstate the mark with ignored mask when inode is accessed again after being evicted. Link: https://lore.kernel.org/r/20220422120327.3459282-12-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25dnotify: use fsnotify group lock helpersAmir Goldstein
Before commit 9542e6a643fc6 ("nfsd: Containerise filecache laundrette") nfsd would close open files in direct reclaim context. There is no guarantee that others memory shrinkers don't do the same and no guarantee that future shrinkers won't do that. For example, if overlayfs implements inode cache of fscache would keep open files to cached objects, inode shrinkers could end up closing open files to underlying fs. Direct reclaim from dnotify mark allocation context may try to close open files that have dnotify marks of the same group and hit a deadlock on mark_mutex. Set the FSNOTIFY_GROUP_NOFS flag to prevent going into direct reclaim from allocations under dnotify group lock and use the safe group lock helpers. Link: https://lore.kernel.org/r/20220422120327.3459282-11-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25nfsd: use fsnotify group lock helpersAmir Goldstein
Before commit 9542e6a643fc6 ("nfsd: Containerise filecache laundrette") nfsd would close open files in direct reclaim context and that could cause a deadlock when fsnotify mark allocation went into direct reclaim and nfsd shrinker tried to free existing fsnotify marks. To avoid issues like this in future code, set the FSNOTIFY_GROUP_NOFS flag on nfsd fsnotify group to prevent going into direct reclaim from fsnotify_add_inode_mark(). Link: https://lore.kernel.org/r/20220422120327.3459282-10-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25inotify: use fsnotify group lock helpersAmir Goldstein
inotify inode marks pin the inode so there is no need to set the FSNOTIFY_GROUP_NOFS flag. Link: https://lore.kernel.org/r/20220422120327.3459282-8-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fsnotify: create helpers for group mark_mutex lockAmir Goldstein
Create helpers to take and release the group mark_mutex lock. Define a flag FSNOTIFY_GROUP_NOFS in fsnotify_group that determines if the mark_mutex lock is fs reclaim safe or not. If not safe, the lock helpers take the lock and disable direct fs reclaim. In that case we annotate the mutex with a different lockdep class to express to lockdep that an allocation of mark of an fs reclaim safe group may take the group lock of another "NOFS" group to evict inodes. For now, converted only the callers in common code and no backend defines the NOFS flag. It is intended to be set by fanotify for evictable marks support. Link: https://lore.kernel.org/r/20220422120327.3459282-7-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fsnotify: make allow_dups a property of the groupAmir Goldstein
Instead of passing the allow_dups argument to fsnotify_add_mark() as an argument, define the group flag FSNOTIFY_GROUP_DUPS to express the allow_dups behavior and set this behavior at group creation time for all calls of fsnotify_add_mark(). Rename the allow_dups argument to generic add_flags argument for future use. Link: https://lore.kernel.org/r/20220422120327.3459282-6-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fsnotify: pass flags argument to fsnotify_alloc_group()Amir Goldstein
Add flags argument to fsnotify_alloc_group(), define and use the flag FSNOTIFY_GROUP_USER in inotify and fanotify instead of the helper fsnotify_alloc_user_group() to indicate user allocation. Although the flag FSNOTIFY_GROUP_USER is currently not used after group allocation, we store the flags argument in the group struct for future use of other group flags. Link: https://lore.kernel.org/r/20220422120327.3459282-5-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25fsnotify: fix wrong lockdep annotationsAmir Goldstein
Commit 6960b0d909cd ("fsnotify: change locking order") changed some of the mark_mutex locks in direct reclaim path to use: mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING); This change is explained: "...It uses nested locking to avoid deadlock in case we do the final iput() on an inode which still holds marks and thus would take the mutex again when calling fsnotify_inode_delete() in destroy_inode()." The problem is that the mutex_lock_nested() is not a nested lock at all. In fact, it has the opposite effect of preventing lockdep from warning about a very possible deadlock. Due to these wrong annotations, a deadlock that was introduced with nfsd filecache in kernel v5.4 went unnoticed in v5.4.y for over two years until it was reported recently by Khazhismel Kumykov, only to find out that the deadlock was already fixed in kernel v5.5. Fix the wrong lockdep annotations. Cc: Khazhismel Kumykov <khazhy@google.com> Fixes: 6960b0d909cd ("fsnotify: change locking order") Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/ Link: https://lore.kernel.org/r/20220422120327.3459282-4-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25inotify: move control flags from mask to mark flagsAmir Goldstein
The inotify control flags in the mark mask (e.g. FS_IN_ONE_SHOT) are not relevant to object interest mask, so move them to the mark flags. This frees up some bits in the object interest mask. Link: https://lore.kernel.org/r/20220422120327.3459282-3-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25inotify: show inotify mask flags in proc fdinfoAmir Goldstein
The inotify mask flags IN_ONESHOT and IN_EXCL_UNLINK are not "internal to kernel" and should be exposed in procfs fdinfo so CRIU can restore them. Fixes: 6933599697c9 ("inotify: hide internal kernel bits from fdinfo") Link: https://lore.kernel.org/r/20220422120327.3459282-2-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2022-04-25btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faultsCatalin Marinas
Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search ioctl") addressed a lockdep warning by pre-faulting the user pages and attempting the copy_to_user_nofault() in an infinite loop. On architectures like arm64 with MTE, an access may fault within a page at a location different from what fault_in_writeable() probed. Since the sk_offset is rewound to the previous struct btrfs_ioctl_search_header boundary, there is no guaranteed forward progress and search_ioctl() may live-lock. Use fault_in_subpage_writeable() instead of fault_in_writeable() to ensure the permission is checked at the right granularity (smaller than PAGE_SIZE). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Fixes: a48b73eca4ce ("btrfs: fix potential deadlock in the search ioctl") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220423100751.1870771-4-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-04-25ceph: fix possible NULL pointer dereference for req->r_sessionXiubo Li
The request will be inserted into the ci->i_unsafe_dirops before assigning the req->r_session, so it's possible that we will hit NULL pointer dereference bug here. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/55327 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25ceph: remove incorrect session state checkXiubo Li
Once the session is opened the s->s_ttl will be set, and when receiving a new mdsmap and the MDS map is changed, it will be possibly will close some sessions and open new ones. And then some sessions will be in CLOSING state evening without unmounting. URL: https://tracker.ceph.com/issues/54979 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25ceph: get snap_rwsem read lock in handle_cap_export for ceph_add_capNiels Dossche
ceph_add_cap says in its function documentation that the caller should hold the read lock on the session snap_rwsem. Furthermore, not only ceph_add_cap needs that lock, when it calls to ceph_lookup_snap_realm it eventually calls ceph_get_snap_realm which states via lockdep that snap_rwsem needs to be held. handle_cap_export calls ceph_add_cap without that mdsc->snap_rwsem held. Thus, since ceph_get_snap_realm and ceph_add_cap both need the lock, the common place to acquire that lock is inside handle_cap_export. Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-24io_uring: cleanup error-handling around io_req_completeKanchan Joshi
Move common error-handling to io_req_complete, so that various callers avoid repeating that. Few callers (io_tee, io_splice) require slightly different handling. These are changed to use __io_req_complete instead. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220422101048.419942-1-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: add socket(2) supportJens Axboe
Supports both regular socket(2) where a normal file descriptor is instantiated when called, or direct descriptors. Link: https://lore.kernel.org/r/20220412202240.234207-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: add fgetxattr and getxattr supportStefan Roesch
This adds support to io_uring for the fgetxattr and getxattr API. Signed-off-by: Stefan Roesch <shr@fb.com> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20220323154420.3301504-5-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: add fsetxattr and setxattr supportStefan Roesch
This adds support to io_uring for the fsetxattr and setxattr API. Signed-off-by: Stefan Roesch <shr@fb.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20220323154420.3301504-4-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24fs: split off do_getxattr from getxattrStefan Roesch
This splits off do_getxattr function from the getxattr function. This will allow io_uring to call it from its io worker. Signed-off-by: Stefan Roesch <shr@fb.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20220323154420.3301504-3-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24fs: split off setxattr_copy and do_setxattr function from setxattrStefan Roesch
This splits of the setup part of the function setxattr in its own dedicated function called setxattr_copy. In addition it also exposes a new function called do_setxattr for making the setxattr call. This makes it possible to call these two functions from io_uring in the processing of an xattr request. Signed-off-by: Stefan Roesch <shr@fb.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20220323154420.3301504-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: return an error when cqe is droppedDylan Yudaken
Right now io_uring will not actively inform userspace if a CQE is dropped. This is extremely rare, requiring a CQ ring overflow, as well as a GFP_ATOMIC kmalloc failure. However the consequences could cause for example applications to go into an undefined state, possibly waiting for a CQE that never arrives. Return an error code (EBADR) in these cases. Since this is expected to be incredibly rare, try and avoid as much as possible affecting the hot code paths, and so it only is returned lazily and when there is no other available CQEs. Once the error is returned, reset the error condition assuming the user is either ok with it or will clean up appropriately. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220421091345.2115755-6-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: use constants for cq_overflow bitfieldDylan Yudaken
Prepare to use this bitfield for more flags by using constants instead of magic value 0 Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220421091345.2115755-5-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: rework io_uring_enter to simplify return valueDylan Yudaken
io_uring_enter returns the count submitted preferrably over an error code. In some code paths this check is not required, so reorganise the code so that the check is only done as needed. This is also a prep for returning error codes only in waiting scenarios. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220421091345.2115755-4-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: trace cqe overflowsDylan Yudaken
Trace cqe overflows in io_uring. Print ocqe before the check, so if it is NULL it indicates that it has been dropped. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220421091345.2115755-3-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: allow re-poll if we made progressJens Axboe
We currently check REQ_F_POLLED before arming async poll for a notification to retry. If it's set, then we don't allow poll and will punt to io-wq instead. This is done to prevent a situation where a buggy driver will repeatedly return that there's space/data available yet we get -EAGAIN. However, if we already transferred data, then it should be safe to rely on poll again. Gate the check on whether or not REQ_F_PARTIAL_IO is also set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: support MSG_WAITALL for IORING_OP_SEND(MSG)Jens Axboe
Like commit 7ba89d2af17a for recv/recvmsg, support MSG_WAITALL for the send side. If this flag is set and we do a short send, retry for a stream of seqpacket socket. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: add support for IORING_ASYNC_CANCEL_ANYJens Axboe
Rather than match on a specific key, be it user_data or file, allow canceling any request that we can lookup. Works like IORING_ASYNC_CANCEL_ALL in that it cancels multiple requests, but it doesn't key off user_data or the file. Can't be set with IORING_ASYNC_CANCEL_FD, as that's a key selector. Only one may be used at the time. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' keyJens Axboe
Currently sqe->addr must contain the user_data of the request being canceled. Introduce the IORING_ASYNC_CANCEL_FD flag, which tells the kernel that we're keying off the file fd instead for cancelation. This allows canceling any request that a) uses a file, and b) was assigned the file based on the value being passed in. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-5-axboe@kernel.dk
2022-04-24io_uring: add support for IORING_ASYNC_CANCEL_ALLJens Axboe
The current cancelation will lookup and cancel the first request it finds based on the key passed in. Add a flag that allows to cancel any request that matches they key. It completes with the number of requests found and canceled, or res < 0 if an error occured. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-4-axboe@kernel.dk
2022-04-24io_uring: pass in struct io_cancel_data consistentlyJens Axboe
In preparation for being able to not only key cancel off the user_data, pass in the io_cancel_data struct for the various functions that deal with request cancelation. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-3-axboe@kernel.dk
2022-04-24io_uring: remove dead 'poll_only' argument to io_poll_cancel()Jens Axboe
It's only called from one location, and it always passes in 'false'. Kill the argument, and just pass in 'false' to io_poll_find(). Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20220418164402.75259-2-axboe@kernel.dk
2022-04-24io_uring: refactor io_disarm_next() lockingPavel Begunkov
Split timeout handling into removal + failing, so we can reduce spinlocking time and remove another instance of triple nested locking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0f00d115f9d4c5749028f19623708ad3695512d6.1650458197.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: move timeout locking in io_timeout_cancel()Pavel Begunkov
Move ->timeout_lock grabbing inside of io_timeout_cancel(), so we can do io_req_task_queue_fail() outside of the lock. It's much nicer than relying on triple nested locking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cde758c2897930d31e205ed8f476d4ec879a8849.1650458197.git.asml.silence@gmail.com [axboe: drop now wrong timeout_lock annotation] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: store SCM state in io_fixed_file->file_ptrJens Axboe
A previous commit removed SCM accounting for non-unix sockets, as those are the only ones that can cause a fixed file reference. While that is true, it also means we're now dereferencing the file as part of the workqueue driven __io_sqe_files_unregister() after the process has exited. This isn't safe for SCM files, as unix gc may have already reaped them when the process exited. KASAN complains about this: [ 12.307040] Freed by task 0: [ 12.307592] kasan_save_stack+0x28/0x4c [ 12.308318] kasan_set_track+0x28/0x38 [ 12.309049] kasan_set_free_info+0x24/0x44 [ 12.309890] ____kasan_slab_free+0x108/0x11c [ 12.310739] __kasan_slab_free+0x14/0x1c [ 12.311482] slab_free_freelist_hook+0xd4/0x164 [ 12.312382] kmem_cache_free+0x100/0x1dc [ 12.313178] file_free_rcu+0x58/0x74 [ 12.313864] rcu_core+0x59c/0x7c0 [ 12.314675] rcu_core_si+0xc/0x14 [ 12.315496] _stext+0x30c/0x414 [ 12.316287] [ 12.316687] Last potentially related work creation: [ 12.317885] kasan_save_stack+0x28/0x4c [ 12.318845] __kasan_record_aux_stack+0x9c/0xb0 [ 12.319976] kasan_record_aux_stack_noalloc+0x10/0x18 [ 12.321268] call_rcu+0x50/0x35c [ 12.322082] __fput+0x2fc/0x324 [ 12.322873] ____fput+0xc/0x14 [ 12.323644] task_work_run+0xac/0x10c [ 12.324561] do_notify_resume+0x37c/0xe74 [ 12.325420] el0_svc+0x5c/0x68 [ 12.326050] el0t_64_sync_handler+0xb0/0x12c [ 12.326918] el0t_64_sync+0x164/0x168 [ 12.327657] [ 12.327976] Second to last potentially related work creation: [ 12.329134] kasan_save_stack+0x28/0x4c [ 12.329864] __kasan_record_aux_stack+0x9c/0xb0 [ 12.330735] kasan_record_aux_stack+0x10/0x18 [ 12.331576] task_work_add+0x34/0xf0 [ 12.332284] fput_many+0x11c/0x134 [ 12.332960] fput+0x10/0x94 [ 12.333524] __scm_destroy+0x80/0x84 [ 12.334213] unix_destruct_scm+0xc4/0x144 [ 12.334948] skb_release_head_state+0x5c/0x6c [ 12.335696] skb_release_all+0x14/0x38 [ 12.336339] __kfree_skb+0x14/0x28 [ 12.336928] kfree_skb_reason+0xf4/0x108 [ 12.337604] unix_gc+0x1e8/0x42c [ 12.338154] unix_release_sock+0x25c/0x2dc [ 12.338895] unix_release+0x58/0x78 [ 12.339531] __sock_release+0x68/0xec [ 12.340170] sock_close+0x14/0x20 [ 12.340729] __fput+0x18c/0x324 [ 12.341254] ____fput+0xc/0x14 [ 12.341763] task_work_run+0xac/0x10c [ 12.342367] do_notify_resume+0x37c/0xe74 [ 12.343086] el0_svc+0x5c/0x68 [ 12.343510] el0t_64_sync_handler+0xb0/0x12c [ 12.344086] el0t_64_sync+0x164/0x168 We have an extra bit we can use in file_ptr on 64-bit, use that to store whether this file is SCM'ed or not, avoiding the need to look at the file contents itself. This does mean that 32-bit will be stuck with SCM for all registered files, just like 64-bit did before the referenced commit. Fixes: 1f59bc0f18cf ("io_uring: don't scm-account for non af_unix sockets") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-24io_uring: kill ctx arg from io_req_put_rsrcPavel Begunkov
The ctx argument of io_req_put_rsrc() is not used, kill it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bb51bf3ff02775b03e6ea21bc79c25d7870d1644.1650311386.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>