path: root/fs
Age  Commit message  Author
2021-10-20  gfs2: Move the inode glock locking to gfs2_file_buffered_write  (Andreas Gruenbacher)
So far, for buffered writes, we were taking the inode glock in gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of not holding the inode glock while iomap_write_actor faults in user pages. It turns out that iomap_write_actor is called inside iomap_begin ... iomap_end, so the user pages were still faulted in while holding the inode glock and the locking code in iomap_begin / iomap_end was completely pointless. Move the locking into gfs2_file_buffered_write instead. We'll take care of the potential deadlocks due to faulting in user pages while holding a glock in a subsequent patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
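A minimal sketch of the resulting shape of gfs2_file_buffered_write, assuming the glock is simply taken around the iomap call; this follows the commit description rather than the exact upstream diff:

```c
static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	struct gfs2_inode *ip = GFS2_I(inode);
	struct gfs2_holder gh;
	ssize_t ret;

	/* Take the inode glock here, around the whole buffered write ... */
	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
	ret = gfs2_glock_nq(&gh);
	if (ret)
		goto out_uninit;

	/* ... instead of inside gfs2_iomap_begin / gfs2_iomap_end. */
	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);

	gfs2_glock_dq(&gh);
out_uninit:
	gfs2_holder_uninit(&gh);
	return ret;
}
```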
2021-10-20  gfs2: Introduce flag for glock holder auto-demotion  (Bob Peterson)
This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that will allow glocks to be demoted automatically on locking conflicts. When a locking request comes in that isn't compatible with the locking state of an active holder and that holder has the HIF_MAY_DEMOTE flag set, the holder will be demoted before the incoming locking request is granted. Note that this mechanism demotes active holders (with the HIF_HOLDER flag set), while before we were only demoting glocks without any active holders. This allows processes to keep hold of locks that may form a cyclic locking dependency; the core glock logic will then break those dependencies in case a conflicting locking request occurs. We'll use this to avoid giving up the inode glock proactively before faulting in pages. Processes that allow a glock holder to be taken away indicate this by calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag. Later, they call gfs2_holder_disallow_demote() to clear the flag again, and then they check if their holder is still queued: if it is, they are still holding the glock; if it isn't, they can re-acquire the glock (or abort). Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
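A minimal sketch of the allow/disallow-demote pattern the commit describes; fault_in_user_pages() is a hypothetical stand-in for whatever work may form a locking cycle, and gfs2_holder_queued() is assumed to report whether the holder is still queued:

```c
static int do_demotable_work(struct gfs2_holder *gh)
{
	int ret;

	gfs2_holder_allow_demote(gh);		/* sets HIF_MAY_DEMOTE */
	ret = fault_in_user_pages();		/* hypothetical; the glock may be demoted meanwhile */
	gfs2_holder_disallow_demote(gh);	/* clears HIF_MAY_DEMOTE */

	if (!gfs2_holder_queued(gh)) {
		/* the glock was taken away: re-acquire it (or abort) */
		int ret2 = gfs2_glock_nq(gh);
		if (ret2)
			return ret2;
	}
	return ret;
}
```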
2021-10-20  gfs2: Clean up function may_grant  (Andreas Gruenbacher)
Pass the first current glock holder into function may_grant and deobfuscate the logic there. While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant. To make that build cleanly, de-constify the may_grant arguments. We're now using function find_first_holder in do_promote, so move the function's definition above do_promote. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20  gfs2: Add wrapper for iomap_file_buffered_write  (Andreas Gruenbacher)
Add a wrapper around iomap_file_buffered_write. We'll add code here later for when the operation needs to be retried. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-20  block: cleanup the flush plug helpers  (Christoph Hellwig)
Consolidate the various helpers into a single blk_flush_plug helper that takes a blk_plug and the from_scheduler bool, and switch all callsites to call it directly. Checks that the plug is non-NULL must be performed by the caller, something that most already do anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
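A sketch of the caller-side pattern after the consolidation, assuming the common case of flushing the current task's plug; the NULL check now lives at the call site:

```c
struct blk_plug *plug = current->plug;

if (plug)
	blk_flush_plug(plug, false);	/* from_scheduler == false */
```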
2021-10-20  io_uring: fix ltimeout unprep  (Pavel Begunkov)
io_unprep_linked_timeout() is broken: first, it needs to restore REQ_F_ARM_LTIMEOUT so the linked timeout is enqueued and disarmed. But now that we refcount it, linked timeouts may not get executed at all, leaking a request. Just kill the unprep optimisation. Fixes: 906c6caaf586 ("io_uring: optimise io_prep_linked_timeout()") Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com Link: https://github.com/axboe/liburing/issues/460 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20  io_uring: apply max_workers limit to all future users  (Pavel Begunkov)
Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task that issued it, which is unexpected for users. If one task creates a ring, limits workers and then passes it to another task, the limit won't be applied to the other task. Another pitfall is that a task must either create a ring or submit at least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all, further complicating the picture. Change the API: save the limits and apply them to all future users. Note that this should be done before giving away the ring or submitting new requests; otherwise the result is not guaranteed. Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Link: https://github.com/axboe/liburing/issues/460 Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
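A hedged userspace sketch using the liburing wrapper for this opcode; the worker counts are arbitrary, and the point is only the ordering (register the limits before sharing the ring or submitting):

```c
#include <liburing.h>

int setup_ring(struct io_uring *ring)
{
	/* [0] = bounded workers, [1] = unbounded workers */
	unsigned int max_workers[2] = { 4, 4 };
	int ret;

	ret = io_uring_queue_init(64, ring, 0);
	if (ret < 0)
		return ret;

	/* Apply limits now, before giving the ring away, so the saved
	 * limits cover all future users as described above. */
	return io_uring_register_iowq_max_workers(ring, max_workers);
}
```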
2021-10-20  io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR())  (Changcheng Deng)
Use ERR_CAST() instead of ERR_PTR(PTR_ERR()). This makes the code more readable and also fixes this warning detected by err_cast.cocci: ./fs/io_uring.c: WARNING: 3208: 11-18: ERR_CAST can be used with buf Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Link: https://lore.kernel.org/r/20211020084948.1038420-1-deng.changcheng@zte.com.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
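For reference, the transformation the Coccinelle rule suggests; both forms convert an error pointer from one pointer type to another, but ERR_CAST states the intent directly:

```c
if (IS_ERR(buf))
	return ERR_PTR(PTR_ERR(buf));	/* before */

if (IS_ERR(buf))
	return ERR_CAST(buf);		/* after */
```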
2021-10-20  security: Return xattr name from security_dentry_init_security()  (Vivek Goyal)
Right now security_dentry_init_security() only supports a single security label and is used by SELinux only. There are two users of this hook, namely ceph and nfs. NFS does not care about the xattr name. Ceph hardcodes the xattr name to security.selinux (XATTR_NAME_SELINUX). I am making changes to fuse/virtiofs to send the security label to virtiofsd, and I need to send the xattr name as well. I had also hardcoded the xattr name to security.selinux. Stephen Smalley suggested that it is probably a good idea to modify security_dentry_init_security() to also return the name of the xattr so that we can avoid this hardcoding in the callers. This patch adds a new parameter "const char **xattr_name" to security_dentry_init_security(), and the LSM fills in the xattr name too if the caller asked for it (xattr_name != NULL). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: James Morris <jamorris@linux.microsoft.com> [PM: fixed typos in the commit description] Signed-off-by: Paul Moore <paul@paul-moore.com>
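A sketch of the extended hook and a caller that wants the name; the parameter order here is an assumption based on the description, not a quote of the upstream header:

```c
int security_dentry_init_security(struct dentry *dentry, int mode,
				  const struct qstr *name,
				  const char **xattr_name,
				  void **ctx, u32 *ctxlen);

/* A caller that needs the xattr name (e.g. to send it to virtiofsd): */
err = security_dentry_init_security(dentry, mode, &dentry->d_name,
				    &xattr_name, &ctx, &ctxlen);

/* A caller that doesn't care simply passes NULL for xattr_name. */
```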
2021-10-20  ksmbd: add buffer validation in session setup  (Marios Makassikis)
Make sure the security buffer's length and offset are valid with regard to the packet length. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr> Signed-off-by: Steve French <stfrench@microsoft.com>
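A sketch of the kind of bounds check described; the names are illustrative rather than the exact ksmbd structures, and the subtraction form avoids integer overflow:

```c
static int validate_sec_buffer(unsigned int pkt_len,
			       unsigned int sec_off, unsigned int sec_len)
{
	/* the offset must lie inside the packet, and the length must fit after it */
	if (sec_off > pkt_len || sec_len > pkt_len - sec_off)
		return -EINVAL;
	return 0;
}
```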
2021-10-20  ksmbd: throttle session setup failures to avoid dictionary attacks  (Namjae Jeon)
To avoid dictionary attacks (repeated session setups sent rapidly to connect to the server), ksmbd imposes a 5-second delay on session setup failure, making it harder to send enough random connection requests to break into a server when a user enters the wrong password 10 times in a row. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
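A minimal sketch of the throttle, assuming a simple sleep in the failure path; ssleep() is the kernel's uninterruptible sleep in seconds:

```c
if (rc < 0) {
	/* delay failed session setups to slow down dictionary attacks */
	ssleep(5);
}
```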
2021-10-20  ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests  (Hyunchul Lee)
Validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests and check the free size of response buffer for these requests. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-10-19  io_uring: split logic of force_nonblock  (Hao Xu)
Currently force_nonblock stands for two meanings: - nowait or not - in an io-worker or not (holding uring_lock or not) Let's split the logic into two flags, IO_URING_F_NONBLOCK and IO_URING_F_UNLOCKED, for the convenience of the next patch. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20211018133431.103298-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
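A sketch of the flag split as described; the exact values are illustrative:

```c
enum {
	IO_URING_F_NONBLOCK	= 1,	/* must not block (nowait) */
	IO_URING_F_UNLOCKED	= 2,	/* uring_lock not held, i.e. running
					 * from an io-wq worker */
};
```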
2021-10-19  io-wq: max_worker fixes  (Pavel Begunkov)
First, fix the nr_workers checks against max_workers: with max_workers registration, it may pretty easily happen that nr_workers > max_workers. Also, synchronise writing to acct->max_workers with wqe->lock. It's not an actual problem, but as we don't care about io_wqe_create_worker(), it's better than WRITE_ONCE()/READ_ONCE(). Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/11f90e6b49410b7d1a88f5d04fb8d95bb86b8cf3.1634671835.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  xfs: use separate btree cursor cache for each btree type  (Darrick J. Wong)
Now that we have the infrastructure to track the max possible height of each btree type, we can create a separate slab cache for cursors of each type of btree. For smaller indices like the free space btrees, this means that we can pack more cursors into a slab page, improving slab utilization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: compute absolute maximum nlevels for each btree type  (Darrick J. Wong)
Add code for all five btree types so that we can compute the absolute maximum possible btree height for each btree type. This is a setup for the next patch, which makes every btree type have its own cursor cache. The functions are exported so that we can have xfs_db report the absolute maximum btree heights for each btree type, rather than making everyone run their own ad-hoc computations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: kill XFS_BTREE_MAXLEVELS  (Darrick J. Wong)
Nobody uses this symbol anymore, so kill it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: compute the maximum height of the rmap btree when reflink enabled  (Darrick J. Wong)
Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big enough to handle the maximally tall rmap btree when all blocks are in use and maximally shared, let's compute the maximum height assuming the rmapbt consumes as many blocks as possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: clean up xfs_btree_{calc_size,compute_maxlevels}  (Darrick J. Wong)
During review of the next patch, Dave remarked that he found these two btree geometry calculation functions lacking in documentation and that they performed more work than was really necessary. These functions take the same parameters and have nearly the same logic; the only real difference is in the return values. Reword the function comment to make it clearer what each function does, and move them to be adjacent to reinforce their relation. Clean up both of them to stop opencoding the howmany functions, stop using the uint typedefs, and make them both support computations for more than 2^32 leaf records, since we're going to need all of the above for files with large data forks and large rmap btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
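A sketch of the cleaned-up maxlevels computation implied by the description, using the howmany helpers and 64-bit record counts; limits[0] is the per-leaf record limit and limits[1] the per-node keyptr limit, and this mirrors the description rather than quoting the upstream code:

```c
unsigned int
xfs_btree_compute_maxlevels(
	const unsigned int	*limits,
	unsigned long long	records)
{
	unsigned long long	level_blocks = howmany_64(records, limits[0]);
	unsigned int		height = 1;

	/* each additional level fans out by the per-node keyptr limit */
	while (level_blocks > 1) {
		level_blocks = howmany_64(level_blocks, limits[1]);
		height++;
	}
	return height;
}
```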
2021-10-19  xfs: compute maximum AG btree height for critical reservation calculation  (Darrick J. Wong)
Compute the actual maximum AG btree height for deciding if a per-AG block reservation is critically low. This only affects the sanity check condition, since we /generally/ will trigger on the 10% threshold. This is a long-winded way of saying that we're removing one more usage of XFS_BTREE_MAXLEVELS. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: rename m_ag_maxlevels to m_alloc_maxlevels  (Darrick J. Wong)
Years ago when XFS was thought to be much more simple, we introduced m_ag_maxlevels to specify the maximum btree height of per-AG btrees for a given filesystem mount. Then we observed that inode btrees don't actually have the same height and split that off; and now we have rmap and refcount btrees with much different geometries and separate maxlevels variables. The 'ag' part of the name doesn't make much sense anymore, so rename this to m_alloc_maxlevels to reinforce that this is the maximum height of the *free space* btrees. This sets us up for the next patch, which will add a variable to track the maximum height of all AG btrees. (Also take the opportunity to improve adjacent comments and fix minor style problems.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: dynamically allocate cursors based on maxlevels  (Darrick J. Wong)
To support future btree code, we need to be able to size btree cursors dynamically for very large btrees. Switch the maxlevels computation to use the precomputed values in the superblock, and create cursors that can handle a certain height. For now, we retain the btree cursor cache that can handle up to 9-level btrees, though a subsequent patch introduces separate caches for each btree type, where each cache's objects will be exactly tall enough to handle the specific btree type. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: encode the max btree height in the cursor  (Darrick J. Wong)
Encode the maximum btree height in the cursor, since we're soon going to allow smaller cursors for AG btrees and larger cursors for file btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: refactor btree cursor allocation function  (Darrick J. Wong)
Refactor btree allocation to a common helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: rearrange xfs_btree_cur fields for better packing  (Darrick J. Wong)
Reduce the size of the btree cursor structure some more by rearranging fields to eliminate unused space. While we're at it, fix the ragged indentation and a spelling error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: prepare xfs_btree_cur for dynamic cursor heights  (Darrick J. Wong)
Split out the btree level information into a separate struct and put it at the end of the cursor structure as a VLA. Files with huge data forks (and in the future, the realtime rmap btree) will require the ability to support many more levels than a per-AG btree cursor, which means that we're going to create per-btree type cursor caches to conserve memory for the more common case. Note that a subsequent patch actually introduces dynamic cursor heights. This one merely rearranges the structure to prepare for that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
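A sketch of the layout change with simplified field contents: the per-level state moves into a flexible array at the end of the cursor and is sized at allocation time:

```c
struct xfs_btree_level {
	struct xfs_buf		*bp;	/* buffer for this level */
	uint16_t		ptr;	/* key/record slot in the block */
	uint16_t		ra;	/* readahead state */
};

struct xfs_btree_cur {
	/* ... fixed-size cursor fields ... */
	uint8_t			bc_nlevels;	/* levels allocated below */
	struct xfs_btree_level	bc_levels[];	/* VLA, must come last */
};

/* allocation sizes the trailing array for this btree's height */
cur = kmalloc(struct_size(cur, bc_levels, nlevels), GFP_NOFS | __GFP_NOFAIL);
```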
2021-10-19  xfs: dynamically allocate btree scrub context structure  (Darrick J. Wong)
Reorganize struct xchk_btree so that we can dynamically size the context structure to fit the type of btree cursor that we have. This will enable us to use memory more efficiently once we start adding very tall btree types. Right-size the lastkey array to match the number of *node* levels in the tree so that we stop wasting space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: don't track firstrec/firstkey separately in xchk_btree  (Darrick J. Wong)
The btree scrubbing code checks that the records (or keys) that it finds in a btree block are all in order by calling the btree cursor's ->recs_inorder function. This of course makes no sense for the first item in the block, so we switch that off with a separate variable in struct xchk_btree. Christoph helped me figure out that the variable is unnecessary, since we just accessed bc_ptrs[level] and can compare that against zero. Use that, and save ourselves some memory space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: reduce the size of nr_ops for refcount btree cursors  (Darrick J. Wong)
We're never going to run more than 4 billion btree operations on a refcount cursor, so shrink the field to an unsigned int to reduce the structure size. Fix whitespace alignment too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: remove xfs_btree_cur.bc_blocklog  (Darrick J. Wong)
This field isn't used by anyone, so get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: fix incorrect decoding in xchk_btree_cur_fsbno  (Darrick J. Wong)
During review of subsequent patches, Dave and I noticed that this function doesn't work quite right -- accessing cur->bc_ino depends on the ROOT_IN_INODE flag, not LONG_PTRS. Fix that and the parentheses issue. While we're at it, remove the piece that accesses cur->bc_ag, because block 0 of an AG is never part of a btree. Note: This changes the btree scrubber tracepoint behavior -- if the cursor has no buffer for a certain level, it will always report NULLFSBLOCK. It is assumed that anyone tracing the online fsck code will also be tracing xchk_start/xchk_done or otherwise be aware of what exactly is being scrubbed. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19  xfs: fix perag reference leak on iteration race with growfs  (Brian Foster)
The for_each_perag*() set of macros are hacky in that some (i.e. those based on sb_agcount) rely on the assumption that perag iteration terminates naturally with a NULL perag at the specified end_agno. Others allow for the final AG to have a valid perag and require the calling function to clean up any potential leftover xfs_perag reference on termination of the loop. Aside from providing a subtly inconsistent interface, the former variant is racy with growfs because growfs can create discoverable post-eofs perags before the final superblock update that completes the grow operation and increases sb_agcount. This leads to the following assert failure (reproduced by xfs/104) in the perag free path during unmount: XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/libxfs/xfs_ag.c, line: 195 This occurs because one of the many for_each_perag() loops in the code that is expected to terminate with a NULL pag (and thus has no post-loop xfs_perag_put() check) raced with a growfs and found a non-NULL post-EOFS perag, but terminated naturally based on the end_agno check without releasing the post-EOFS perag. Rework the iteration logic to lift the agno check from the main for loop conditional to the iteration helper function. The for loop now purely terminates on a NULL pag and xfs_perag_next() avoids taking a reference to any perag beyond end_agno in the first place. Fixes: f250eedcf762 ("xfs: make for_each_perag... a first class citizen") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
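A sketch of the reworked iteration helper described above, close to but not necessarily identical with the upstream fix: the end_agno check moves into the helper, so the loop terminates purely on a NULL pag and never takes a reference past end_agno:

```c
static inline struct xfs_perag *
xfs_perag_next(struct xfs_perag *pag, xfs_agnumber_t *agno,
	       xfs_agnumber_t end_agno)
{
	struct xfs_mount	*mp = pag->pag_mount;

	*agno = pag->pag_agno + 1;
	xfs_perag_put(pag);		/* drop the reference we iterated */
	if (*agno > end_agno)
		return NULL;		/* never reference a post-EOFS perag */
	return xfs_perag_get(mp, *agno);
}

#define for_each_perag_range(mp, agno, end_agno, pag) \
	for ((pag) = xfs_perag_get((mp), (agno)); \
	     (pag) != NULL; \
	     (pag) = xfs_perag_next((pag), &(agno), (end_agno)))
```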
2021-10-19  xfs: terminate perag iteration reliably on agcount  (Brian Foster)
The for_each_perag_from() iteration macro relies on sb_agcount to process every perag currently within EOFS from a given starting point. It's perfectly valid to have perag structures beyond sb_agcount, however, such as if a growfs is in progress. If a perag loop happens to race with growfs in this manner, it will actually attempt to process the post-EOFS perag where ->pag_agno == sb_agcount. This is reproduced by xfs/104 and manifests as the following assert failure in superblock write verifier context: XFS: Assertion failed: agno < mp->m_sb.sb_agcount, file: fs/xfs/libxfs/xfs_types.c, line: 22 Update the corresponding macro to only process perags that are within the current sb_agcount. Fixes: 58d43a7e3263 ("xfs: pass perags around in fsmap data dev functions") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19  xfs: rename the next_agno perag iteration variable  (Brian Foster)
Rename the next_agno variable to be consistent across the several iteration macros and shorten line length. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19  xfs: fold perag loop iteration logic into helper function  (Brian Foster)
Fold the loop iteration logic into a helper in preparation for further fixups. No functional change in this patch. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19  xfs: replace snprintf in show functions with sysfs_emit  (Qing Wang)
coccicheck complains about the use of snprintf() in sysfs show functions. Fix the coccicheck warning: WARNING: use scnprintf or sprintf. Using sysfs_emit instead of scnprintf or sprintf makes more sense. Signed-off-by: Qing Wang <wangqing@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
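A sketch of the conversion; the attribute and value names are hypothetical:

```c
static ssize_t foo_show(struct kobject *kobj, struct kobj_attribute *attr,
			char *buf)
{
	/* sysfs_emit() knows a sysfs buffer is exactly PAGE_SIZE */
	return sysfs_emit(buf, "%d\n", foo_value);
	/* was: return snprintf(buf, PAGE_SIZE, "%d\n", foo_value); */
}
```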
2021-10-19  locks: remove changelog comments  (J. Bruce Fields)
This is only of historical interest, and anyone interested in the history can dig out an old version of locks.c from git. Triggered by the observation that it references the now-removed Documentation/filesystems/mandatory-locking.rst. Reported-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-10-19  io_uring: warning about unused-but-set parameter  (Arnd Bergmann)
When enabling -Wunused warnings by building with W=1, I get an instance of the -Wunused-but-set-parameter warning in the io_uring code: fs/io_uring.c: In function 'io_queue_async_work': fs/io_uring.c:1445:61: error: parameter 'locked' set but not used [-Werror=unused-but-set-parameter] 1445 | static void io_queue_async_work(struct io_kiocb *req, bool *locked) | ~~~~~~^~~~~~ There are very few warnings of this type, so it would be nice to enable this by default and fix all the existing instances. As the assignment serves no purpose by itself other than to prevent developers from using the variable, an easy workaround is to remove the assignment and just rename the argument to "dont_use". Fixes: f237c30a5610 ("io_uring: batch task work locking") Link: https://lore.kernel.org/lkml/20210920121352.93063-1-arnd@kernel.org/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20211019153507.348480-1-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  erofs: lzma compression support  (Gao Xiang)
Add MicroLZMA support in order to maximize compression ratios for specific scenarios. For example, it's useful for low-end embedded boards and as a secondary algorithm in a file for specific access patterns. MicroLZMA is a new container format for raw LZMA1, which was created by Lasse Collin aiming to minimize old LZMA headers and get rid of the unnecessary EOPM (end of payload marker) as well as to enable fixed-sized output compression, especially for 4KiB pclusters. Similar to LZ4, an in-place I/O approach is used to minimize runtime memory footprint when dealing with I/O. Overlapped decompression is handled with 1) a bounce buffer for data under processing or 2) extra short-lived pages from the on-stack pagepool which will be shared in the same read request (128KiB for example). Link: https://lore.kernel.org/r/20211010213145.17462-8-xiang@kernel.org Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19  erofs: rename some generic methods in decompressor  (Gao Xiang)
Previously, some LZ4 methods were named with `generic'. However, while evaluating the effective LZMA approach, it turned out they aren't quite generic at all (e.g. there is no need to prepare dstpages for most LZMA cases). Avoid such naming instead. Link: https://lore.kernel.org/r/20211010213145.17462-7-xiang@kernel.org Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19  erofs: introduce readmore decompression strategy  (Gao Xiang)
Previously, the readahead window was strictly followed by the EROFS decompression strategy in order to minimize extra memory footprint. However, this could become inefficient when only part of the requested data is read from very large LZ4 pclusters, and for the upcoming LZMA implementation. Let's instead try to request the leading data in a pcluster without triggering memory reclaim; for the LZ4 approach this boosts 100% random reads of large pclusters, and it has no real impact on low-memory scenarios. It also introduces a way to expand read lengths in order to decompress the whole pcluster, which is useful for LZMA since the algorithm itself is relatively slow and CPU-bound, while LZ4 is not. Link: https://lore.kernel.org/r/20211008200839.24541-4-xiang@kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19  erofs: introduce the secondary compression head  (Gao Xiang)
Previously, each HEAD lcluster could be either a HEAD or a PLAIN lcluster, indicating whether the whole pcluster is compressed or not. In this patch, a new HEAD2 head type is introduced to specify a compression algorithm other than the primary algorithm for each compressed file, which can be used for the upcoming LZMA compression and LZ4 range dictionary compression for various data patterns. It has been on the EROFS roadmap for years. Complete it now! Link: https://lore.kernel.org/r/20211017165721.2442-1-xiang@kernel.org Reviewed-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19  io_uring: inform block layer of how many requests we are submitting  (Jens Axboe)
The block layer can use this knowledge to make smarter decisions on how to handle the request, if it knows that N more may be coming. Switch to using blk_start_plug_nr_ios() to pass in that information. Signed-off-by: Jens Axboe <axboe@kernel.dk>
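A sketch of the submission-side change as described, with to_submit standing in for however many requests the ring is about to issue:

```c
struct blk_plug plug;

/* tell the block layer how many I/Os this plug is expected to cover */
blk_start_plug_nr_ios(&plug, to_submit);

/* ... submit up to to_submit requests ... */

blk_finish_plug(&plug);
```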
2021-10-19  io_uring: simplify io_file_supports_nowait()  (Pavel Begunkov)
Make sure that REQ_F_SUPPORT_NOWAIT is always set in io_prep_rw(), so we can stop caring about setting it down the line, simplifying io_file_supports_nowait(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/60c8f1f5e2cb45e00f4897b2cec10c5b3669da91.1634425438.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags  (Pavel Begunkov)
Merge REQ_F_NOWAIT_READ and REQ_F_NOWAIT_WRITE into one flag, i.e. REQ_F_SUPPORT_NOWAIT. First, it gets rid of the dependence on CONFIG_64BIT, but it also simplifies the code. One thing to consider is when we don't have ->{read,write}_iter and go through loop_rw_iter(): just fail it with -EAGAIN if we expect nowait behaviour but aren't sure whether the file supports it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f832a20e5186c2e79c6519280c238f559a1d2bbc.1634425438.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  io_uring: arm poll for non-nowait files  (Pavel Begunkov)
Don't check if we can do nowait before arming apoll; there are several reasons for that. First, we don't care much about files that don't support nowait. Second, it may be useful -- we don't want to be taking away extra workers from io-wq when a request could instead go async. Even if it will go through io-wq eventually, it makes a difference in the number of workers actually used. And last, it's needed to clean up nowait in future commits. [kernel test robot: fix unused-var] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9d06f3cb2c8b686d970269a87986f154edb83043.1634425438.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  fs/io_uring: Prioritise checking faster conditions first in io_write  (Noah Goldstein)
This commit reorders the conditions in a branch in io_write. The reorder checks 'ret2 == -EAGAIN' first, as checking '(req->ctx->flags & IORING_SETUP_IOPOLL)' will likely be more expensive due to two memory dereferences. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Link: https://lore.kernel.org/r/20211017013229.4124279-1-goldstein.w.n@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
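An illustration of the reorder (the surrounding condition is an assumption, not the exact upstream code): the register compare short-circuits before the two pointer dereferences:

```c
/* before: two dereferences evaluated first */
if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN)
	goto copy_iov;

/* after: the cheap compare usually decides the branch on its own */
if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
	goto copy_iov;
```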
2021-10-19  io_uring: clean io_prep_rw()  (Pavel Begunkov)
We already store req->file in a variable in io_prep_rw(), so just use it instead of a couple of leftover references to kiocb->ki_filp. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f5889fc7ab670daefd5ccaedd99416d8355f0ad.1634314022.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  io_uring: optimise fixed rw rsrc node setting  (Pavel Begunkov)
Move the fixed rw io_req_set_rsrc_node() call from rw prep into io_import_fixed(); if we're using fixed buffers, it will always be called during submission, as we save the state in advance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/68c06f66d5aa9661f1e4b88d08c52d23528297ec.1634314022.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19  io_uring: return iovec from __io_import_iovec  (Pavel Begunkov)
We pass an iovec** into __io_import_iovec(), which it should keep, initialise and modify accordingly. That's expensive; return the iovec directly from __io_import_iovec, encoding errors with ERR_PTR if needed. io_import_iovec keeps the old interface, but it's inline, so everything is optimised nicely. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6230e9769982f03a8f86fa58df24666088c44d3e.1634314022.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
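A sketch of the new calling convention, with hypothetical parameters: the iovec comes back as the return value and errors are ERR_PTR-encoded:

```c
struct iovec *iov;

iov = __io_import_iovec(rw, req, &state, issue_flags);
if (IS_ERR(iov))
	return PTR_ERR(iov);
/* ... use iov ... */
```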