summaryrefslogtreecommitdiff
path: root/fs/erofs/zdata.c
AgeCommit message (Collapse)Author
2024-01-27erofs: relaxed temporary buffers allocation on readaheadChunhai Guo
Even with inplace decompression, sometimes very few temporary buffers may be still needed for a single decompression shot (e.g. 16 pages for 64k sliding window or 4 pages for 16k sliding window). In low-memory scenarios, it would be better to try to allocate with GFP_NOWAIT on readahead first. That can help reduce the time spent on page allocation under durative memory pressure. Here are detailed performance numbers under multi-app launch benchmark workload [1] on ARM64 Android devices (8-core CPU and 8GB of memory) running a 5.15 LTS kernel with EROFS of 4k pclusters: +----------------------------------------------+ | LZ4 | vanilla | patched | diff | |----------------+---------+---------+---------| | Average (ms) | 3364 | 2684 | -20.21% | [64k sliding window] |----------------+---------+---------+---------| | Average (ms) | 2079 | 1610 | -22.56% | [16k sliding window] +----------------------------------------------+ The total size of system images for 4k pclusters is almost unchanged: (64k sliding window) 9,117,044 KB (16k sliding window) 9,113,096 KB Therefore, in addition to switch the sliding window from 64k to 16k, after applying this patch, it can eventually save 52.14% (3364 -> 1610) on average with no memory reservation. That is particularly useful for embedded devices with limited resources. [1] https://lore.kernel.org/r/20240109074143.4138783-1-guochunhai@vivo.com Suggested-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20240126140142.201718-1-hsiangkao@linux.alibaba.com
2024-01-26erofs: fix infinite loop due to a race of filling compressed_bvecsGao Xiang
I encountered a race issue after lengthy (~594647 secs) stress tests on a 64k-page arm64 VM with several 4k-block EROFS images. The timing is like below: z_erofs_try_inplace_io z_erofs_fill_bio_vec cmpxchg(&compressed_bvecs[].page, NULL, ..) [access bufvec] compressed_bvecs[] = *bvec; Previously, z_erofs_submit_queue() just accessed bufvec->page only, so other fields in bufvec didn't matter. After the subpage block support is landed, .offset and .end can be used too, but filling bufvec isn't an atomic operation which can cause inconsistency. Let's use a spinlock to keep the atomicity of each bufvec. More specifically, just reuse the existing spinlock `pcl->obj.lockref.lock` since it's rarely used (also it takes a short time if even used) as long as the pcluster has a reference. Fixes: 192351616a9d ("erofs: support I/O submission for sub-page compressed blocks") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Link: https://lore.kernel.org/r/20240125120039.3228103-1-hsiangkao@linux.alibaba.com
2024-01-25erofs: get rid of unneeded GFP_NOFSJingbo Xu
Clean up some leftovers since there is no way for EROFS to be called again from a reclaim context. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240124031945.130782-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-12-21erofs: allow partially filled compressed bvecsYue Hu
In order to reduce memory footprints even further, let's allow partially filled compressed bvecs for readahead to bail out later. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20231221062341.23901-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-12-18erofs: enable sub-page compressed block supportGao Xiang
Let's just disable cached decompression and inplace I/Os for partial pages as the first step in order to enable sub-page block initial support. In other words, currently it works primarily based on temporary short-lived pages. Don't expect too much in terms of performance. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20231206091057.87027-6-hsiangkao@linux.alibaba.com
2023-12-18erofs: fix ztailpacking for subpage compressed blocksGao Xiang
`pageofs_in` should be the compressed data offset of the page rather than of the block. Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20231214161337.753049-1-hsiangkao@linux.alibaba.com
2023-12-15erofs: record `pclustersize` in bytes instead of pagesGao Xiang
Currently, compressed sizes are recorded in pages using `pclusterpages`, However, for tailpacking pclusters, `tailpacking_size` is used instead. This approach doesn't work when dealing with sub-page blocks. To address this, let's switch them to the unified `pclustersize` in bytes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20231206091057.87027-3-hsiangkao@linux.alibaba.com
2023-12-15erofs: support I/O submission for sub-page compressed blocksGao Xiang
Add a basic I/O submission path first to support sub-page blocks: - Temporary short-lived pages will be used entirely; - In-place I/O pages can be used partially, but compressed pages need to be able to be mapped in contiguous virtual memory. As a start, currently cache decompression is explicitly disabled for sub-page blocks, which will be supported in the future. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20231206091057.87027-2-hsiangkao@linux.alibaba.com
2023-12-15erofs: fix memory leak on short-lived bounced pagesGao Xiang
Both MicroLZMA and DEFLATE algorithms can use short-lived pages on demand for the overlapped inplace I/O decompression. However, those short-lived pages are actually added to `be->compressed_pages`. Thus, it should be checked instead of `pcl->compressed_bvecs`. The LZ4 algorithm doesn't work like this, so it won't be impacted. Fixes: 67139e36d970 ("erofs: introduce `z_erofs_parse_in_bvecs'") Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20231128180431.4116991-1-hsiangkao@linux.alibaba.com
2023-10-31erofs: fix erofs_insert_workgroup() lockref usageGao Xiang
As Linus pointed out [1], lockref_put_return() is fundamentally designed to be something that can fail. It behaves as a fastpath-only thing, and the failure case needs to be handled anyway. Actually, since the new pcluster was just allocated without being populated, it won't be accessed by others until it is inserted into XArray, so lockref helpers are actually unneeded here. Let's just set the proper reference count on initializing. [1] https://lore.kernel.org/r/CAHk-=whCga8BeQnJ3ZBh_Hfm9ctba_wpF444LpwRybVNMzO6Dw@mail.gmail.com Fixes: 7674a42f35ea ("erofs: use struct lockref to replace handcrafted approach") Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20231031060524.1103921-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: release ztailpacking pclusters properlyJingbo Xu
Currently ztailpacking pclusters are chained with FOLLOWED_NOINPLACE and not recorded into the managed_pslots XArray. After commit 7674a42f35ea ("erofs: use struct lockref to replace handcrafted approach"), ztailpacking pclusters won't be freed with erofs_workgroup_put() anymore, which will cause the following issue: BUG erofs_pcluster-1 (Tainted: G OE ): Objects remaining in erofs_pcluster-1 on __kmem_cache_shutdown() Use z_erofs_free_pcluster() directly to free ztailpacking pclusters. Fixes: 7674a42f35ea ("erofs: use struct lockref to replace handcrafted approach") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230822110530.96831-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-23erofs: adapt folios for z_erofs_read_folio()Gao Xiang
It's a straight-forward conversion and no logic changes (except that it renames the corresponding tracepoint.) Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817083942.103303-1-hsiangkao@linux.alibaba.com
2023-08-23erofs: adapt folios for z_erofs_readahead()Gao Xiang
It's a straight-forward conversion except that readahead_folio() will do folio_put() in advance but it doesn't matter since folios are still locked. As before, since file-backed folios (pages for now) are locked, so we could temporarily use folio->private as an internal counter to indicate split parts of each folio for the corresponding pclusters to decompress. When such counter becomes zero, the folio will be finally unlocked (see compress.h and z_erofs_onlinepage_endio()). Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-7-hsiangkao@linux.alibaba.com
2023-08-23erofs: get rid of fe->backmost for cache decompressionGao Xiang
EROFS_MAP_FULL_MAPPED is more accurate to decide if caching the last incomplete pcluster for later read or not. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-6-hsiangkao@linux.alibaba.com
2023-08-23erofs: drop z_erofs_page_mark_eio()Gao Xiang
It can be folded into z_erofs_onlinepage_endio() to simplify the code. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-5-hsiangkao@linux.alibaba.com
2023-08-23erofs: tidy up z_erofs_do_read_page()Gao Xiang
- Fix a typo: spiltted => split; - Move !EROFS_MAP_MAPPED and EROFS_MAP_FRAGMENT upwards; - Increase `split` in advance to avoid unnecessary repeats. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-4-hsiangkao@linux.alibaba.com
2023-08-23erofs: move preparation logic into z_erofs_pcluster_begin()Gao Xiang
Some preparation logic should be part of z_erofs_pcluster_begin() instead of z_erofs_do_read_page(). Let's move now. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-3-hsiangkao@linux.alibaba.com
2023-08-23erofs: avoid obsolete {collector,collection} termsGao Xiang
{collector,collection} were once reserved in order to indicate different runtime logical extent instance of multi-reference pclusters. However, de-duplicated decompression has been landed in a more flexable way, thus `struct z_erofs_collection` was formally removed in commit 87ca34a7065d ("erofs: get rid of `struct z_erofs_collection'"). Let's handle the remaining leftovers, for example: `z_erofs_collector_begin` => `z_erofs_pcluster_begin` `z_erofs_collector_end` => `z_erofs_pcluster_end` as well as some comments. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-2-hsiangkao@linux.alibaba.com
2023-08-23erofs: simplify z_erofs_read_fragment()Gao Xiang
A trivial cleanup to make the fragment handling logic more clear. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230817082813.81180-1-hsiangkao@linux.alibaba.com
2023-08-23erofs: refine warning messages for zdata I/OsFerry Meng
Don't warn users since -EINTR can be returned due to user interruption. Also suppress warning messages of readmore. Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230809060637.21311-1-mengferry@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-08-01erofs: fix wrong primary bvec selection on deduplicated extentsGao Xiang
When handling deduplicated compressed data, there can be multiple decompressed extents pointing to the same compressed data in one shot. In such cases, the bvecs which belong to the longest extent will be selected as the primary bvecs for real decompressors to decode and the other duplicated bvecs will be directly copied from the primary bvecs. Previously, only relative offsets of the longest extent were checked to decompress the primary bvecs. On rare occasions, it can be incorrect if there are several extents with the same start relative offset. As a result, some short bvecs could be selected for decompression and then cause data corruption. For example, as Shijie Sun reported off-list, considering the following extents of a file: 117: 903345.. 915250 | 11905 : 385024.. 389120 | 4096 ... 119: 919729.. 930323 | 10594 : 385024.. 389120 | 4096 ... 124: 968881.. 980786 | 11905 : 385024.. 389120 | 4096 The start relative offset is the same: 2225, but extent 119 (919729.. 930323) is shorter than the others. Let's restrict the bvec length in addition to the start offset if bvecs are not full. Reported-by: Shijie Sun <sunshijie@xiaomi.com> Fixes: 5c2a64252c5d ("erofs: introduce partial-referenced pclusters") Tested-by Shijie Sun <sunshijie@xiaomi.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230719065459.60083-1-hsiangkao@linux.alibaba.com
2023-07-12erofs: avoid infinite loop in z_erofs_do_read_page() when reading beyond EOFChunhai Guo
z_erofs_do_read_page() may loop infinitely due to the inappropriate truncation in the below statement. Since the offset is 64 bits and min_t() truncates the result to 32 bits. The solution is to replace unsigned int with a 64-bit type, such as erofs_off_t. cur = end - min_t(unsigned int, offset + end - map->m_la, end); - For example: - offset = 0x400160000 - end = 0x370 - map->m_la = 0x160370 - offset + end - map->m_la = 0x400000000 - offset + end - map->m_la = 0x00000000 (truncated as unsigned int) - Expected result: - cur = 0 - Actual result: - cur = 0x370 Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230710093410.44071-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-07-12erofs: avoid useless loops in z_erofs_pcluster_readmore() when reading ↵Chunhai Guo
beyond EOF z_erofs_pcluster_readmore() may take a long time to loop when the page offset is large enough, which is unnecessary should be prevented. For example, when the following case is encountered, it will loop 4691368 times, taking about 27 seconds: - offset = 19217289215 - inode_size = 1442672 Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Fixes: 386292919c25 ("erofs: introduce readmore decompression strategy") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230710042531.28761-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-22erofs: Fix detection of atomic contextSandeep Dhavale
Current check for atomic context is not sufficient as z_erofs_decompressqueue_endio can be called under rcu lock from blk_mq_flush_plug_list(). See the stacktrace [1] In such case we should hand off the decompression work for async processing rather than trying to do sync decompression in current context. Patch fixes the detection by checking for rcu_read_lock_any_held() and while at it use more appropriate !in_task() check than in_atomic(). Background: Historically erofs would always schedule a kworker for decompression which would incur the scheduling cost regardless of the context. But z_erofs_decompressqueue_endio() may not always be in atomic context and we could actually benefit from doing the decompression in z_erofs_decompressqueue_endio() if we are in thread context, for example when running with dm-verity. This optimization was later added in patch [2] which has shown improvement in performance benchmarks. ============================================== [1] Problem stacktrace [name:core&]BUG: sleeping function called from invalid context at kernel/locking/mutex.c:291 [name:core&]in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1615, name: CpuMonitorServi [name:core&]preempt_count: 0, expected: 0 [name:core&]RCU nest depth: 1, expected: 0 CPU: 7 PID: 1615 Comm: CpuMonitorServi Tainted: G S W OE 6.1.25-android14-5-maybe-dirty-mainline #1 Hardware name: MT6897 (DT) Call trace: dump_backtrace+0x108/0x15c show_stack+0x20/0x30 dump_stack_lvl+0x6c/0x8c dump_stack+0x20/0x48 __might_resched+0x1fc/0x308 __might_sleep+0x50/0x88 mutex_lock+0x2c/0x110 z_erofs_decompress_queue+0x11c/0xc10 z_erofs_decompress_kickoff+0x110/0x1a4 z_erofs_decompressqueue_endio+0x154/0x180 bio_endio+0x1b0/0x1d8 __dm_io_complete+0x22c/0x280 clone_endio+0xe4/0x280 bio_endio+0x1b0/0x1d8 blk_update_request+0x138/0x3a4 blk_mq_plug_issue_direct+0xd4/0x19c blk_mq_flush_plug_list+0x2b0/0x354 __blk_flush_plug+0x110/0x160 blk_finish_plug+0x30/0x4c read_pages+0x2fc/0x370 page_cache_ra_unbounded+0xa4/0x23c page_cache_ra_order+0x290/0x320 do_sync_mmap_readahead+0x108/0x2c0 filemap_fault+0x19c/0x52c __do_fault+0xc4/0x114 handle_mm_fault+0x5b4/0x1168 do_page_fault+0x338/0x4b4 do_translation_fault+0x40/0x60 do_mem_abort+0x60/0xc8 el0_da+0x4c/0xe0 el0t_64_sync_handler+0xd4/0xfc el0t_64_sync+0x1a0/0x1a4 [2] Link: https://lore.kernel.org/all/20210317035448.13921-1-huangjianan@oppo.com/ Reported-by: Will Shiu <Will.Shiu@mediatek.com> Suggested-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com> Link: https://lore.kernel.org/r/20230621220848.3379029-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18erofs: use poison pointer to replace the hard-coded addressGao Xiang
It's safer and cleaner to replace such hard-coded illegal pointer with poison pointers. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-7-hsiangkao@linux.alibaba.com
2023-06-18erofs: use struct lockref to replace handcrafted approachGao Xiang
Let's avoid the current handcrafted lockref although `struct lockref` inclusion usually increases extra 4 bytes with an explicit spinlock if CONFIG_DEBUG_SPINLOCK is off. Apart from the size difference, note that the meaning of refcount is also changed to active users. IOWs, it doesn't take an extra refcount for XArray tree insertion. I don't observe any significant performance difference at least on our cloud compute server but the new one indeed simplifies the overall codebase a bit. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230529123727.79943-1-hsiangkao@linux.alibaba.com
2023-05-29erofs: adapt managed inode operations into foliosGao Xiang
This patch gets rid of erofs_try_to_free_cached_page() and fold it into .release_folio(). It also moves managed inode operations into zdata.c, which simplifies the code a bit. No logic changes. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-5-hsiangkao@linux.alibaba.com
2023-05-29erofs: kill hooked chains to avoid loops on deduplicated compressed imagesGao Xiang
After heavily stressing EROFS with several images which include a hand-crafted image of repeated patterns for more than 46 days, I found two chains could be linked with each other almost simultaneously and form a loop so that the entire loop won't be submitted. As a consequence, the corresponding file pages will remain locked forever. It can be _only_ observed on data-deduplicated compressed images. For example, consider two chains with five pclusters in total: Chain 1: 2->3->4->5 -- The tail pcluster is 5; Chain 2: 5->1->2 -- The tail pcluster is 2. Chain 2 could link to Chain 1 with pcluster 5; and Chain 1 could link to Chain 2 at the same time with pcluster 2. Since hooked chains are all linked locklessly now, I have no idea how to simply avoid the race. Instead, let's avoid hooked chains completely until I could work out a proper way to fix this and end users finally tell us that it's needed to add it back. Actually, this optimization can be found with multi-threaded workloads (especially even more often on deduplicated compressed images), yet I'm not sure about the overall system impacts of not having this compared with implementation complexity. Fixes: 267f2492c8f7 ("erofs: introduce multi-reference pclusters (fully-referenced)") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-4-hsiangkao@linux.alibaba.com
2023-05-29erofs: avoid on-stack pagepool directly passed by argumentsGao Xiang
On-stack pagepool is used so that short-lived temporary pages could be shared within a single I/O request (e.g. among multiple pclusters). Moving the remaining frontend-related uses into z_erofs_decompress_frontend to avoid too many arguments. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-3-hsiangkao@linux.alibaba.com
2023-05-29erofs: allocate extra bvec pages directly instead of retryingGao Xiang
If non-bootstrap bvecs cannot be kept in place (very rarely), an extra short-lived page is allocated. Let's just allocate it immediately rather than do unnecessary -EAGAIN return first and retry as a cleanup. Also it's unnecessary to use __GFP_NOFAIL here since we could gracefully fail out this case instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Link: https://lore.kernel.org/r/20230526201459.128169-2-hsiangkao@linux.alibaba.com
2023-05-29erofs: clean up z_erofs_pcluster_readmore()Yue Hu
`end` parameter is no needed since it's pointless for !backmost, we can handle it with backmost internally. And we only expand the trailing edge, so the newstart can be replaced with ->headoffset. Also, remove linux/prefetch.h inclusion since that is not used anymore after commit 386292919c25 ("erofs: introduce readmore decompression strategy"). Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230525072605.17857-1-zbestahu@gmail.com [ Gao Xiang: update commit description. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-29erofs: remove the member readahead from struct z_erofs_decompress_frontendYue Hu
The struct member is only used to add REQ_RAHEAD during I/O submission. So it is cleaner to pass it as a parameter than keep it in the struct. Also, rename function z_erofs_get_sync_decompress_policy() to z_erofs_is_sync_decompress() for better clarity and conciseness. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230524063944.1655-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-29erofs: fold in z_erofs_decompress()Yue Hu
No need this helper since it's just a simple wrapper for decompress method and only one caller. So, let's fold in directly instead. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230426084449.12781-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-23erofs: use HIPRI by default if per-cpu kthreads are enabledGao Xiang
As Sandeep shown [1], high priority RT per-cpu kthreads are typically helpful for Android scenarios to minimize the scheduling latencies. Switch EROFS_FS_PCPU_KTHREAD_HIPRI on by default if EROFS_FS_PCPU_KTHREAD is on since it's the typical use cases for EROFS_FS_PCPU_KTHREAD. Also clean up unneeded sched_set_normal(). [1] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230522092141.124290-1-hsiangkao@linux.alibaba.com
2023-04-17erofs: sunset erofs_dbg()Gao Xiang
Such debug messages are rarely used now. Let's get rid of these, and revert locally if they are needed for debugging. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414083027.12307-1-hsiangkao@linux.alibaba.com
2023-04-17erofs: keep meta inode into erofs_bufGao Xiang
So that erofs_read_metadata() can read metadata from other inodes (e.g. packed inode) as well. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-04-17erofs: avoid hardcoded blocksize for subpage block supportJingbo Xu
As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-03-09erofs: Revert "erofs: fix kvcalloc() misuse with __GFP_NOFAIL"Gao Xiang
Let's revert commit 12724ba38992 ("erofs: fix kvcalloc() misuse with __GFP_NOFAIL") since kvmalloc() already supports __GFP_NOFAIL in commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc"). So the original fix was wrong. Actually there was some issue as [1] discussed, so before that mm fix is landed, the warn could still happen but applying this commit first will cause less. [1] https://lore.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com Fixes: 12724ba38992 ("erofs: fix kvcalloc() misuse with __GFP_NOFAIL") Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230309053148.9223-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-02-16erofs: fix an error code in z_erofs_init_zip_subsystem()Dan Carpenter
Return -ENOMEM if alloc_workqueue() fails. Don't return success. Fixes: d8a650adf429 ("erofs: add per-cpu threads for decompression as an option") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/Y+4d0FRsUq8jPoOu@kili Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-02-15erofs: add per-cpu threads for decompression as an optionSandeep Dhavale
Using per-cpu thread pool we can reduce the scheduling latency compared to workqueue implementation. With this patch scheduling latency and variation is reduced as per-cpu threads are high priority kthread_workers. The results were evaluated on arm64 Android devices running 5.10 kernel. The table below shows resulting improvements of total scheduling latency for the same app launch benchmark runs with 50 iterations. Scheduling latency is the latency between when the task (workqueue kworker vs kthread_worker) became eligible to run to when it actually started running. +-------------------------+-----------+----------------+---------+ | | workqueue | kthread_worker | diff | +-------------------------+-----------+----------------+---------+ | Average (us) | 15253 | 2914 | -80.89% | | Median (us) | 14001 | 2912 | -79.20% | | Minimum (us) | 3117 | 1027 | -67.05% | | Maximum (us) | 30170 | 3805 | -87.39% | | Standard deviation (us) | 7166 | 359 | | +-------------------------+-----------+----------------+---------+ Background: Boot times and cold app launch benchmarks are very important to the Android ecosystem as they directly translate to responsiveness from user point of view. While EROFS provides a lot of important features like space savings, we saw some performance penalty in cold app launch benchmarks in few scenarios. Analysis showed that the significant variance was coming from the scheduling cost while decompression cost was more or less the same. Having per-cpu thread pool we can see from the above table that this variation is reduced by ~80% on average. This problem was discussed at LPC 2022. Link to LPC 2022 slides and talk at [1] [1] https://lpc.events/event/16/contributions/1338/ [ Gao Xiang: At least, we have to add this until WQ_UNBOUND workqueue issue [2] on many arm64 devices is resolved. ] [2] https://lore.kernel.org/r/CAJkfWY490-m6wNubkxiTPsW59sfsQs37Wey279LmiRxKt7aQYg@mail.gmail.com Signed-off-by: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230208093322.75816-1-hsiangkao@linux.alibaba.com
2023-02-15erofs: move zdata.h into zdata.cGao Xiang
Definitions in zdata.h are only used in zdata.c and for internal use only. No logic changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-4-hsiangkao@linux.alibaba.com
2023-02-15erofs: remove tagged pointer helpersGao Xiang
Just open-code the remaining one to simplify the code. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-3-hsiangkao@linux.alibaba.com
2023-02-15erofs: avoid tagged pointers to mark sync decompressionGao Xiang
We could just use a boolean in z_erofs_decompressqueue for sync decompression to simplify the code. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230204093040.97967-2-hsiangkao@linux.alibaba.com
2023-01-10erofs: fix kvcalloc() misuse with __GFP_NOFAILGao Xiang
As reported by syzbot [1], kvcalloc() cannot work with __GFP_NOFAIL. Let's use kcalloc() instead. [1] https://lore.kernel.org/r/0000000000007796bd05f1852ec2@google.com Reported-by: syzbot+c3729cda01706a04fb98@syzkaller.appspotmail.com Fixes: fe3e5914e6dc ("erofs: try to leave (de)compressed_pages on stack if possible") Fixes: 4f05687fd703 ("erofs: introduce struct z_erofs_decompress_backend") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230110074927.41651-1-hsiangkao@linux.alibaba.com
2022-12-07erofs: Fix pcluster memleak when its block address is zeroChen Zhongjin
syzkaller reported a memleak: https://syzkaller.appspot.com/bug?id=62f37ff612f0021641eda5b17f056f1668aa9aed unreferenced object 0xffff88811009c7f8 (size 136): ... backtrace: [<ffffffff821db19b>] z_erofs_do_read_page+0x99b/0x1740 [<ffffffff821dee9e>] z_erofs_readahead+0x24e/0x580 [<ffffffff814bc0d6>] read_pages+0x86/0x3d0 ... syzkaller constructed a case: in z_erofs_register_pcluster(), ztailpacking = false and map->m_pa = zero. This makes pcl->obj.index be zero although pcl is not a inline pcluster. Then following path adds refcount for grp, but the refcount won't be put because pcl is inline. z_erofs_readahead() z_erofs_do_read_page() # for another page z_erofs_collector_begin() erofs_find_workgroup() erofs_workgroup_get() Since it's illegal for the block address of a non-inlined pcluster to be zero, add check here to avoid registering the pcluster which would be leaked. Fixes: cecf864d3d76 ("erofs: support inline data decompression") Reported-by: syzbot+6f8cd9a0155b366d227f@syzkaller.appspotmail.com Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/Y42Kz6sVkf+XqJRB@debian Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-12-07erofs: clean up cached I/O strategiesGao Xiang
After commit 4c7e42552b3a ("erofs: remove useless cache strategy of DELAYEDALLOC"), only one cached I/O allocation strategy is supported: When cached I/O is preferred, page allocation is applied without direct reclaim. If allocation fails, fall back to inplace I/O. Let's get rid of z_erofs_cache_alloctype. No logical changes. Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Yue Hu <huyue2@coolpad.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221206060352.152830-1-xiang@kernel.org
2022-11-15Merge tag 'erofs-for-6.1-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Most patches randomly fix error paths or corner cases in fscache mode reported recently. One fixes an invalid access relating to fragments on crafted images. Summary: - Fix packed_inode invalid access when reading fragments on crafted images - Add a missing erofs_put_metabuf() in an error path in fscache mode - Fix incorrect `count' for unmapped extents in fscache mode - Fix use-after-free of fsid and domain_id string when remounting - Fix missing xas_retry() in fscache mode" * tag 'erofs-for-6.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix missing xas_retry() in fscache mode erofs: fix use-after-free of fsid and domain_id string erofs: get correct count for unmapped range in fscache mode erofs: put metabuf in error path in fscache mode erofs: fix general protection fault when reading fragment
2022-11-08fs: fix leaked psi pressure stateJohannes Weiner
When psi annotations were added to to btrfs compression reads, the psi state tracking over add_ra_bio_pages and btrfs_submit_compressed_read was faulty. A pressure state, once entered, is never left. This results in incorrectly elevated pressure, which triggers OOM kills. pflags record the *previous* memstall state when we enter a new one. The code tried to initialize pflags to 1, and then optimize the leave call when we either didn't enter a memstall, or were already inside a nested stall. However, there can be multiple PageWorkingset pages in the bio, at which point it's that path itself that enters repeatedly and overwrites pflags. This causes us to miss the exit. Enter the stall only once if needed, then unwind correctly. erofs has the same problem, fix that up too. And move the memstall exit past submit_bio() to restore submit accounting originally added by b8e24a9300b0 ("block: annotate refault stalls from IO submission"). Link: https://lkml.kernel.org/r/Y2UHRqthNUwuIQGS@cmpxchg.org Fixes: 4088a47e78f9 ("btrfs: add manual PSI accounting for compressed reads") Fixes: 99486c511f68 ("erofs: add manual PSI accounting for the compressed address space") Fixes: 118f3663fbc6 ("block: remove PSI accounting from the bio layer") Link: https://lore.kernel.org/r/d20a0a85-e415-cf78-27f9-77dd7a94bc8d@leemhuis.info/ Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08erofs: fix general protection fault when reading fragmentYue Hu
As syzbot reported [1], the fragment feature sb flag is not set, so packed_inode != NULL needs to be checked in z_erofs_read_fragment(). [1] https://lore.kernel.org/all/0000000000002e7a8905eb841ddd@google.com/ Reported-by: syzbot+3faecbfd845a895c04cb@syzkaller.appspotmail.com Fixes: b15b2e307c3a ("erofs: support on-disk compressed fragments data") Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20221021085325.25788-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-10-17erofs: fix up inplace decompression success rateGao Xiang
Partial decompression should be checked after updating length. It's a new regression when introducing multi-reference pclusters. Fixes: 2bfab9c0edac ("erofs: record the longest decompressed size in this round") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20221014064915.8103-1-hsiangkao@linux.alibaba.com