summaryrefslogtreecommitdiff
path: root/fs/erofs
AgeCommit message (Collapse)Author
2021-11-23erofs: fix deadlock when shrink erofs slabHuang Jianan
We observed the following deadlock in the stress test under low memory scenario: Thread A Thread B - erofs_shrink_scan - erofs_try_to_release_workgroup - erofs_workgroup_try_to_freeze -- A - z_erofs_do_read_page - z_erofs_collection_begin - z_erofs_register_collection - erofs_insert_workgroup - xa_lock(&sbi->managed_pslots) -- B - erofs_workgroup_get - erofs_wait_on_workgroup_freezed -- A - xa_erase - xa_lock(&sbi->managed_pslots) -- B To fix this, it needs to hold xa_lock before freezing the workgroup since xarray will be touched then. So let's hold the lock before accessing each workgroup, just like what we did with the radix tree before. [ Gao Xiang: Jianhua Hao also reports this issue at https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ] Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com Fixes: 64094a04414f ("erofs: convert workstn to XArray") Reviewed-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Reported-by: Jianhua Hao <haojianhua1@xiaomi.com> Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-11-13Merge tag 'erofs-for-5.16-rc1-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: - fix unsafe pagevec reuse which could cause unexpected behaviors - get rid of the unused DELAYEDALLOC strategy that has been replaced by TRYALLOC * tag 'erofs-for-5.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: remove useless cache strategy of DELAYEDALLOC erofs: fix unsafe pagevec reuse of hooked pclusters
2021-11-08erofs: remove useless cache strategy of DELAYEDALLOCYue Hu
After commit 1825c8d7ce93 ("erofs: force inplace I/O under low memory scenario") and TRYALLOC is widely used, DELAYEDALLOC won't be used anymore. Remove related dead code. Also, remove the blank line at the end of zdata.h. Link: https://lore.kernel.org/r/20211106082315.25781-1-huyue2@yulong.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-11-08erofs: fix unsafe pagevec reuse of hooked pclustersGao Xiang
There are pclusters in runtime marked with Z_EROFS_PCLUSTER_TAIL before actual I/O submission. Thus, the decompression chain can be extended if the following pcluster chain hooks such tail pcluster. As the related comment mentioned, if some page is made of a hooked pcluster and another followed pcluster, it can be reused for in-place I/O (since I/O should be submitted anyway): _______________________________________________________________ | tail (partial) page | head (partial) page | |_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________| However, it's by no means safe to reuse as pagevec since if such PRIMARY_HOOKED pclusters finally move into bypass chain without I/O submission. It's somewhat hard to reproduce with LZ4 and I just found it (general protection fault) by ro_fsstressing a LZMA image for long time. I'm going to actively clean up related code together with multi-page folio adaption in the next few months. Let's address it directly for easier backporting for now. Call trace for reference: z_erofs_decompress_pcluster+0x10a/0x8a0 [erofs] z_erofs_decompress_queue.isra.36+0x3c/0x60 [erofs] z_erofs_runqueue+0x5f3/0x840 [erofs] z_erofs_readahead+0x1e8/0x320 [erofs] read_pages+0x91/0x270 page_cache_ra_unbounded+0x18b/0x240 filemap_get_pages+0x10a/0x5f0 filemap_read+0xa9/0x330 new_sync_read+0x11b/0x1a0 vfs_read+0xf1/0x190 Link: https://lore.kernel.org/r/20211103182006.4040-1-xiang@kernel.org Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Cc: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-11-02Merge tag 'gfs2-v5.15-rc5-mmap-fault' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher: "Functions gfs2_file_read_iter and gfs2_file_write_iter are both accessing the user buffer to write to or read from while holding the inode glock. In the most basic deadlock scenario, that buffer will not be resident and it will be mapped to the same file. Accessing the buffer will trigger a page fault, and gfs2 will deadlock trying to take the same inode glock again while trying to handle that fault. Fix that and similar, more complex scenarios by disabling page faults while accessing user buffers. To make this work, introduce a small amount of new infrastructure and fix some bugs that didn't trigger so far, with page faults enabled" * tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix mmap + page fault deadlocks for direct I/O iov_iter: Introduce nofault flag to disable page faults gup: Introduce FOLL_NOFAULT flag to disable page faults iomap: Add done_before argument to iomap_dio_rw iomap: Support partial direct I/O on user copy failures iomap: Fix iomap_dio_rw return value for user copies gfs2: Fix mmap + page fault deadlocks for buffered I/O gfs2: Eliminate ip->i_gh gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Introduce flag for glock holder auto-demotion gfs2: Clean up function may_grant gfs2: Add wrapper for iomap_file_buffered_write iov_iter: Introduce fault_in_iov_iter_writeable iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} powerpc/kvm: Fix kvm_use_magic_page iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
2021-10-31erofs: don't trigger WARN() when decompression failsGao Xiang
syzbot reported a WARNING [1] due to corrupted compressed data. As Dmitry said, "If this is not a kernel bug, then the code should not use WARN. WARN if for kernel bugs and is recognized as such by all testing systems and humans." [1] https://lore.kernel.org/r/000000000000b3586105cf0ff45e@google.com Link: https://lore.kernel.org/r/20211025074311.130395-1-hsiangkao@linux.alibaba.com Cc: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Reported-by: syzbot+d8aaffc3719597e8cfb4@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-25erofs: get rid of ->lru usageGao Xiang
Currently, ->lru is a way to arrange non-LRU pages and has some in-kernel users. In order to minimize noticable issues of page reclaim and cache thrashing under high memory presure, limited temporary pages were all chained with ->lru and can be reused during the request. However, it seems that ->lru could be removed when folio is landing. Let's use page->private to chain temporary pages for now instead and transform EROFS formally after the topic of the folio / file page design is finalized. Link: https://lore.kernel.org/r/20211022090120.14675-1-hsiangkao@linux.alibaba.com Cc: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-24iomap: Add done_before argument to iomap_dio_rwAndreas Gruenbacher
Add a done_before argument to iomap_dio_rw that indicates how much of the request has already been transferred. When the request succeeds, we report that done_before additional bytes were tranferred. This is useful for finishing a request asynchronously when part of the request has already been completed synchronously. We'll use that to allow iomap_dio_rw to be used with page faults disabled: when a page fault occurs while submitting a request, we synchronously complete the part of the request that has already been submitted. The caller can then take care of the page fault and call iomap_dio_rw again for the rest of the request, passing in the number of bytes already tranferred. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-19erofs: lzma compression supportGao Xiang
Add MicroLZMA support in order to maximize compression ratios for specific scenarios. For example, it's useful for low-end embedded boards and as a secondary algorithm in a file for specific access patterns. MicroLZMA is a new container format for raw LZMA1, which was created by Lasse Collin aiming to minimize old LZMA headers and get rid of unnecessary EOPM (end of payload marker) as well as to enable fixed-sized output compression, especially for 4KiB pclusters. Similar to LZ4, inplace I/O approach is used to minimize runtime memory footprint when dealing with I/O. Overlapped decompression is handled with 1) bounced buffer for data under processing or 2) extra short-lived pages from the on-stack pagepool which will be shared in the same read request (128KiB for example). Link: https://lore.kernel.org/r/20211010213145.17462-8-xiang@kernel.org Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19erofs: rename some generic methods in decompressorGao Xiang
Previously, some LZ4 methods were named with `generic'. However, while evaluating the effective LZMA approach, it seems they aren't quite generic at all (e.g. no need preparing dstpages for most LZMA cases.) Avoid such naming instead. Link: https://lore.kernel.org/r/20211010213145.17462-7-xiang@kernel.org Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19erofs: introduce readmore decompression strategyGao Xiang
Previously, the readahead window was strictly followed by EROFS decompression strategy in order to minimize extra memory footprint. However, it could become inefficient if just reading the partial requested data for much big LZ4 pclusters and the upcoming LZMA implementation. Let's try to request the leading data in a pcluster without triggering memory reclaiming instead for the LZ4 approach first to boost up 100% randread of large big pclusters, and it has no real impact on low memory scenarios. It also introduces a way to expand read lengths in order to decompress the whole pcluster, which is useful for LZMA since the algorithm itself is relatively slow and causes CPU bound, but LZ4 is not. Link: https://lore.kernel.org/r/20211008200839.24541-4-xiang@kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-19erofs: introduce the secondary compression headGao Xiang
Previously, for each HEAD lcluster, it can be either HEAD or PLAIN lcluster to indicate whether the whole pcluster is compressed or not. In this patch, a new HEAD2 head type is introduced to specify another compression algorithm other than the primary algorithm for each compressed file, which can be used for upcoming LZMA compression and LZ4 range dictionary compression for various data patterns. It has been stayed in the EROFS roadmap for years. Complete it now! Link: https://lore.kernel.org/r/20211017165721.2442-1-xiang@kernel.org Reviewed-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-18erofs: get compression algorithms directly on mappingGao Xiang
Currently, z_erofs_map_blocks_iter() returns whether extents are compressed or not, and the decompression frontend gets the specific algorithms then. It works but not quite well in many aspests, for example: - The decompression frontend has to deal with whether extents are compressed or not again and lookup the algorithms if compressed. It's duplicated and too detailed about the on-disk mapping. - A new secondary compression head will be introduced later so that each file can have 2 compression algorithms at most for different type of data. It could increase the complexity of the decompression frontend if still handled in this way; - A new readmore decompression strategy will be introduced to get better performance for much bigger pcluster and lzma, which needs the specific algorithm in advance as well. Let's look up compression algorithms in z_erofs_map_blocks_iter() directly instead. Link: https://lore.kernel.org/r/20211008200839.24541-2-xiang@kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Reviewed-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-18erofs: add multiple device supportGao Xiang
In order to support multi-layer container images, add multiple device feature to EROFS. Two ways are available to use for now: - Devices can be mapped into 32-bit global block address space; - Device ID can be specified with the chunk indexes format. Note that it assumes no extent would cross device boundary and mkfs should take care of it seriously. In the future, a dedicated device manager could be introduced then thus extra devices can be automatically scanned by UUID as well. Link: https://lore.kernel.org/r/20211014081010.43485-1-hsiangkao@linux.alibaba.com Reviewed-by: Chao Yu <chao@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-17erofs: decouple basic mount options from fs_contextGao Xiang
Previously, EROFS mount options are all in the basic types, so erofs_fs_context can be directly copied with assignment. However, when the multiple device feature is introduced, it's hard to handle multiple device information like the other basic mount options. Let's separate basic mount option usage from fs_context, thus multiple device information can be handled gracefully then. No logic changes. Link: https://lore.kernel.org/r/20211007070224.12833-1-hsiangkao@linux.alibaba.com Reviewed-by: Chao Yu <chao@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-15erofs: remove the fast path of per-CPU buffer decompressionYue Hu
As Xiang mentioned, such path has no real impact to our current decompression strategy, remove it directly. Also, update the return value of z_erofs_lz4_decompress() to 0 if success to keep consistent with LZMA which will return 0 as well for that case. Link: https://lore.kernel.org/r/20211014065744.1787-1-zbestahu@gmail.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-09-23erofs: clear compacted_2b if compacted_4b_initial > totalidxYue Hu
Currently, the whole indexes will only be compacted 4B if compacted_4b_initial > totalidx. So, the calculated compacted_2b is worthless for that case. It may waste CPU resources. No need to update compacted_4b_initial as mkfs since it's used to fulfill the alignment of the 1st compacted_2b pack and would handle the case above. We also need to clarify compacted_4b_end here. It's used for the last lclusters which aren't fitted in the previous compacted_2b packs. Some messages are from Xiang. Link: https://lore.kernel.org/r/20210914035915.1190-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> [ Gao Xiang: it's enough to use "compacted_4b_initial < totalidx". ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-09-23erofs: fix misbehavior of unsupported chunk format checkGao Xiang
Unsupported chunk format should be checked with "if (vi->chunkformat & ~EROFS_CHUNK_FORMAT_ALL)" Found when checking with 4k-byte blockmap (although currently mkfs uses inode chunk indexes format by default.) Link: https://lore.kernel.org/r/20210922095141.233938-1-hsiangkao@linux.alibaba.com Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files") Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-09-09Merge tag 'libnvdimm-for-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Fix a race condition in the teardown path of raw mode pmem namespaces. - Cleanup the code that filesystems use to detect filesystem-dax capabilities of their underlying block device. * tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: remove bdev_dax_supported xfs: factor out a xfs_buftarg_is_dax helper dax: stub out dax_supported for !CONFIG_FS_DAX dax: remove __generic_fsdax_supported dax: move the dax_read_lock() locking into dax_supported dax: mark dax_get_by_host static dm: use fs_dax_get_by_bdev instead of dax_get_by_host dax: stop using bdevname fsdax: improve the FS_DAX Kconfig description and help text libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind
2021-09-02Merge tag 'ovl-update-5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Copy up immutable/append/sync/noatime attributes (Amir Goldstein) - Improve performance by enabling RCU lookup. - Misc fixes and improvements The reason this touches so many files is that the ->get_acl() method now gets a "bool rcu" argument. The ->get_acl() API was updated based on comments from Al and Linus: Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/ * tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: enable RCU'd ->get_acl() vfs: add rcu argument to ->get_acl() callback ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup() ovl: use kvalloc in xattr copy-up ovl: update ctime when changing fileattr ovl: skip checking lower file's i_writecount on truncate ovl: relax lookup error on mismatch origin ftype ovl: do not set overlay.opaque for new directories ovl: add ovl_allow_offline_changes() helper ovl: disable decoding null uuid with redirect_dir ovl: consistent behavior for immutable/append-only inodes ovl: copy up sync/noatime fileattr flags ovl: pass ovl_fs to ovl_check_setxattr() fs: add generic helper for filling statx attribute flags
2021-08-25erofs: fix double free of 'copied'Gao Xiang
Dan reported a new smatch warning [1] "fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'" Due to new chunk-based format handling logic, the error path can be called after kfree(copied). Set "copied = NULL" after kfree(copied) to fix this. [1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-20erofs: support reading chunk-based uncompressed filesGao Xiang
Add runtime support for chunk-based uncompressed files described in the previous patch. Link: https://lore.kernel.org/r/20210820100019.208490-2-hsiangkao@linux.alibaba.com Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-20erofs: introduce chunk-based file on-disk formatGao Xiang
Currently, uncompressed data except for tail-packing inline is consecutive on disk. In order to support chunk-based data deduplication, add a new corresponding inode data layout. In the future, the data source of chunks can be either (un)compressed. Link: https://lore.kernel.org/r/20210820100019.208490-1-hsiangkao@linux.alibaba.com Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-18vfs: add rcu argument to ->get_acl() callbackMiklos Szeredi
Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-19erofs: add fiemap support with iomapGao Xiang
This adds fiemap support for both uncompressed files and compressed files by using iomap infrastructure. Link: https://lore.kernel.org/r/20210813052931.203280-3-hsiangkao@linux.alibaba.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-19erofs: add support for the full decompressed lengthGao Xiang
Previously, there is no need to get the full decompressed length since EROFS supports partial decompression. However for some other cases such as fiemap, the full decompressed length is necessary for iomap to make it work properly. This patch adds a way to get the full decompressed length. Note that it takes more metadata overhead and it'd be avoided if possible in the performance sensitive scenario. Link: https://lore.kernel.org/r/20210818152231.243691-1-hsiangkao@linux.alibaba.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-11erofs: remove the mapping parameter from erofs_try_to_free_cached_page()Yue Hu
The mapping is not used at all, remove it and update related code. Link: https://lore.kernel.org/r/20210810072416.1392-1-zbestahu@gmail.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-11erofs: directly use wrapper erofs_page_is_managed() when shrinkingYue Hu
We already have the wrapper function to identify managed page. Link: https://lore.kernel.org/r/20210810065450.1320-1-zbestahu@gmail.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-10erofs: convert all uncompressed cases to iomapGao Xiang
Since tail-packing inline has been supported by iomap now, let's convert all EROFS uncompressed data I/O to iomap, which is pretty straight-forward. Link: https://lore.kernel.org/r/20210805003601.183063-4-hsiangkao@linux.alibaba.com Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-10erofs: dax support for non-tailpacking regular fileGao Xiang
DAX is quite useful for some VM use cases in order to save guest memory extremely with minimal lightweight EROFS. In order to prepare for such use cases, add preliminary dax support for non-tailpacking regular files for now. Tested with the DRAM-emulated PMEM and the EROFS image generated by "mkfs.erofs -Enoinline_data enwik9.fsdax.img enwik9" Link: https://lore.kernel.org/r/20210805003601.183063-3-hsiangkao@linux.alibaba.com Cc: nvdimm@lists.linux.dev Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-08-10erofs: iomap support for non-tailpacking DIOHuang Jianan
Add iomap support for non-tailpacking uncompressed data in order to support DIO and DAX. Direct I/O is useful in certain scenarios for uncompressed files. For example, double pagecache can be avoid by direct I/O when loop device is used for uncompressed files containing upper layer compressed filesystem. This adds iomap DIO support for non-tailpacking cases first and tail-packing inline files are handled in the follow-up patch. Link: https://lore.kernel.org/r/20210805003601.183063-2-hsiangkao@linux.alibaba.com Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-06-08erofs: clean up file headers & footersGao Xiang
- Remove my outdated misleading email address; - Get rid of all unnecessary trailing newline by accident. Link: https://lore.kernel.org/r/20210602160634.10757-1-xiang@kernel.org Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-06-08erofs: remove the occupied parameter from z_erofs_pagevec_enqueue()Yue Hu
No any behavior to variable occupied in z_erofs_attach_page() which is only caller to z_erofs_pagevec_enqueue(). Link: https://lore.kernel.org/r/20210419102623.2015-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-06-08erofs: fix error return code in erofs_read_superblock()Wei Yongjun
'ret' will be overwritten to 0 if erofs_sb_has_sb_chksum() return true, thus 0 will return in some error handling cases. Fix to return negative error code -EINVAL instead of 0. Link: https://lore.kernel.org/r/20210519141657.3062715-1-weiyongjun1@huawei.com Fixes: b858a4844cfb ("erofs: support superblock checksum") Cc: stable <stable@vger.kernel.org> # 5.5+ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Gao Xiang <xiang@kernel.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-05-13erofs: fix 1 lcluster-sized pcluster for big pclusterGao Xiang
If the 1st NONHEAD lcluster of a pcluster isn't CBLKCNT lcluster type rather than a HEAD or PLAIN type instead, which means its pclustersize _must_ be 1 lcluster (since its uncompressed size < 2 lclusters), as illustrated below: HEAD HEAD / PLAIN lcluster type ____________ ____________ |_:__________|_________:__| file data (uncompressed) . . .____________. |____________| pcluster data (compressed) Such on-disk case was explained before [1] but missed to be handled properly in the runtime implementation. It can be observed if manually generating 1 lcluster-sized pcluster with 2 lclusters (thus CBLKCNT doesn't exist.) Let's fix it now. [1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org Fixes: cec6e93beadf ("erofs: support parsing big pcluster compress indexes") Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <xiang@kernel.org>
2021-04-10erofs: enable big pcluster featureGao Xiang
Enable COMPR_CFGS and BIG_PCLUSTER since the implementations are all settled properly. Link: https://lore.kernel.org/r/20210407043927.10623-11-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: support decompress big pcluster for lz4 backendGao Xiang
Prior to big pcluster, there was only one compressed page so it'd easy to map this. However, when big pcluster is enabled, more work needs to be done to handle multiple compressed pages. In detail, - (maptype 0) if there is only one compressed page + no need to copy inplace I/O, just map it directly what we did before; - (maptype 1) if there are more compressed pages + no need to copy inplace I/O, vmap such compressed pages instead; - (maptype 2) if inplace I/O needs to be copied, use per-CPU buffers for decompression then. Another thing is how to detect inplace decompression is feasable or not (it's still quite easy for non big pclusters), apart from the inplace margin calculation, inplace I/O page reusing order is also needed to be considered for each compressed page. Currently, if the compressed page is the xth page, it shouldn't be reused as [0 ... nrpages_out - nrpages_in + x], otherwise a full copy will be triggered. Although there are some extra optimization ideas for this, I'd like to make big pcluster work correctly first and obviously it can be further optimized later since it has nothing with the on-disk format at all. Link: https://lore.kernel.org/r/20210407043927.10623-10-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: support parsing big pcluster compact indexesGao Xiang
Different from non-compact indexes, several lclusters are packed as the compact form at once and an unique base blkaddr is stored for each pack, so each lcluster index would take less space on avarage (e.g. 2 bytes for COMPACT_2B.) btw, that is also why BIG_PCLUSTER switch should be consistent for compact head0/1. Prior to big pcluster, the size of all pclusters was 1 lcluster. Therefore, when a new HEAD lcluster was scanned, blkaddr would be bumped by 1 lcluster. However, that way doesn't work anymore for big pcluster since we actually don't know the compressed size of pclusters in advance (before reading CBLKCNT lcluster). So, instead, let blkaddr of each pack be the first pcluster blkaddr with a valid CBLKCNT, in detail, 1) if CBLKCNT starts at the pack, this first valid pcluster is itself, e.g. _____________________________________________________________ |_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ... ^ = blkaddr base ^ += CBLKCNT0 ^ += CBLKCNT1 2) if CBLKCNT doesn't start at the pack, the first valid pcluster is the next pcluster, e.g. _________________________________________________________ | NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ... ^ = blkaddr base ^ += CBLKCNT0 ^ += 1 When a CBLKCNT is found, blkaddr will be increased by CBLKCNT lclusters, or a new HEAD is found immediately, bump blkaddr by 1 instead (see the picture above.) Also noted if CBLKCNT is the end of the pack, instead of storing delta1 (distance of the next HEAD lcluster) as normal NONHEADs, it still uses the compressed block count (delta0) since delta1 can be calculated indirectly but the block count can't. Adjust decoding logic to fit big pcluster compact indexes as well. Link: https://lore.kernel.org/r/20210407043927.10623-9-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: support parsing big pcluster compress indexesGao Xiang
When INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress indexes will also have the same on-disk header compact indexes to keep per-file configurations instead of leaving it zeroed. If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each pcluster in this file by parsing 1st non-head lcluster. Link: https://lore.kernel.org/r/20210407043927.10623-8-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: adjust per-CPU buffers according to max_pclusterblksGao Xiang
Adjust per-CPU buffers on demand since big pcluster definition is available. Also, bail out unsupported pcluster size according to Z_EROFS_PCLUSTER_MAX_SIZE. Link: https://lore.kernel.org/r/20210407043927.10623-7-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: add big physical cluster definitionGao Xiang
Big pcluster indicates the size of compressed data for each physical pcluster is no longer fixed as block size, but could be more than 1 block (more accurately, 1 logical pcluster) When big pcluster feature is enabled for head0/1, delta0 of the 1st non-head lcluster index will keep block count of this pcluster in lcluster size instead of 1. Or, the compressed size of pcluster should be 1 lcluster if pcluster has no non-head lcluster index. Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since it depends on COMPR_CFGS and will be released together. Link: https://lore.kernel.org/r/20210407043927.10623-6-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: fix up inplace I/O pointer for big pclusterGao Xiang
When picking up inplace I/O pages, it should be traversed in reverse order in aligned with the traversal order of file-backed online pages. Also, index should be updated together when preloading compressed pages. Previously, only page-sized pclustersize was supported so no problem at all. Also rename `compressedpages' to `icpage_ptr' to reflect its functionality. Link: https://lore.kernel.org/r/20210407043927.10623-5-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: introduce physical cluster slab poolsGao Xiang
Since multiple pcluster sizes could be used at once, the number of compressed pages will become a variable factor. It's necessary to introduce slab pools rather than a single slab cache now. This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE), and get rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which has no use now. Link: https://lore.kernel.org/r/20210407043927.10623-4-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-10erofs: introduce multipage per-CPU buffersGao Xiang
To deal the with the cases which inplace decompression is infeasible for some inplace I/O. Per-CPU buffers was introduced to get rid of page allocation latency and thrash for low-latency decompression algorithms such as lz4. For the big pcluster feature, introduce multipage per-CPU buffers to keep such inplace I/O pclusters temporarily as well but note that per-CPU pages are just consecutive virtually. When a new big pcluster fs is mounted, its max pclustersize will be read and per-CPU buffers can be growed if needed. Shrinking adjustable per-CPU buffers is more complex (because we don't know if such size is still be used), so currently just release them all when unloading. Link: https://lore.kernel.org/r/20210409190630.19569-1-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-07erofs: reserve physical_clusterbits[]Gao Xiang
Formal big pcluster design is actually more powerful / flexable than the previous thought whose pclustersize was fixed as power-of-2 blocks, which was obviously inefficient and space-wasting. Instead, pclustersize can now be set independently for each pcluster, so various pcluster sizes can also be used together in one file if mkfs wants (for example, according to data type and/or compression ratio). Let's get rid of previous physical_clusterbits[] setting (also notice that corresponding on-disk fields are still 0 for now). Therefore, head1/2 can be used for at most 2 different algorithms in one file and again pclustersize is now independent of these. Link: https://lore.kernel.org/r/20210407043927.10623-2-xiang@kernel.org Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-04-03erofs: Clean up spelling mistakes found in fs/erofsRuiqi Gong
zmap.c: s/correspoinding/corresponding zdata.c: s/endding/ending Link: https://lore.kernel.org/r/20210331093920.31923-1-gongruiqi1@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ruiqi Gong <gongruiqi1@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: add on-disk compression configurationsGao Xiang
Add a bitmap for available compression algorithms and a variable-sized on-disk table for compression options in preparation for upcoming big pcluster and LZMA algorithm, which follows the end of super block. To parse the compression options, the bitmap is scanned one by one. For each available algorithm, there is data followed by 2-byte `length' correspondingly (it's enough for most cases, or entire fs blocks should be used.) With such available algorithm bitmap, kernel itself can also refuse to mount such filesystem if any unsupported compression algorithm exists. Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER. Link: https://lore.kernel.org/r/20210329100012.12980-1-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: introduce on-disk lz4 fs configurationsGao Xiang
Introduce z_erofs_lz4_cfgs to store all lz4 configurations. Currently it's only max_distance, but will be used for new features later. Link: https://lore.kernel.org/r/20210329012308.28743-4-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: support adjust lz4 history window sizeHuang Jianan
lz4 uses LZ4_DISTANCE_MAX to record history preservation. When using rolling decompression, a block with a higher compression ratio will cause a larger memory allocation (up to 64k). It may cause a large resource burden in extreme cases on devices with small memory and a large number of concurrent IOs. So appropriately reducing this value can improve performance. Decreasing this value will reduce the compression ratio (except when input_size <LZ4_DISTANCE_MAX). But considering that erofs currently only supports 4k output, reducing this value will not significantly reduce the compression benefits. The maximum value of LZ4_DISTANCE_MAX defined by lz4 is 64k, and we can only reduce this value. For the old kernel, it just can't reduce the memory allocation during rolling decompression without affecting the decompression result. Link: https://lore.kernel.org/r/20210329012308.28743-3-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> [ Gao Xiang: introduce struct erofs_sb_lz4_info for configurations. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2021-03-29erofs: introduce erofs_sb_has_xxx() helpersGao Xiang
Introduce erofs_sb_has_xxx() to make long checks short, especially for later big pcluster & LZMA features. Link: https://lore.kernel.org/r/20210329012308.28743-2-hsiangkao@aol.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com>