summaryrefslogtreecommitdiff
path: root/fs/erofs
AgeCommit message (Collapse)Author
2025-01-24include/linux/lz4.h: add some missing macrosGao Xiang
Currently, LZ4_DISTANCE_MAX and LZ4_DECOMPRESS_INPLACE_MARGIN are defined in the erofs subsystem for LZ4 in-place decompression, which is somewhat unsuitable since they should belong to the LZ4 itself and may change with future LZ4 codebase updates. Move them to include/linux/lz4.h to match the upstream LZ4 library [1]. No logic changes. [1] https://github.com/lz4/lz4/blob/v1.10.0/lib/lz4.h#L670 Link: https://lkml.kernel.org/r/20250114130454.1191150-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Yann Collet <yann.collet.73@gmail.com> Cc: Nick Terrell <terrelln@fb.com> Cc: Chao Yu <chao@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Cc; Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-23erofs: refine z_erofs_get_extent_compressedlen()Gao Xiang
- Set `compressedblks = 1` directly for non-bigpcluster cases. This simplifies the logic a bit since lcluster sizes larger than one block are unsupported and the details remain unclear. - For Z_EROFS_LCLUSTER_TYPE_PLAIN pclusters, avoid assuming `compressedblks = 1` by default. Instead, check if Z_EROFS_ADVISE_BIG_PCLUSTER_2 is set. It basically has no impact to existing valid images, but it's useful to find the gap to prepare for large PLAIN pclusters. Link: https://lore.kernel.org/r/20250123090109.973463-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-01-21Merge tag 'kthread-for-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-20Merge tag 'vfs-6.14-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support caching symlink lengths in inodes The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4 - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP - Enable VBOXGUEST and VBOXSF_FS on ARM64 Now that VirtualBox is able to run as a host on arm64 (e.g. the Apple M3 processors) we can enable VBOXSF_FS (and in turn VBOXGUEST) for this architecture. Tested with various runs of bonnie++ and dbench on an Apple MacBook Pro with the latest Virtualbox 7.1.4 r165100 installed Cleanups: - Delay sysctl_nr_open check in expand_files() - Use kernel-doc includes in fiemap docbook - Use page->private instead of page->index in watch_queue - Use a consume fence in mnt_idmap() as it's heavily used in link_path_walk() - Replace magic number 7 with ARRAY_SIZE() in fc_log - Sort out a stale comment about races between fd alloc and dup2() - Fix return type of do_mount() from long to int - Various cosmetic cleanups for the lockref code Fixes: - Annotate spinning as unlikely() in __read_seqcount_begin The annotation already used to be there, but got lost in commit 52ac39e5db51 ("seqlock: seqcount_t: Implement all read APIs as statement expressions") - Fix proc_handler for sysctl_nr_open - Flush delayed work in delayed fput() - Fix grammar and spelling in propagate_umount() - Fix ESP not readable during coredump In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value However, at the moment, kstkesp is zero even during coredump - Don't wake up the writer if the pipe is still full - Fix unbalanced user_access_end() in select code" * tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) gfs2: use lockref_init for qd_lockref erofs: use lockref_init for pcl->lockref dcache: use lockref_init for d_lockref lockref: add a lockref_init helper lockref: drop superfluous externs lockref: use bool for false/true returns lockref: improve the lockref_get_not_zero description lockref: remove lockref_put_not_zero fs: Fix return type of do_mount() from long to int select: Fix unbalanced user_access_end() vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64 pipe_read: don't wake up the writer if the pipe is still full selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag fs: sort out a stale comment about races between fd alloc and dup2 fs: Fix grammar and spelling in propagate_umount() fs: fc_log replace magic number 7 with ARRAY_SIZE() fs: use a consume fence in mnt_idmap() file: flush delayed work in delayed fput() ...
2025-01-19erofs: remove dead code in erofs_fc_parse_paramChen Linxuan
If an option is unknown to erofs, which means that option is not in `erofs_fs_parameters`, `fs_parse` will return -ENOPARAM, which makes `erofs_fc_parse_param` returns earlier. Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/DB86A4E2BB2BB44E+20250117100635.335963-2-chenlinxuan@uniontech.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-01-17erofs: return SHRINK_EMPTY if no objects to freeChen Linxuan
Comments in file include/linux/shrinker.h says that `count_objects` of `struct shrinker` should return SHRINK_EMPTY when there are no objects to free. > If there are no objects to free, it should return SHRINK_EMPTY, > while 0 is returned in cases of the number of freeable items cannot > be determined or shrinker should skip this cache for this time > (e.g., their number is below shrinkable limit). Signed-off-by: Chen Linxuan <chenlinxuan@uniontech.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/149E6E64B5B6B5E8+20250116083303.199817-1-chenlinxuan@uniontech.com [ Gao Xiang: should have no impact since it's not memcg-aware. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-01-17erofs: convert z_erofs_bind_cache() to foliosGao Xiang
The managed cache uses a pseudo inode to keep (necessary) compressed data. Currently, it still uses zero-order folios, so this is just a trivial conversion, except that the use of the pagepool is temporarily dropped. Drop some obsoleted comments too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-4-hsiangkao@linux.alibaba.com
2025-01-17erofs: tidy up zdata.cGao Xiang
All small code style adjustments, no logic changes: - z_erofs_decompress_frontend => z_erofs_frontend; - z_erofs_decompress_backend => z_erofs_backend; - Use Z_EROFS_DEFINE_FRONTEND() to replace DECOMPRESS_FRONTEND_INIT(); - `nr_folios` should be `nrpages` in z_erofs_readahead(); - Refine in-line comments. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-3-hsiangkao@linux.alibaba.com
2025-01-17erofs: get rid of `z_erofs_next_pcluster_t`Gao Xiang
It was originally intended for tagged pointer reservation. Now all encoded data can be represented uniformally with `struct z_erofs_pcluster` as described in commit bf1aa03980f4 ("erofs: sunset `struct erofs_workgroup`"), let's drop it too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-2-hsiangkao@linux.alibaba.com
2025-01-17erofs: simplify z_erofs_load_compact_lcluster()Gao Xiang
- Get rid of unpack_compacted_index() and fold it into z_erofs_load_compact_lcluster(); - Avoid a goto. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-1-hsiangkao@linux.alibaba.com
2025-01-17erofs: fix potential return value overflow of z_erofs_shrink_scan()Gao Xiang
z_erofs_shrink_scan() could return small numbers due to the mistyped `freed`. Although I don't think it has any visible impact. Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114040058.459981-1-hsiangkao@linux.alibaba.com
2025-01-17erofs: shorten bvecs[] for file-backed mountsGao Xiang
BIO_MAX_VECS is too large for __GFP_NOFAIL allocation. We could use a mempool (since BIOs can always proceed), but it seems overly complicated for now. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250107082825.74242-1-hsiangkao@linux.alibaba.com
2025-01-17erofs: micro-optimize superblock checksumGao Xiang
Just verify the remaining unknown on-disk data instead of allocating a temporary buffer for the whole superblock and zeroing out the checksum field since .magic(EROFS_SUPER_MAGIC_V1) is verified and .checksum(0) is fixed. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212023948.1143038-1-hsiangkao@linux.alibaba.com
2025-01-17fs: erofs: xattr.c change kzalloc to kcallocEthan Carter Edwards
Refactor xattr.c to use kcalloc instead of kzalloc when multiplying allocation size by count. This refactor prevents unintentional memory overflows. Discovered by checkpatch.pl. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Link: https://lore.kernel.org/r/i3CLJhMELKzBJr3DaRyv-hP_4m-3Twx0sgBWXW6naZlMtHrIeWr93xOFshX8qZHDrJeSjHMTiUOh8JmBZ9v0AB-S1lIYM_d-vasSRlsF_s4=@ethancedwards.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-01-16erofs: use lockref_init for pcl->lockrefChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250115094702.504610-8-hch@lst.de Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-08treewide: Introduce kthread_run_worker[_on_cpu]()Frederic Weisbecker
kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() ↵Frederic Weisbecker
automatic format kthread_create_on_cpu() uses the CPU argument as an implicit and unique printf argument to add to the format whereas kthread_create_worker_on_cpu() still relies on explicitly passing the printf arguments. This difference in behaviour is error prone and doesn't help standardizing per-CPU kthread names. Unify the behaviours and convert kthread_create_worker_on_cpu() to use the printf behaviour of kthread_create_on_cpu(). Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-12-16erofs: use buffered I/O for file-backed mounts by defaultGao Xiang
For many use cases (e.g. container images are just fetched from remote), performance will be impacted if underlay page cache is up-to-date but direct i/o flushes dirty pages first. Instead, let's use buffered I/O by default to keep in sync with loop devices and add a (re)mount option to explicitly give a try to use direct I/O if supported by the underlying files. The container startup time is improved as below: [workload] docker.io/library/workpress:latest unpack 1st run non-1st runs EROFS snapshotter buffered I/O file 4.586404265s 0.308s 0.198s EROFS snapshotter direct I/O file 4.581742849s 2.238s 0.222s EROFS snapshotter loop 4.596023152s 0.346s 0.201s Overlayfs snapshotter 5.382851037s 0.206s 0.214s Fixes: fb176750266a ("erofs: add file-backed mount support") Cc: Derek McGowan <derek@mcg.dev> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
2024-12-16erofs: reference `struct erofs_device_info` for erofs_map_devGao Xiang
Record `m_sb` and `m_dif` to replace `m_fscache`, `m_daxdev`, `m_fp` and `m_dax_part_off` in order to simplify the codebase. Note that `m_bdev` is still left since it can be assigned from `sb->s_bdev` directly. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212235401.2857246-1-hsiangkao@linux.alibaba.com
2024-12-16erofs: use `struct erofs_device_info` for the primary deviceGao Xiang
Instead of just listing each one directly in `struct erofs_sb_info` except that we still use `sb->s_bdev` for the primary block device. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241216125310.930933-2-hsiangkao@linux.alibaba.com
2024-12-13erofs: add erofs_sb_free() helperGao Xiang
Unify the common parts of erofs_fc_free() and erofs_kill_sb() as erofs_sb_free(). Thus, fput() in erofs_fc_get_tree() is no longer needed, too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241212133504.2047178-1-hsiangkao@linux.alibaba.com
2024-12-13erofs: fix PSI memstall accountingGao Xiang
Max Kellermann recently reported psi_group_cpu.tasks[NR_MEMSTALL] is incorrect in the 6.11.9 kernel. The root cause appears to be that, since the problematic commit, bio can be NULL, causing psi_memstall_leave() to be skipped in z_erofs_submit_queue(). Reported-by: Max Kellermann <max.kellermann@ionos.com> Closes: https://lore.kernel.org/r/CAKPOu+8tvSowiJADW2RuKyofL_CSkm_SuyZA7ME5vMLWmL6pqw@mail.gmail.com Fixes: 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241127085236.3538334-1-hsiangkao@linux.alibaba.com
2024-12-13erofs: fix rare pcluster memory leak after unmountingGao Xiang
There may still exist some pcluster with valid reference counts during unmounting. Instead of introducing another synchronization primitive, just try again as unmounting is relatively rare. This approach is similar to z_erofs_cache_invalidate_folio(). It was also reported by syzbot as a UAF due to commit f5ad9f9a603f ("erofs: free pclusters if no cached folio is attached"): BUG: KASAN: slab-use-after-free in do_raw_spin_trylock+0x72/0x1f0 kernel/locking/spinlock_debug.c:123 .. queued_spin_trylock include/asm-generic/qspinlock.h:92 [inline] do_raw_spin_trylock+0x72/0x1f0 kernel/locking/spinlock_debug.c:123 __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline] _raw_spin_trylock+0x20/0x80 kernel/locking/spinlock.c:138 spin_trylock include/linux/spinlock.h:361 [inline] z_erofs_put_pcluster fs/erofs/zdata.c:959 [inline] z_erofs_decompress_pcluster fs/erofs/zdata.c:1403 [inline] z_erofs_decompress_queue+0x3798/0x3ef0 fs/erofs/zdata.c:1425 z_erofs_decompressqueue_work+0x99/0xe0 fs/erofs/zdata.c:1437 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa68/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f2/0x390 kernel/kthread.c:389 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> However, it seems a long outstanding memory leak. Fix it now. Fixes: f5ad9f9a603f ("erofs: free pclusters if no cached folio is attached") Reported-by: syzbot+7ff87b095e7ca0c5ac39@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/674c1235.050a0220.ad585.0032.GAE@google.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241203072821.1885740-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: handle NONHEAD !delta[1] lclusters gracefullyGao Xiang
syzbot reported a WARNING in iomap_iter_done: iomap_fiemap+0x73b/0x9b0 fs/iomap/fiemap.c:80 ioctl_fiemap fs/ioctl.c:220 [inline] Generally, NONHEAD lclusters won't have delta[1]==0, except for crafted images and filesystems created by pre-1.0 mkfs versions. Previously, it would immediately bail out if delta[1]==0, which led to inadequate decompressed lengths (thus FIEMAP is impacted). Treat it as delta[1]=1 to work around these legacy mkfs versions. `lclusterbits > 14` is illegal for compact indexes, error out too. Reported-by: syzbot+6c0b301317aa0156f9eb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/67373c0c.050a0220.2a2fcc.0079.GAE@google.com Tested-by: syzbot+6c0b301317aa0156f9eb@syzkaller.appspotmail.com Fixes: d95ae5e25326 ("erofs: add support for the full decompressed length") Fixes: 001b8ccd0650 ("erofs: fix compact 4B support for 16k block size") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115173651.3339514-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: clarify direct I/O supportGao Xiang
Currently, only filesystems backed by block devices support direct I/O. Also remove the unnecessary strict checks that can be supported with iomap. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241115074625.2520728-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix blksize < PAGE_SIZE for file-backed mountsHongzhen Luo
Adjust sb->s_blocksize{,_bits} directly for file-backed mounts when the fs block size is smaller than PAGE_SIZE. Previously, EROFS used sb_set_blocksize(), which caused a panic if bdev-backed mounts is not used. Fixes: fb176750266a ("erofs: add file-backed mount support") Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20241015103836.3757438-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: get rid of `buf->kmap_type`Gao Xiang
After commit 927e5010ff5b ("erofs: use kmap_local_page() only for erofs_bread()"), `buf->kmap_type` actually has no use at all. Let's get rid of `buf->kmap_type` now. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114095813.839866-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: fix file-backed mounts over FUSEGao Xiang
syzbot reported a null-ptr-deref in fuse_read_args_fill: fuse_read_folio+0xb0/0x100 fs/fuse/file.c:905 filemap_read_folio+0xc6/0x2a0 mm/filemap.c:2367 do_read_cache_folio+0x263/0x5c0 mm/filemap.c:3825 read_mapping_folio include/linux/pagemap.h:1011 [inline] erofs_bread+0x34d/0x7e0 fs/erofs/data.c:41 erofs_read_superblock fs/erofs/super.c:281 [inline] erofs_fc_fill_super+0x2b9/0x2500 fs/erofs/super.c:625 Unlike most filesystems, some network filesystems and FUSE need unavoidable valid `file` pointers for their read I/Os [1]. Anyway, those use cases need to be supported too. [1] https://docs.kernel.org/filesystems/vfs.html Reported-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/6727bbdf.050a0220.3c8d68.0a7e.GAE@google.com Fixes: fb176750266a ("erofs: add file-backed mount support") Tested-by: syzbot+0b1279812c46e48bb0c1@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114234905.1873723-1-hsiangkao@linux.alibaba.com
2024-11-18erofs: simplify definition of the log functionsGou Hao
Use printk instead of pr_info/err to reduce redundant code. Signed-off-by: Gou Hao <gouhao@uniontech.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241114013247.30821-1-gouhao@uniontech.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: add sysfs node to drop internal cachesChunhai Guo
Add a sysfs node to drop compression-related caches, currently used to drop in-memory pclusters and cached compressed folios. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: free pclusters if no cached folio is attachedChunhai Guo
Once a pcluster is fully decompressed and there are no attached cached folios, its corresponding `struct z_erofs_pcluster` will be freed. This will significantly reduce the frequency of calls to erofs_shrink_scan() and the memory allocated for `struct z_erofs_pcluster`. The tables below show approximately a 96% reduction in the calls to erofs_shrink_scan() and in the memory allocated for `struct z_erofs_pcluster` after applying this patch. The results were obtained by performing a test to copy a 4.1GB partition on ARM64 Android devices running the 6.6 kernel with an 8-core CPU and 12GB of memory. 1. The reduction in calls to erofs_shrink_scan(): +-----------------+-----------+----------+---------+ | | w/o patch | w/ patch | diff | +-----------------+-----------+----------+---------+ | Average (times) | 11390 | 390 | -96.57% | +-----------------+-----------+----------+---------+ 2. The reduction in memory released by erofs_shrink_scan(): +-----------------+-----------+----------+---------+ | | w/o patch | w/ patch | diff | +-----------------+-----------+----------+---------+ | Average (Byte) | 133612656 | 4434552 | -96.68% | +-----------------+-----------+----------+---------+ Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241112043235.546164-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-11-18erofs: sunset `struct erofs_workgroup`Gao Xiang
`struct erofs_workgroup` was introduced to provide a unique header for all physically indexed objects. However, after big pclusters and shared pclusters are implemented upstream, it seems that all EROFS encoded data (which requires transformation) can be represented with `struct z_erofs_pcluster` directly. Move all members into `struct z_erofs_pcluster` for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com
2024-11-18erofs: move erofs_workgroup operations into zdata.cGao Xiang
Move related helpers into zdata.c as an intermediate step of getting rid of `struct erofs_workgroup`, and rename: erofs_workgroup_put => z_erofs_put_pcluster erofs_workgroup_get => z_erofs_get_pcluster erofs_try_to_release_workgroup => erofs_try_to_release_pcluster erofs_shrink_workstation => z_erofs_shrink_scan Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-2-hsiangkao@linux.alibaba.com
2024-11-18erofs: get rid of erofs_{find,insert}_workgroupGao Xiang
Just fold them into the only two callers since they are simple enough. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-1-hsiangkao@linux.alibaba.com
2024-11-12erofs: add SEEK_{DATA,HOLE} supportGao Xiang
Many userspace programs (including erofs-utils itself) use SEEK_DATA / SEEK_HOLE to parse hole extents in addition to FIEMAP. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241011065128.2097377-1-hsiangkao@linux.alibaba.com
2024-10-21erofs: use get_tree_bdev_flags() to avoid misleading messagesGao Xiang
Users can pass in an arbitrary source path for the proper type of a mount then without "Can't lookup blockdev" error message. Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Closes: https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241009033151.2334888-2-hsiangkao@linux.alibaba.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-14Merge tag 'erofs-for-6.12-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "The main one fixes a syzbot issue due to the invalid inode type out of file-backed mounts. The others are minor cleanups without actual logic changes. Summary: - Make sure only regular inodes can be used for file-backed mounts - Two minor codebase cleanups" * tag 'erofs-for-6.12-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: get rid of kaddr in `struct z_erofs_maprecorder` erofs: get rid of z_erofs_try_to_claim_pcluster() erofs: ensure regular inodes for file-backed mounts
2024-10-11erofs: get rid of kaddr in `struct z_erofs_maprecorder`Gao Xiang
`kaddr` becomes useless after switching to metabuf. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241010235830.1535616-1-hsiangkao@linux.alibaba.com
2024-10-11erofs: get rid of z_erofs_try_to_claim_pcluster()Gao Xiang
Just fold it into the caller for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241010090420.405871-1-hsiangkao@linux.alibaba.com
2024-10-11erofs: ensure regular inodes for file-backed mountsGao Xiang
Only regular inodes are allowed for file-backed mounts, not directories (as seen in the original syzbot case) or special inodes. Also ensure that .read_folio() is implemented on the underlying fs for the primary device. Fixes: fb176750266a ("erofs: add file-backed mount support") Reported-by: syzbot+001306cd9c92ce0df23f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/00000000000011bdde0622498ee3@google.com Tested-by: syzbot+001306cd9c92ce0df23f@syzkaller.appspotmail.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240917130803.32418-1-hsiangkao@linux.alibaba.com
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-12erofs: reject inodes with negative i_sizeGao Xiang
Negative i_size is never supported, although crafted images with inodes having negative i_size will NOT lead to security issues in our current codebase: The following image can verify this (gzip+base64 encoded): H4sICCmk4mYAA3Rlc3QuaW1nAGNgGAWjYBSMVPDo4dcH3jP2aTED2TwMKgxMUHHNJY/SQDQX LxcDIw3tZwXit44MDNpQ/n8gQJZ/vxjijosPuSyZ0DUDgQqcZoKzVYFsDShbHeh6PT29ktTi Eqz2g/y2pBFiLxDMh4lhs5+W4TAKRsEoGAWjYBSMglEwCkYBPQAAS2DbowAQAAA= Mark as bad inodes for such corrupted inodes explicitly. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240912083538.3011860-1-hsiangkao@linux.alibaba.com
2024-09-12erofs: restrict pcluster size limitationsGao Xiang
Error out if {en,de}encoded size of a pcluster is unsupported: Maximum supported encoded size (of a pcluster): 1 MiB Maximum supported decoded size (of a pcluster): 12 MiB Users can still choose to use supported large configurations (e.g., for archival purposes), but there may be performance penalties in low-memory scenarios compared to smaller pclusters. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240912074156.2925394-1-hsiangkao@linux.alibaba.com
2024-09-12erofs: allocate more short-lived pages from reserved pool firstChunhai Guo
This patch aims to allocate bvpages and short-lived compressed pages from the reserved pool first. After applying this patch, there are three benefits. 1. It reduces the page allocation time. The bvpages and short-lived compressed pages account for about 4% of the pages allocated from the system in the multi-app launch benchmarks [1]. It reduces the page allocation time accordingly and lowers the likelihood of blockage by page allocation in low memory scenarios. 2. The pages in the reserved pool will be allocated on demand. Currently, bvpages and short-lived compressed pages are short-lived pages allocated from the system, and the pages in the reserved pool all originate from short-lived pages. Consequently, the number of reserved pool pages will increase to z_erofs_rsv_nrpages over time. With this patch, all short-lived pages are allocated from the reserved pool first, so the number of reserved pool pages will only increase when there are not enough pages. Thus, even if z_erofs_rsv_nrpages is set to a large number for specific reasons, the actual number of reserved pool pages may remain low as per demand. In the multi-app launch benchmarks [1], z_erofs_rsv_nrpages is set at 256, while the number of reserved pool pages remains below 64. 3. When erofs cache decompression is disabled (EROFS_ZIP_CACHE_DISABLED), all pages will *only* be allocated from the reserved pool for erofs. This will significantly reduce the memory pressure from erofs. [1] For additional details on the multi-app launch benchmarks, please refer to commit 0f6273ab4637 ("erofs: add a reserved buffer pool for lz4 decompression"). Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240906121110.3701889-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-12erofs: sunset unneeded NOFAILsGao Xiang
With iterative development, our codebase can now deal with compressed buffer misses properly if both in-place I/O and compressed buffer allocation fail. Note that if readahead fails (with non-uptodate folios), the original request will then fall back to synchronous read, and `.read_folio()` should return appropriate errnos; otherwise -EIO will be passed to user space, which is unexpected. To simplify rarely encountered failure paths, a mimic decompression will be just used. Before that, failure reasons are recorded in compressed_bvecs[] and they also act as placeholders to avoid in-place pages. They will be parsed just before decompression and then pass back to `.read_folio()`. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240905084732.2684515-1-hsiangkao@linux.alibaba.com
2024-09-10erofs: simplify erofs_map_blocks_flatmode()Hongzhen Luo
Get rid of redundant variables (nblocks, offset) and a dead branch (!tailendpacking). Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240905030339.1474396-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-10erofs: refactor read_inode calling conventionYiyang Wu
Refactor out the iop binding behavior out of the erofs_fill_symlink and move erofs_buf into the erofs_read_inode, so that erofs_fill_inode can only deal with inode operation bindings and can be decoupled from metabuf operations. This results in better calling conventions. Note that after this patch, we do not need erofs_buf and ofs as parameters any more when calling erofs_read_inode as all the data operations are now included in itself. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20240425222847.GN2118490@ZenIV/ Signed-off-by: Yiyang Wu <toolmanp@tlmp.cc> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240902093412.509083-1-toolmanp@tlmp.cc Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-10erofs: use kmemdup_nul in erofs_fill_symlinkYiyang Wu
Remove open coding in erofs_fill_symlink. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20240425222847.GN2118490@ZenIV Signed-off-by: Yiyang Wu <toolmanp@tlmp.cc> Link: https://lore.kernel.org/r/20240902083147.450558-2-toolmanp@tlmp.cc Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-09-10erofs: mark experimental fscache backend deprecatedGao Xiang
Although fscache is still described as "General Filesystem Caching" for network filesystems and other things such as ISO9660 filesystems, it has actually become a part of netfslib recently, which was unexpected at the time when "EROFS over fscache" proposed (2021) since EROFS is entirely a disk filesystem and the dependency is redundant. Mark it deprecated and it will be removed after "fanotify pre-content hooks" lands, which will provide the same functionality for EROFS. Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240830032840.3783206-4-hsiangkao@linux.alibaba.com
2024-09-10erofs: support compressed inodes for fileioGao Xiang
Use pseudo bios just like the previous fscache approach since merged bio_vecs can be filled properly with unique interfaces. Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240830032840.3783206-3-hsiangkao@linux.alibaba.com