summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-11-01Merge tag 'overflow-v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull overflow updates from Kees Cook: "The end goal of the current buffer overflow detection work[0] is to gain full compile-time and run-time coverage of all detectable buffer overflows seen via array indexing or memcpy(), memmove(), and memset(). The str*() family of functions already have full coverage. While much of the work for these changes have been on-going for many releases (i.e. 0-element and 1-element array replacements, as well as avoiding false positives and fixing discovered overflows[1]), this series contains the foundational elements of several related buffer overflow detection improvements by providing new common helpers and FORTIFY_SOURCE changes needed to gain the introspection required for compiler visibility into array sizes. Also included are a handful of already Acked instances using the helpers (or related clean-ups), with many more waiting at the ready to be taken via subsystem-specific trees[2]. The new helpers are: - struct_group() for gaining struct member range introspection - memset_after() and memset_startat() for clearing to the end of structures - DECLARE_FLEX_ARRAY() for using flex arrays in unions or alone in structs Also included is the beginning of the refactoring of FORTIFY_SOURCE to support memcpy() introspection, fix missing and regressed coverage under GCC, and to prepare to fix the currently broken Clang support. Finishing this work is part of the larger series[0], but depends on all the false positives and buffer overflow bug fixes to have landed already and those that depend on this series to land. As part of the FORTIFY_SOURCE refactoring, a set of both a compile-time and run-time tests are added for FORTIFY_SOURCE and the mem*()-family functions respectively. The compile time tests have found a legitimate (though corner-case) bug[6] already. Please note that the appearance of "panic" and "BUG" in the FORTIFY_SOURCE refactoring are the result of relocating existing code, and no new use of those code-paths are expected nor desired. Finally, there are two tree-wide conversions for 0-element arrays and flexible array unions to gain sane compiler introspection coverage that result in no known object code differences. After this series (and the changes that have now landed via netdev and usb), we are very close to finally being able to build with -Warray-bounds and -Wzero-length-bounds. However, due corner cases in GCC[3] and Clang[4], I have not included the last two patches that turn on these options, as I don't want to introduce any known warnings to the build. Hopefully these can be solved soon" Link: https://lore.kernel.org/lkml/20210818060533.3569517-1-keescook@chromium.org/ [0] Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=FORTIFY_SOURCE [1] Link: https://lore.kernel.org/lkml/202108220107.3E26FE6C9C@keescook/ [2] Link: https://lore.kernel.org/lkml/3ab153ec-2798-da4c-f7b1-81b0ac8b0c5b@roeck-us.net/ [3] Link: https://bugs.llvm.org/show_bug.cgi?id=51682 [4] Link: https://lore.kernel.org/lkml/202109051257.29B29745C0@keescook/ [5] Link: https://lore.kernel.org/lkml/20211020200039.170424-1-keescook@chromium.org/ [6] * tag 'overflow-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits) fortify: strlen: Avoid shadowing previous locals compiler-gcc.h: Define __SANITIZE_ADDRESS__ under hwaddress sanitizer treewide: Replace 0-element memcpy() destinations with flexible arrays treewide: Replace open-coded flex arrays in unions stddef: Introduce DECLARE_FLEX_ARRAY() helper btrfs: Use memset_startat() to clear end of struct string.h: Introduce memset_startat() for wiping trailing members and padding xfrm: Use memset_after() to clear padding string.h: Introduce memset_after() for wiping trailing members/padding lib: Introduce CONFIG_MEMCPY_KUNIT_TEST fortify: Add compile-time FORTIFY_SOURCE tests fortify: Allow strlen() and strnlen() to pass compile-time known lengths fortify: Prepare to improve strnlen() and strlen() warnings fortify: Fix dropped strcpy() compile-time write overflow check fortify: Explicitly disable Clang support fortify: Move remaining fortify helpers into fortify-string.h lib/string: Move helper functions out of string.c compiler_types.h: Remove __compiletime_object_size() cm4000_cs: Use struct_group() to zero struct cm4000_dev region can: flexcan: Use struct_group() to zero struct flexcan_regs regions ...
2021-11-01Merge tag 'x86_cc_for_v5.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull generic confidential computing updates from Borislav Petkov: "Add an interface called cc_platform_has() which is supposed to be used by confidential computing solutions to query different aspects of the system. The intent behind it is to unify testing of such aspects instead of having each confidential computing solution add its own set of tests to code paths in the kernel, leading to an unwieldy mess" * tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide: Replace the use of mem_encrypt_active() with cc_platform_has() x86/sev: Replace occurrences of sev_es_active() with cc_platform_has() x86/sev: Replace occurrences of sev_active() with cc_platform_has() x86/sme: Replace occurrences of sme_active() with cc_platform_has() powerpc/pseries/svm: Add a powerpc version of cc_platform_has() x86/sev: Add an x86 version of cc_platform_has() arch/cc: Introduce a function to check for confidential computing features x86/ioremap: Selectively build arch override encryption functions
2021-11-01nfsd4: remove obselete commentJ. Bruce Fields
Mandatory locking has been removed. And the rest of this comment is redundant with the code. Reported-by: Jeff layton <jlayton@kernel.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-11-01Merge tag 'sched-core-2021-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - Revert the printk format based wchan() symbol resolution as it can leak the raw value in case that the symbol is not resolvable. - Make wchan() more robust and work with all kind of unwinders by enforcing that the task stays blocked while unwinding is in progress. - Prevent sched_fork() from accessing an invalid sched_task_group - Improve asymmetric packing logic - Extend scheduler statistics to RT and DL scheduling classes and add statistics for bandwith burst to the SCHED_FAIR class. - Properly account SCHED_IDLE entities - Prevent a potential deadlock when initial priority is assigned to a newly created kthread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function. - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems. - Improve idle balancing in general and especially for NOHZ enabled systems. - Provide proper interfaces for live patching so it does not have to fiddle with scheduler internals. - Add cluster aware scheduling support. - A small set of tweaks for RT (irqwork, wait_task_inactive(), various scheduler options and delaying mmdrop) - The usual small tweaks and improvements all over the place * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits) sched/fair: Cleanup newidle_balance sched/fair: Remove sysctl_sched_migration_cost condition sched/fair: Wait before decaying max_newidle_lb_cost sched/fair: Skip update_blocked_averages if we are defering load balance sched/fair: Account update_blocked_averages in newidle_balance cost x86: Fix __get_wchan() for !STACKTRACE sched,x86: Fix L2 cache mask sched/core: Remove rq_relock() sched: Improve wake_up_all_idle_cpus() take #2 irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ sched: Add cluster scheduler level for x86 sched: Add cluster scheduler level in core and related Kconfig for ARM64 topology: Represent clusters of CPUs within a die sched: Disable -Wunused-but-set-variable sched: Add wrapper for get_wchan() to keep task blocked x86: Fix get_wchan() to support the ORC unwinder proc: Use task_is_running() for wchan in /proc/$pid/stat ...
2021-11-01Merge tag 'for-5.16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The updates this time are more under the hood and enhancing existing features (subpage with compression and zoned namespaces). Performance related: - misc small inode logging improvements (+3% throughput, -11% latency on sample dbench workload) - more efficient directory logging: bulk item insertion, less tree searches and locking - speed up bulk insertion of items into a b-tree, which is used when logging directories, when running delayed items for directories (fsync and transaction commits) and when running the slow path (full sync) of an fsync (bulk creation run time -4%, deletion -12%) Core: - continued subpage support - make defragmentation work - make compression write work - zoned mode - support ZNS (zoned namespaces), zone capacity is number of usable blocks in each zone - add dedicated block group (zoned) for relocation, to prevent out of order writes in some cases - greedy block group reclaim, pick the ones with least usable space first - preparatory work for send protocol updates - error handling improvements - cleanups and refactoring Fixes: - lockdep warnings - in show_devname callback, on seeding device - device delete on loop device due to conversions to workqueues - fix deadlock between chunk allocation and chunk btree modifications - fix tracking of missing device count and status" * tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (140 commits) btrfs: remove root argument from check_item_in_log() btrfs: remove root argument from add_link() btrfs: remove root argument from btrfs_unlink_inode() btrfs: remove root argument from drop_one_dir_item() btrfs: clear MISSING device status bit in btrfs_close_one_device btrfs: call btrfs_check_rw_degradable only if there is a missing device btrfs: send: prepare for v2 protocol btrfs: fix comment about sector sizes supported in 64K systems btrfs: update device path inode time instead of bd_inode fs: export an inode_update_time helper btrfs: fix deadlock when defragging transparent huge pages btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE btrfs: update comments for chunk allocation -ENOSPC cases btrfs: fix deadlock between chunk allocation and chunk btree modifications btrfs: zoned: use greedy gc for auto reclaim btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls btrfs: add a btrfs_get_dev_args_from_path helper btrfs: handle device lookup with btrfs_dev_lookup_args ...
2021-11-01btrfs: fix lzo_decompress_bio() kmap leakageLinus Torvalds
Commit ccaa66c8dd27 reinstated the kmap/kunmap that had been dropped in commit 8c945d32e604 ("btrfs: compression: drop kmap/kunmap from lzo"). However, it seems to have done so incorrectly due to the change not reverting cleanly, and lzo_decompress_bio() ended up not having a matching "kunmap()" to the "kmap()" that was put back. Also, any assert that the page pointer is not NULL should be before the kmap() of said pointer, since otherwise you'd just oops in the kmap() before the assert would even trigger. I noticed this when trying to verify my btrfs merge, and things not adding up. I'm doing this fixup before re-doing my merge, because this commit needs to also be backported to 5.15 (after verification from the btrfs people). Fixes: ccaa66c8dd27 ("Revert 'btrfs: compression: drop kmap/kunmap from lzo'") Cc: David Sterba <dsterba@suse.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-01Merge tag 'exfat-for-5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: "Fix ->i_blocks truncation issue caused by wrong 32bit mask" * tag 'exfat-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix incorrect loading of i_blocks for large files
2021-11-01Merge tag 'erofs-for-5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "There are some new features available for this cycle. Firstly, EROFS LZMA algorithm support, specifically called MicroLZMA, is available as an option for embedded devices, LiveCDs and/or as the secondary auxiliary compression algorithm besides the primary algorithm in one file. In order to better support the LZMA fixed-sized output compression, especially for 4KiB pcluster size (which has lowest memory pressure thus useful for memory-sensitive scenarios), Lasse introduced a new LZMA header/container format called MicroLZMA to minimize the original LZMA1 header (for example, we don't need to waste 4-byte dictionary size and another 8-byte uncompressed size, which can be calculated by fs directly, for each pcluster) and enable EROFS fixed-sized output compression. Note that MicroLZMA can also be later used by other things in addition to EROFS too where wasting minimal amount of space for headers is important and it can be only compiled by enabling XZ_DEC_MICROLZMA. MicroLZMA has been supported by the latest upstream XZ embedded [1] & XZ utils [2], apply the latest related XZ embedded upstream patches by the XZ author Lasse here. Secondly, multiple device is also supported in this cycle, which is designed for multi-layer container images. By working together with inter-layer data deduplication and compression, we can achieve the next high-performance container image solution. Our team will announce the new Nydus container image service [3] implementation with new RAFS v6 (EROFS-compatible) format in Open Source Summit 2021 China [4] soon. Besides, the secondary compression head support and readmore decompression strategy are also included in this cycle. There are also some minor bugfixes and cleanups, as always. Summary: - support multiple devices for multi-layer container images; - support the secondary compression head; - support readmore decompression strategy; - support new LZMA algorithm (specifically called MicroLZMA); - some bugfixes & cleanups" * tag 'erofs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: don't trigger WARN() when decompression fails erofs: get rid of ->lru usage erofs: lzma compression support erofs: rename some generic methods in decompressor lib/xz, lib/decompress_unxz.c: Fix spelling in comments lib/xz: Add MicroLZMA decoder lib/xz: Move s->lzma.len = 0 initialization to lzma_reset() lib/xz: Validate the value before assigning it to an enum variable lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression erofs: introduce readmore decompression strategy erofs: introduce the secondary compression head erofs: get compression algorithms directly on mapping erofs: add multiple device support erofs: decouple basic mount options from fs_context erofs: remove the fast path of per-CPU buffer decompression
2021-11-01Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Eric Biggers: "Some cleanups for fs/crypto/: - Allow 256-bit master keys with AES-256-XTS - Improve documentation and comments - Remove unneeded field fscrypt_operations::max_namelen" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: improve a few comments fscrypt: allow 256-bit master keys with AES-256-XTS fscrypt: improve documentation for inline encryption fscrypt: clean up comments in bio.c fscrypt: remove fscrypt_operations::max_namelen
2021-11-01Merge tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block inode sync updates from Jens Axboe: "This contains improvements to how bdev inode syncing is handled, unifying the API" * tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block: block: simplify the block device syncing code ntfs3: use sync_blockdev_nowait fat: use sync_blockdev_nowait btrfs: use sync_blockdev xen-blkback: use sync_blockdev block: remove __sync_blockdev fs: remove __sync_filesystem
2021-11-01Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull kiocb->ki_complete() cleanup from Jens Axboe: "This removes the res2 argument from kiocb->ki_complete(). Only the USB gadget code used it, everybody else passes 0. The USB guys checked the user gadget code they could find, and everybody just uses res as expected for the async interface" * tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block: fs: get rid of the res2 iocb->ki_complete argument usb: remove res2 argument from gadget code completions
2021-11-01Merge tag 'for-5.16/passthrough-flag-2021-10-29' of ↵Linus Torvalds
git://git.kernel.dk/linux-block Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe: "This contains a series leading to the removal of the QUEUE_FLAG_SCSI_PASSTHROUGH queue flag" * tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block: block: remove blk_{get,put}_request block: remove QUEUE_FLAG_SCSI_PASSTHROUGH block: remove the initialize_rq_fn blk_mq_ops method scsi: add a scsi_alloc_request helper bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands sd: implement ->get_unique_id block: add a ->get_unique_id method
2021-11-01Merge tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull bdev size cleanups from Jens Axboe: "Clean up the bdev size handling with new bdev_nr_bytes() helper" * tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits) partitions/ibm: use bdev_nr_sectors instead of open coding it partitions/efi: use bdev_nr_bytes instead of open coding it block/ioctl: use bdev_nr_sectors and bdev_nr_bytes block: cache inode size in bdev udf: use sb_bdev_nr_blocks reiserfs: use sb_bdev_nr_blocks ntfs: use sb_bdev_nr_blocks jfs: use sb_bdev_nr_blocks ext4: use sb_bdev_nr_blocks block: add a sb_bdev_nr_blocks helper block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate squashfs: use bdev_nr_bytes instead of open coding it reiserfs: use bdev_nr_bytes instead of open coding it pstore/blk: use bdev_nr_bytes instead of open coding it ntfs3: use bdev_nr_bytes instead of open coding it nilfs2: use bdev_nr_bytes instead of open coding it nfs/blocklayout: use bdev_nr_bytes instead of open coding it jfs: use bdev_nr_bytes instead of open coding it hfsplus: use bdev_nr_sectors instead of open coding it hfs: use bdev_nr_sectors instead of open coding it ...
2021-11-01Merge tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Light on new features - basically just the hybrid mode support. Outside of that it's just fixes, cleanups, and performance improvements. In detail: - Add ring related information to the fdinfo output (Hao) - Hybrid async mode (Hao) - Support for batched issue on block (me) - sqe error trace improvement (me) - IOPOLL efficiency improvements (Pavel) - submit state cleanups and improvements (Pavel) - Completion side improvements (Pavel) - Drain improvements (Pavel) - Buffer selection cleanups (Pavel) - Fixed file node improvements (Pavel) - io-wq setup cancelation fix (Pavel) - Various other performance improvements and cleanups (Pavel) - Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)" * tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits) io-wq: remove worker to owner tw dependency io_uring: harder fdinfo sq/cq ring iterating io_uring: don't assign write hint in the read path io_uring: clusterise ki_flags access in rw_prep io_uring: kill unused param from io_file_supports_nowait io_uring: clean up timeout async_data allocation io_uring: don't try io-wq polling if not supported io_uring: check if opcode needs poll first on arming io_uring: clean iowq submit work cancellation io_uring: clean io_wq_submit_work()'s main loop io-wq: use helper for worker refcounting io_uring: implement async hybrid mode for pollable requests io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR()) io_uring: split logic of force_nonblock io_uring: warning about unused-but-set parameter io_uring: inform block layer of how many requests we are submitting io_uring: simplify io_file_supports_nowait() io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags io_uring: arm poll for non-nowait files fs/io_uring: Prioritise checking faster conditions first in io_write ...
2021-11-01Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - mq-deadline accounting improvements (Bart) - blk-wbt timer fix (Andrea) - Untangle the block layer includes (Christoph) - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph) - Block layer core support for multi-actuator drives (Damien) - blk-crypto improvements (Eric) - Batched tag allocation support (me) - Request completion batching support (me) - Plugging improvements (me) - Shared tag set improvements (John) - Concurrent queue quiesce support (Ming) - Cache bdev in ->private_data for block devices (Pavel) - bdev dio improvements (Pavel) - Block device invalidation and block size improvements (Xie) - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me) * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits) blk-mq-debugfs: Show active requests per queue for shared tags block: improve readability of blk_mq_end_request_batch() virtio-blk: Use blk_validate_block_size() to validate block size loop: Use blk_validate_block_size() to validate block size nbd: Use blk_validate_block_size() to validate block size block: Add a helper to validate the block size block: re-flow blk_mq_rq_ctx_init() block: prefetch request to be initialized block: pass in blk_mq_tags to blk_mq_rq_ctx_init() block: add rq_flags to struct blk_mq_alloc_data block: add async version of bio_set_polled block: kill DIO_MULTI_BIO block: kill unused polling bits in __blkdev_direct_IO() block: avoid extra iter advance with async iocb block: Add independent access ranges support blk-mq: don't issue request directly in case that current is to be blocked sbitmap: silence data race warning blk-cgroup: synchronize blkg creation against policy deactivation block: refactor bio_iov_bvec_set() block: add single bio async direct IO helper ...
2021-11-01Merge tag 'locks-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "Most of this is just follow-on cleanup work of documentation and comments from the mandatory locking removal in v5.15. The only real functional change is that LOCK_MAND flock() support is also being removed, as it has basically been non-functional since the v2.5 days" * tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove leftover comments from mandatory locking removal locks: remove changelog comments docs: fs: locks.rst: update comment about mandatory file locking Documentation: remove reference to now removed mandatory-locking doc locks: remove LOCK_MAND flock lock support
2021-11-01Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ...
2021-11-01exfat: fix incorrect loading of i_blocks for large filesSungjong Seo
When calculating i_blocks, there was a mistake that was masked with a 32-bit variable. So i_blocks for files larger than 4 GiB had incorrect values. Mask with a 64-bit variable instead of 32-bit one. Fixes: 5f2aa075070c ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7+ Reported-by: Ganapathi Kamath <hgkamath@hotmail.com> Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2021-10-31erofs: don't trigger WARN() when decompression failsGao Xiang
syzbot reported a WARNING [1] due to corrupted compressed data. As Dmitry said, "If this is not a kernel bug, then the code should not use WARN. WARN if for kernel bugs and is recognized as such by all testing systems and humans." [1] https://lore.kernel.org/r/000000000000b3586105cf0ff45e@google.com Link: https://lore.kernel.org/r/20211025074311.130395-1-hsiangkao@linux.alibaba.com Cc: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Reported-by: syzbot+d8aaffc3719597e8cfb4@syzkaller.appspotmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-10-30xfs: use swap() to make code cleanerChangcheng Deng
Use swap() in order to make code cleaner. Issue found by coccinelle. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-30xfs: Remove duplicated include in xfs_superWan Jiabing
Fix following checkincludes.pl warning: ./fs/xfs/xfs_super.c: xfs_btree.h is included more than once. The include is in line 15. Remove the duplicated here. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-29signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)Eric W. Biederman
Now that force_fatal_sig exists it is unnecessary and a bit confusing to use force_sigsegv in cases where the simpler force_fatal_sig is wanted. So change every instance we can to make the code clearer. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-29exit/kthread: Have kernel threads return instead of calling do_exitEric W. Biederman
In 2009 Oleg reworked[1] the kernel threads so that it is not necessary to call do_exit if you are not using kthread_stop(). Remove the explicit calls of do_exit and complete_and_exit (with a NULL completion) that were previously necessary. [1] 63706172f332 ("kthreads: rework kthread_stop()") Link: https://lkml.kernel.org/r/20211020174406.17889-12-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-10-29Merge tag 'for-5.15-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Last minute fixes for crash on 32bit architectures when compression is in use. It's a regression introduced in 5.15-rc and I'd really like not let this into the final release, fixes via stable trees would add unnecessary delay. The problem is on 32bit architectures with highmem enabled, the pages for compression may need to be kmapped, while the patches removed that as we don't use GFP_HIGHMEM allocations anymore. The pages that don't come from local allocation still may be from highmem. Despite being on 32bit there's enough such ARM machines in use so it's not a marginal issue. I did full reverts of the patches one by one instead of a huge one. There's one exception for the "lzo" revert as there was an intermediate patch touching the same code to make it compatible with subpage. I can't revert that one too, so the revert in lzo.c is manual. Qu Wenruo has worked on that with me and verified the changes" * tag 'for-5.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: compression: drop kmap/kunmap from lzo" Revert "btrfs: compression: drop kmap/kunmap from zlib" Revert "btrfs: compression: drop kmap/kunmap from zstd" Revert "btrfs: compression: drop kmap/kunmap from generic helpers"
2021-10-29f2fs: support fault injection for dquot_initialize()Chao Yu
This patch adds a new function f2fs_dquot_initialize() to wrap dquot_initialize(), and it supports to inject fault into f2fs_dquot_initialize() to simulate inner failure occurs in dquot_initialize(). Usage: a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type or b) mount -o fault_type=65536 <dev> <mountpoint> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-29f2fs: fix incorrect return value in f2fs_sanity_check_ckpt()Chao Yu
As Pavel Machek reported in [1] This code looks quite confused: part of function returns 1 on corruption, part returns -errno. The problem is not stable-specific. [1] https://lkml.org/lkml/2021/9/19/207 Let's fix to make 'insane cp_payload case' to return 1 rater than EFSCORRUPTED, so that return value can be kept consistent for all error cases, it can avoid confusion of code logic. Fixes: 65ddf6564843 ("f2fs: fix to do sanity check for sb/cp fields correctly") Reported-by: Pavel Machek <pavel@denx.de> Reviewed-by: Pavel Machek <pavel@denx.de> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-29io-wq: remove worker to owner tw dependencyPavel Begunkov
INFO: task iou-wrk-6609:6612 blocked for more than 143 seconds. Not tainted 5.15.0-rc5-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:iou-wrk-6609 state:D stack:27944 pid: 6612 ppid: 6526 flags:0x00004006 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xb44/0x5960 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x176/0x280 kernel/sched/completion.c:138 io_worker_exit fs/io-wq.c:183 [inline] io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 io-wq worker may submit a task_work to the master task and upon io_worker_exit() wait for the tw to get executed. The problem appears when the master task is waiting in coredump.c: 468 freezer_do_not_count(); 469 wait_for_completion(&core_state->startup); 470 freezer_count(); Apparently having some dependency on children threads getting everything stuck. Workaround it by cancelling the taks_work callback that causes it before going into io_worker_exit() waiting. p.s. probably a better option is to not submit tw elevating the refcount in the first place, but let's leave this excercise for the future. Cc: stable@vger.kernel.org Reported-and-tested-by: syzbot+27d62ee6f256b186883e@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/142a716f4ed936feae868959059154362bfa8c19.1635509451.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-29io_uring: harder fdinfo sq/cq ring iteratingJens Axboe
The ring iteration is racy, which isn't necessarily a problem except it can cause us to iterate the whole thing. That isn't desired or ideal, and it can lead to excessive runtimes of reading fdinfo. Cap the iteration at tail - head OR the ring size. While in there, clean up the ring masking and just dump the raw values along with the masks. That provides more useful debug info. Fixes: 83f84356bc8f ("io_uring: add more uring info to fdinfo for debug") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-29ovl: fix use after free in struct ovl_aio_reqyangerkun
Example for triggering use after free in a overlay on ext4 setup: aio_read ovl_read_iter vfs_iter_read ext4_file_read_iter ext4_dio_read_iter iomap_dio_rw -> -EIOCBQUEUED /* * Here IO is completed in a separate thread, * ovl_aio_cleanup_handler() frees aio_req which has iocb embedded */ file_accessed(iocb->ki_filp); /**BOOM**/ Fix by introducing a refcount in ovl_aio_req similarly to aio_kiocb. This guarantees that iocb is only freed after vfs_read/write_iter() returns on underlying fs. Fixes: 2406a307ac7d ("ovl: implement async IO routines") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20210930032228.3199690-3-yangerkun@huawei.com/ Cc: <stable@vger.kernel.org> # v5.6 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from lzo"David Sterba
This reverts commit 8c945d32e60427cbc0859cf7045bbe6196bb03d8. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. The revert does not apply cleanly due to changes in a6e66e6f8c1b ("btrfs: rework lzo_decompress_bio() to make it subpage compatible") that reworked the page iteration so the revert is done to be equivalent to the original code. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zlib"David Sterba
This reverts commit 696ab562e6df9fbafd6052d8ce4aafcb2ed16069. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29Revert "btrfs: compression: drop kmap/kunmap from zstd"David Sterba
This reverts commit bbaf9715f3f5b5ff0de71da91fcc34ee9c198ed8. The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible eg. by running fstest btrfs/004 with enabled LZO or ZSTD compression. Example stacktrace with ZSTD on a 32bit ARM machine: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c4159ed3 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] PREEMPT SMP ARM Modules linked in: CPU: 0 PID: 210 Comm: kworker/u2:3 Not tainted 5.14.0-rc79+ #12 Hardware name: Allwinner sun4i/sun5i Families Workqueue: btrfs-delalloc btrfs_work_helper PC is at mmiocpy+0x48/0x330 LR is at ZSTD_compressStream_generic+0x15c/0x28c (mmiocpy) from [<c0629648>] (ZSTD_compressStream_generic+0x15c/0x28c) (ZSTD_compressStream_generic) from [<c06297dc>] (ZSTD_compressStream+0x64/0xa0) (ZSTD_compressStream) from [<c049444c>] (zstd_compress_pages+0x170/0x488) (zstd_compress_pages) from [<c0496798>] (btrfs_compress_pages+0x124/0x12c) (btrfs_compress_pages) from [<c043c068>] (compress_file_range+0x3c0/0x834) (compress_file_range) from [<c043c4ec>] (async_cow_start+0x10/0x28) (async_cow_start) from [<c0475c3c>] (btrfs_work_helper+0x100/0x230) (btrfs_work_helper) from [<c014ef68>] (process_one_work+0x1b4/0x418) (process_one_work) from [<c014f210>] (worker_thread+0x44/0x524) (worker_thread) from [<c0156aa4>] (kthread+0x180/0x1b0) (kthread) from [<c0100150>] Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839 Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from check_item_in_log()Filipe Manana
The root argument passed to check_item_in_log() always matches the root of the given directory, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from add_link()Filipe Manana
The root argument for tree-log.c:add_link() always matches the root of the given directory and the given inode, so it can eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from btrfs_unlink_inode()Filipe Manana
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: remove root argument from drop_one_dir_item()Filipe Manana
The root argument for drop_one_dir_item() always matches the root of the given directory inode, since each log tree is associated to one and only one subvolume/root, so remove the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: clear MISSING device status bit in btrfs_close_one_deviceLi Zhang
Reported bug: https://github.com/kdave/btrfs-progs/issues/389 There's a problem with scrub reporting aborted status but returning error code 0, on a filesystem with missing and readded device. Roughly these steps: - mkfs -d raid1 dev1 dev2 - fill with data - unmount - make dev1 disappear - mount -o degraded - copy more data - make dev1 appear again Running scrub afterwards reports that the command was aborted, but the system log message says the exit code was 0. It seems that the cause of the error is decrementing fs_devices->missing_devices but not clearing device->dev_state. Every time we umount filesystem, it would call close_ctree, And it would eventually involve btrfs_close_one_device to close the device, but it only decrements fs_devices->missing_devices but does not clear the device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer Overflow, because every time umount, fs_devices->missing_devices will decrease. If fs_devices->missing_devices value hit 0, it would overflow. With added debugging: loop1: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311) loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 0 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): using free space tree BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000f706684d /dev/loop1 18446744073709551615 BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 18446744073709551615 If fs_devices->missing_devices is 0, next time it would be 18446744073709551615 After apply this patch, the fs_devices->missing_devices seems to be right: $ truncate -s 10g test1 $ truncate -s 10g test2 $ losetup /dev/loop1 test1 $ losetup /dev/loop2 test2 $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f $ losetup -d /dev/loop2 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ mount -o degraded /dev/loop1 /mnt/1 $ umount /mnt/1 $ dmesg loop1: detected capacity change from 0 to 20971520 loop2: detected capacity change from 0 to 20971520 BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863) BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863) BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): checking UUID tree BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 BTRFS info (device loop1): flagging fs with big metadata feature BTRFS info (device loop1): allowing degraded mounts BTRFS info (device loop1): disk space caching is enabled BTRFS info (device loop1): has skinny extents BTRFS info (device loop1): before clear_missing.00000000975bd577 /dev/loop1 0 BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing BTRFS info (device loop1): before clear_missing.0000000000000000 /dev/loop2 1 CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Li Zhang <zhanglikernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: call btrfs_check_rw_degradable only if there is a missing deviceAnand Jain
In open_ctree() in btrfs_check_rw_degradable() [1], we check each block group individually if at least the minimum number of devices is available for that profile. If all the devices are available, then we don't have to check degradable. [1] open_ctree() :: 3559 if (!sb_rdonly(sb) && !btrfs_check_rw_degradable(fs_info, NULL)) { Also before calling btrfs_check_rw_degradable() in open_ctee() at the line number shown below [2] we call btrfs_read_chunk_tree() and down to add_missing_dev() to record number of missing devices. [2] open_ctree() :: 3454 ret = btrfs_read_chunk_tree(fs_info); btrfs_read_chunk_tree() read_one_chunk() / read_one_dev() add_missing_dev() So, check if there is any missing device before btrfs_check_rw_degradable() in open_ctree(). Also, with this the mount command could save ~16ms.[3] in the most common case, that is no device is missing. [3] 1) * 16934.96 us | btrfs_check_rw_degradable [btrfs](); CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-29btrfs: send: prepare for v2 protocolDavid Sterba
This is preparatory work for send protocol update to version 2 and higher. We have many pending protocol update requests but still don't have the basic protocol rev in place, the first thing that must happen is to do the actual versioning support. The protocol version is u32 and is a new member in the send ioctl struct. Validity of the version field is backed by a new flag bit. Old kernels would fail when a higher version is requested. Version protocol 0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION, that's also exported in sysfs. The version is still unchanged and will be increased once we have new incompatible commands or stream updates. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-28ocfs2: fix race between searching chunks and release journal_head from ↵Gautham Ananthakrishna
buffer_head Encountered a race between ocfs2_test_bg_bit_allocatable() and jbd2_journal_put_journal_head() resulting in the below vmcore. PID: 106879 TASK: ffff880244ba9c00 CPU: 2 COMMAND: "loop3" Call trace: panic oops_end no_context __bad_area_nosemaphore bad_area_nosemaphore __do_page_fault do_page_fault page_fault [exception RIP: ocfs2_block_group_find_clear_bits+316] ocfs2_block_group_find_clear_bits [ocfs2] ocfs2_cluster_group_search [ocfs2] ocfs2_search_chain [ocfs2] ocfs2_claim_suballoc_bits [ocfs2] __ocfs2_claim_clusters [ocfs2] ocfs2_claim_clusters [ocfs2] ocfs2_local_alloc_slide_window [ocfs2] ocfs2_reserve_local_alloc_bits [ocfs2] ocfs2_reserve_clusters_with_limit [ocfs2] ocfs2_reserve_clusters [ocfs2] ocfs2_lock_refcount_allocators [ocfs2] ocfs2_make_clusters_writable [ocfs2] ocfs2_replace_cow [ocfs2] ocfs2_refcount_cow [ocfs2] ocfs2_file_write_iter [ocfs2] lo_rw_aio loop_queue_work kthread_worker_fn kthread ret_from_fork When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), the bg_bh->b_private NULL as jbd2_journal_put_journal_head() raced and released the jounal head from the buffer head. Needed to take bit lock for the bit 'BH_JournalHead' to fix this race. Link: https://lkml.kernel.org/r/1634820718-6043-1-git-send-email-gautham.ananthakrishna@oracle.com Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <rajesh.sivaramasubramaniom@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-28fuse: add FOPEN_NOFLUSHAmir Goldstein
Add flag returned by FUSE_OPEN and FUSE_CREATE requests to avoid flushing data cache on close. Different filesystems implement ->flush() is different ways: - Most disk filesystems do not implement ->flush() at all - Some network filesystem (e.g. nfs) flush local write cache of FMODE_WRITE file and send a "flush" command to server - Some network filesystem (e.g. cifs) flush local write cache of FMODE_WRITE file without sending an additional command to server FUSE flushes local write cache of ANY file, even non FMODE_WRITE and sends a "flush" command to server (if server implements it). The FUSE implementation of ->flush() seems over agressive and arbitrary and does not make a lot of sense when writeback caching is disabled. Instead of deciding on another arbitrary implementation that makes sense, leave the choice of per-file flush behavior in the hands of the server. Link: https://lore.kernel.org/linux-fsdevel/CAJfpegspE8e6aKd47uZtSYX8Y-1e1FWS0VL0DH2Skb9gQP5RJQ@mail.gmail.com/ Suggested-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: only update necessary attributesMiklos Szeredi
fuse_update_attributes() refreshes metadata for internal use. Each use needs a particular set of attributes to be refreshed, but currently that cannot be expressed and all but atime are refreshed. Add a mask argument, which lets fuse_update_get_attr() to decide based on the cache_mask and the inval_mask whether a GETATTR call is needed or not. Reported-by: Yongji Xie <xieyongji@bytedance.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: take cache_mask into account in getattrMiklos Szeredi
When deciding to send a GETATTR request take into account the cache mask (which attributes are always valid). The cache mask takes precedence over the invalid mask. This results in the GETATTR request not being sent unnecessarily. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: add cache_maskMiklos Szeredi
If writeback_cache is enabled, then the size, mtime and ctime attributes of regular files are always valid in the kernel's cache. They are retrieved from userspace only when the inode is freshly looked up. Add a more generic "cache_mask", that indicates which attributes are currently valid in cache. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: move reverting attributes to fuse_change_attributes()Miklos Szeredi
In case of writeback_cache fuse_fillattr() would revert the queried attributes to the cached version. Move this to fuse_change_attributes() in order to manage the writeback logic in a central helper. This will be necessary for patches that follow. Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling fuse_change_attributes(), so this should not change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: simplify local variables holding writeback cache stateMiklos Szeredi
There are two instances of "bool is_wb = fc->writeback_cache" where the actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)". Clean up these cases by storing the second condition in the local variable. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: cleanup code conditional on fc->writeback_cacheMiklos Szeredi
It's safe to call file_update_time() if writeback cache is not enabled, since S_NOCMTIME is set in this case. This part is purely a cleanup. __fuse_copy_file_range() also calls fuse_write_update_attr() only in the writeback cache case. This is inconsistent with other callers, where it's called unconditionally. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: fix attr version comparison in fuse_read_update_size()Miklos Szeredi
A READ request returning a short count is taken as indication of EOF, and the cached file size is modified accordingly. Fix the attribute version checking to allow for changes to fc->attr_version on other inodes. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: always invalidate attributes after writesMiklos Szeredi
Extend the fuse_write_update_attr() helper to invalidate cached attributes after a write. This has already been done in all cases except in fuse_notify_store(), so this is mostly a cleanup. fuse_direct_write_iter() calls fuse_direct_IO() which already calls fuse_write_update_attr(), so don't repeat that again in the former. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-28fuse: rename fuse_write_update_size()Miklos Szeredi
This function already updates the attr_version in fuse_inode, regardless of whether the size was changed or not. Rename the helper to fuse_write_update_attr() to reflect the more generic nature. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>