summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-05-25Merge tag 'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Dave Chinner: "This is a big update with lots of new code. The summary below them all, so I'll just touch on teh higlights. The two main new features are Large Extent Counts and Logged Attribute Replay - these are two new foundational features that we are building more complex future features on top of. For upcoming functionality, we need to be able to store hundreds of millions of xattrs per inode. The Large Extent Count feature removes the limits that prevent this scale of xattr storage, and while we were modifying the on disk extent count format we also increased the number of data extents we support per inode from 2^32 to 2^47. We also need to be able to modify xattrs as part of larger atomic transactions rather than as standalone transactions. The Logged Attribute Replay feature introduces the infrastructure that allows us to use intents to record the attribute modifications in the journal before we start them, hence allowing other atomic transactions to log attribute modification intents and then defer the actual modification to later. If we then crash, log recovery then guarantees that the attribute is replayed in the context of the atomic transaction that logged the intent. A significant chunk of the commits in this merge are for the base attribute replay functionality along with fixes, improvements and cleanups related to this new functioanlity. Allison deserves a big round of thanks for her ongoing work to get this functionality into XFS. There are also many other smaller changes and improvements, so overall this is one of the bigger XFS merge requests in some time. I will be following up next week with another smaller pull request - we already have another round of fixes and improvements to the logged attribute replay functionality just about ready to go. They'll soak and test over the next week, and I'll send a pull request for them near the end of the merge window. Summary: - support for printk message indexing. - large extent counts to provide support for up to 2^47 data extents and 2^32 attribute extents, allowing us to scale beyond 4 billion data extents to billions of xattrs per inode. - conversion of various flags fields to be consistently declared as unsigned bit fields. - improvements to realtime extent accounting and converts them to per-cpu counters to match all the other block and inode accounting. - reworks core log formatting code to reduce iterations, have a shorter, cleaner fast path and generally be easier to understand and maintain. - improvements to rmap btree searches that reduce overhead by up to 30% resulting in xfs_scrub runtime reductions of 15%. - improvements to reflink that remove the size limitations in remapping operations and greatly reduce the size of transaction reservations. - reworks the minimum log size calculations to allow us to change transaction reservations without changing the minimum supported log size. - removal of quota warning support as it has never been used on Linux. - intent whiteouts to allow us to cancel intents that are completed entirely in memory rather than having use CPU and disk bandwidth formatting and writing them into the journal when it is not necessary. This makes rmap, reflink and extent freeing slightly more efficient, but provides massive improvements for.... - Logged Attribute Replay feature support. This is a fundamental change to the way we modify attributes, laying the foundation for future integration of attribute modifications as part of other atomic transactional operations the filesystem performs. - Lots of cleanups and fixes for the logged attribute replay functionality" * tag 'xfs-5.19-for-linus' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (124 commits) xfs: can't use kmem_zalloc() for attribute buffers xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify xfs: ATTR_REPLACE algorithm with LARP enabled needs rework xfs: use XFS_DA_OP flags in deferred attr ops xfs: remove xfs_attri_remove_iter xfs: switch attr remove to xfs_attri_set_iter xfs: introduce attr remove initial states into xfs_attr_set_iter xfs: xfs_attr_set_iter() does not need to return EAGAIN xfs: clean up final attr removal in xfs_attr_set_iter xfs: remote xattr removal in xfs_attr_set_iter() is conditional xfs: XFS_DAS_LEAF_REPLACE state only needed if !LARP xfs: split remote attr setting out from replace path xfs: consolidate leaf/node states in xfs_attr_set_iter xfs: kill XFS_DAC_LEAF_ADDNAME_INIT xfs: separate out initial attr_set states xfs: don't set quota warning values xfs: remove warning counters from struct xfs_dquot_res xfs: remove quota warning limit from struct xfs_quota_limits xfs: rework deferred attribute operation setup xfs: make xattri_leaf_bp more useful ...
2022-05-25Merge tag 'fsnotify_for_v5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "The biggest part of this is support for fsnotify inode marks that don't pin inodes in memory but rather get evicted together with the inode (they are useful if userspace needs to exclude receipt of events from potentially large subtrees using fanotify ignore marks). There is also a fix for more consistent handling of events sent to parent and a fix of sparse(1) complaints" * tag 'fsnotify_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: fix incorrect fmode_t casts fsnotify: consistent behavior for parent not watching children fsnotify: introduce mark type iterator fanotify: enable "evictable" inode marks fanotify: use fsnotify group lock helpers fanotify: implement "evictable" inode marks fanotify: factor out helper fanotify_mark_update_flags() fanotify: create helper fanotify_mark_user_flags() fsnotify: allow adding an inode mark without pinning inode dnotify: use fsnotify group lock helpers nfsd: use fsnotify group lock helpers audit: use fsnotify group lock helpers inotify: use fsnotify group lock helpers fsnotify: create helpers for group mark_mutex lock fsnotify: make allow_dups a property of the group fsnotify: pass flags argument to fsnotify_alloc_group() fsnotify: fix wrong lockdep annotations inotify: move control flags from mask to mark flags inotify: show inotify mask flags in proc fdinfo
2022-05-25Merge tag 'fs_for_v5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull writeback and ext2 cleanups from Jan Kara: "One small ext2 cleanup and one writeback spelling fix" * tag 'fs_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: fix typo in comment fs: ext2: Fix duplicate included linux/dax.h
2022-05-25Merge tag 'size_t-saturating-helpers-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull misc hardening updates from Gustavo Silva: "Replace a few open-coded instances with size_t saturating arithmetic helpers" * tag 'size_t-saturating-helpers-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: virt: acrn: Prefer array_size and struct_size over open coded arithmetic afs: Prefer struct_size over open coded arithmetic
2022-05-25Merge tag 'net-next-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection" * tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits) ptp: ocp: Add firmware header checks ptp: ocp: fix PPS source selector debugfs reporting ptp: ocp: add .init function for sma_op vector ptp: ocp: vectorize the sma accessor functions ptp: ocp: constify selectors ptp: ocp: parameterize input/output sma selectors ptp: ocp: revise firmware display ptp: ocp: add Celestica timecard PCI ids ptp: ocp: Remove #ifdefs around PCI IDs ptp: ocp: 32-bit fixups for pci start address Revert "net/smc: fix listen processing for SMC-Rv2" ath6kl: Use cc-disable-warning to disable -Wdangling-pointer selftests/bpf: Dynptr tests bpf: Add dynptr data slices bpf: Add bpf_dynptr_read and bpf_dynptr_write bpf: Dynptr support for ring buffers bpf: Add bpf_dynptr_from_mem for local dynptrs bpf: Add verifier support for dynptrs bpf: Suppress 'passing zero to PTR_ERR' warning bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack ...
2022-05-24Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
2022-05-24Merge tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap updates from Darrick Wong: "There's a couple of corrections sent in by Andreas for some accounting errors. The biggest change this time around is that writeback errors longer clear pageuptodate nor does XFS invalidate the page cache anymore. This brings XFS (and gfs2/zonefs) behavior in line with every other Linux filesystem driver, and fixes some UAF bugs that only cropped up after willy turned on multipage folios for XFS in 5.18-rc1. Regrettably, it took all the way to the end of the 5.18 cycle to find the source of these bugs and reach a consensus that XFS' writeback failure behavior from 20 years ago is no longer necessary. Summary: - Fix a couple of accounting errors in the buffered io code. - Discontinue the practice of marking folios !uptodate and invalidating them when writeback fails. This fixes some UAF bugs when multipage folios are enabled, and brings the behavior of XFS/gfs/zonefs into alignment with the behavior of all the other Linux filesystems" * tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: don't invalidate folios after writeback errors iomap: iomap_write_end cleanup iomap: iomap_write_failed fix
2022-05-24Merge tag 'dlm-5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This includes several large patches to improve endian handling and remove sparse warnings. The code previously used in/out, in-place endianness conversion functions. Other code cleanup includes the list iterator changes. Finally, a long standing bug was found and fixed, caused by missed decrement on an lock struct ref count" * tag 'dlm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (28 commits) dlm: use kref_put_lock in __put_lkb dlm: use kref_put_lock in put_rsb dlm: remove unnecessary error assign dlm: fix missing lkb refcount handling fs: dlm: cast resource pointer to uintptr_t dlm: replace usage of found with dedicated list iterator variable dlm: remove usage of list iterator for list_add() after the loop body dlm: fix pending remove if msg allocation fails dlm: fix wake_up() calls for pending remove dlm: check required context while close dlm: cleanup lock handling in dlm_master_lookup dlm: remove found label in dlm_master_lookup dlm: remove __user conversion warnings dlm: move conversion to compile time dlm: use __le types for dlm messages dlm: use __le types for rcom messages dlm: use __le types for dlm header dlm: use __le types for options header dlm: add __CHECKER__ for false positives dlm: move global to static inits ...
2022-05-24Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various bug fixes and cleanups for ext4. In particular, move the crypto related fucntions from fs/ext4/super.c into a new fs/ext4/crypto.c, and fix a number of bugs found by fuzzers and error injection tools" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: only allow test_dummy_encryption when supported ext4: fix bug_on in __es_tree_search ext4: avoid cycles in directory h-tree ext4: verify dir block before splitting it ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_state ext4: fix bug_on in ext4_writepages ext4: refactor and move ext4_ioctl_get_encryption_pwsalt() ext4: cleanup function defs from ext4.h into crypto.c ext4: move ext4 crypto code to its own file crypto.c ext4: fix memory leak in parse_apply_sb_mount_options() ext4: reject the 'commit' option on ext2 filesystems ext4: remove duplicated #include of dax.h in inode.c ext4: fix race condition between ext4_write and ext4_convert_inline_data ext4: convert symlink external data block mapping to bdev ext4: add nowait mode for ext4_getblk() ext4: fix journal_ioprio mount option handling ext4: mark group as trimmed only if it was fully scanned ext4: fix use-after-free in ext4_rename_dir_prepare ext4: add unmount filesystem message ext4: remove unnecessary conditionals ...
2022-05-24Merge tag 'gfs2-v5.18-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Clean up the allocation of glocks that have an address space attached - Quota locking fix and quota iomap conversion - Fix the FITRIM error reporting - Some list iterator cleanups * tag 'gfs2-v5.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Convert function bh_get to use iomap gfs2: use i_lock spin_lock for inode qadata gfs2: Return more useful errors from gfs2_rgrp_send_discards() gfs2: Use container_of() for gfs2_glock(aspace) gfs2: Explain some direct I/O oddities gfs2: replace 'found' with dedicated list iterator variable
2022-05-24Merge tag 'for-5.19-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - subpage: - support for PAGE_SIZE > 4K (previously only 64K) - make it work with raid56 - repair super block num_devices automatically if it does not match the number of device items - defrag can convert inline extents to regular extents, up to now inline files were skipped but the setting of mount option max_inline could affect the decision logic - zoned: - minimal accepted zone size is explicitly set to 4MiB - make zone reclaim less aggressive and don't reclaim if there are enough free zones - add per-profile sysfs tunable of the reclaim threshold - allow automatic block group reclaim for non-zoned filesystems, with sysfs tunables - tree-checker: new check, compare extent buffer owner against owner rootid Performance: - avoid blocking on space reservation when doing nowait direct io writes (+7% throughput for reads and writes) - NOCOW write throughput improvement due to refined locking (+3%) - send: reduce pressure to page cache by dropping extent pages right after they're processed Core: - convert all radix trees to xarray - add iterators for b-tree node items - support printk message index - user bulk page allocation for extent buffers - switch to bio_alloc API, use on-stack bios where convenient, other bio cleanups - use rw lock for block groups to favor concurrent reads - simplify workques, don't allocate high priority threads for all normal queues as we need only one - refactor scrub, process chunks based on their constraints and similarity - allocate direct io structures on stack and pass around only pointers, avoids allocation and reduces potential error handling Fixes: - fix count of reserved transaction items for various inode operations - fix deadlock between concurrent dio writes when low on free data space - fix a few cases when zones need to be finished VFS, iomap: - add helper to check if sb write has started (usable for assertions) - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io" * tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits) btrfs: zoned: introduce a minimal zone size 4M and reject mount btrfs: allow defrag to convert inline extents to regular extents btrfs: add "0x" prefix for unsupported optional features btrfs: do not account twice for inode ref when reserving metadata units btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer btrfs: send: avoid trashing the page cache btrfs: send: keep the current inode open while processing it btrfs: allocate the btrfs_dio_private as part of the iomap dio bio btrfs: move struct btrfs_dio_private to inode.c btrfs: remove the disk_bytenr in struct btrfs_dio_private btrfs: allocate dio_data on stack iomap: add per-iomap_iter private data iomap: allow the file system to provide a bio_set for direct I/O btrfs: add a btrfs_dio_rw wrapper btrfs: zoned: zone finish unused block group btrfs: zoned: properly finish block group on metadata write btrfs: zoned: finish block group when there are no more allocatable bytes left btrfs: zoned: consolidate zone finish functions btrfs: zoned: introduce btrfs_zoned_bg_is_full btrfs: improve error reporting in lookup_inline_extent_backref ...
2022-05-24Merge tag 'zonefs-5.19-rc1-fix' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fix from Damien Le Moal: "A single patch to fix zonefs_init_file_inode() return value" * tag 'zonefs-5.19-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Fix zonefs_init_file_inode() return value
2022-05-24Merge tag 'erofs-for-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs (and fscache) updates from Gao Xiang: "After working on it on the mailing list for more than half a year, we finally form 'erofs over fscache' feature into shape. Hopefully it could bring more possibility to the communities. The story mainly started from a new project what we called "RAFS v6" [1] for Nydus image service almost a year ago, which enhances EROFS to be a new form of one bootstrap (which includes metadata representing the whole fs tree) + several data-deduplicated content addressable blobs (actually treated as multiple devices). Each blob can represent one container image layer but not quite exactly since all new data can be fully existed in the previous blobs so no need to introduce another new blob. It is actually not a new idea (at least on my side it's much like a simpilied casync [2] for now) and has many benefits over per-file blobs or some other exist ways since typically each RAFS v6 image only has dozens of device blobs instead of thousands of per-file blobs. It's easy to be signed with user keys as a golden image, transfered untouchedly with minimal overhead over the network, kept in some type of storage conveniently, and run with (optional) runtime verification but without involving too many irrelevant features crossing the system beyond EROFS itself. At least it's our final goal and we're keeping working on it. There was also a good summary of this approach from the casync author [3]. Regardless further optimizations, this work is almost done in the previous Linux release cycles. In this round, we'd like to introduce on-demand load for EROFS with the fscache/cachefiles infrastructure, considering the following advantages: - Introduce new file-based backend to EROFS. Although each image only contains dozens of blobs but in densely-deployed runC host for example, there could still be massive blobs on a machine, which is messy if each blob is treated as a device. In contrast, fscache and cachefiles are really great interfaces for us to make them work. - Introduce on-demand load to fscache and EROFS. Previously, fscache is mainly used to caching network-likewise filesystems, now it can support on-demand downloading for local fses too with the exact localfs on-disk format. It has many advantages which we're been described in the latest patchset cover letter [4]. In addition to that, most importantly, the cached data is still stored in the original local fs on-disk format so that it's still the one signed with private keys but only could be partially available. Users can fully trust it during running. Later, users can also back up cachefiles easily to another machine. - More reliable on-demand approach in principle. After data is all available locally, user daemon can be no longer online in some use cases, which helps daemon crash recovery (filesystems can still in service) and hot-upgrade (user daemon can be upgraded more frequently due to new features or protocols introduced.) - Other format can also be converted to EROFS filesystem format over the internet on the fly with the new on-demand load feature and mounted. That is entirely possible with on-demand load feature as long as such archive format metadata can be fetched in advance like stargz. In addition, although currently our target user is Nydus image service [5], but laterly, it can be used for other use cases like on-demand system booting, etc. As for the fscache on-demand load feature itself, strictly it can be used for other local fses too. Laterly we could promote most code to the iomap infrastructure and also enhance it in the read-write way if other local fses are interested. Thanks David Howells for taking so much time and patience on this these months, many thanks with great respect here again! Thanks Jeffle for working on this feature and Xin Yin from Bytedance for asynchronous I/O implementation as well as Zichen Tian, Jia Zhu, and Yan Song for testing, much appeciated. We're also exploring more possibly over fscache cache management over FSDAX for secure containers and working on more improvements and useful features for fscache, cachefiles, and on-demand load. In addition to "erofs over fscache", NFS export and idmapped mount are also completed in this cycle for container use cases as well. Summary: - Add erofs on-demand load support over fscache - Support NFS export for erofs - Support idmapped mounts for erofs - Don't prompt for risk any more when using big pcluster - Fix buffer copy overflow of ztailpacking feature - Several minor cleanups" [1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com [2] https://github.com/systemd/casync [3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html [4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com [5] https://github.com/dragonflyoss/image-service * tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits) erofs: scan devices from device table erofs: change to use asynchronous io for fscache readpage/readahead erofs: add 'fsid' mount option erofs: implement fscache-based data readahead erofs: implement fscache-based data read for inline layout erofs: implement fscache-based data read for non-inline layout erofs: implement fscache-based metadata read erofs: register fscache context for extra data blobs erofs: register fscache context for primary data blob erofs: add erofs_fscache_read_folios() helper erofs: add anonymous inode caching metadata for data blobs erofs: add fscache context helper functions erofs: register fscache volume erofs: add fscache mode check helper erofs: make erofs_map_blocks() generally available cachefiles: document on-demand read mode cachefiles: add tracepoints for on-demand read mode cachefiles: enable on-demand read mode cachefiles: implement on-demand read cachefiles: notify the user daemon when withdrawing cookie ...
2022-05-24Merge tag 'exfat-for-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - fix referencing wrong parent directory information during rename - introduce a sys_tz mount option to use system timezone - improve performance while zeroing a cluster with dirsync mount option - fix slab-out-bounds in exat_clear_bitmap() reported from syzbot * tag 'exfat-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: check if cluster num is valid exfat: reduce block requests when zeroing a cluster block: add sync_blockdev_range() exfat: introduce mount option 'sys_tz' exfat: fix referencing wrong parent directory information after renaming
2022-05-24Merge tag 'fs.idmapped.v5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull fs idmapping updates from Christian Brauner: "This contains two minor updates: - An update to the idmapping documentation by Rodrigo making it easier to understand that we first introduce several use-cases that fail without idmapped mounts simply to explain how they can be handled with idmapped mounts. - When changing a mount's idmapping we now hold writers to make it more robust. This is similar to turning a mount ro with the difference that in contrast to turning a mount ro changing the idmapping can only ever be done once while a mount can transition between ro and rw as much as it wants. The vfs layer itself takes care to retrieve the idmapping of a mount once ensuring that the idmapping used for vfs permission checking is identical to the idmapping passed down to the filesystem. All filesystems with FS_ALLOW_IDMAP raised take the same precautions as the vfs in code-paths that are outside of direct control of the vfs such as ioctl()s. However, holding writers makes this more robust and predictable for both the kernel and userspace. This is a minor user-visible change. But it is extremely unlikely to matter. The caller must've created a detached mount via OPEN_TREE_CLONE and then handed that O_PATH fd to another process or thread which then must've gotten a writable fd for that mount and started creating files in there while the caller is still changing mount properties. While not impossible it will be an extremely rare corner-case and should in general be considered a bug in the application. Consider making a mount MOUNT_ATTR_NOEXEC or MOUNT_ATTR_NODEV while allowing someone else to perform lookups or exec'ing in parallel by handing them a copy of the OPEN_TREE_CLONE fd or another fd beneath that mount. I've pinged all major users of idmapped mounts pointing out this change and none of them have active writers on a mount while still changing mount properties. It would've been strange if they did. The rest and majority of the work will be coming through the overlayfs tree this cycle. In addition to overlayfs this cycle should also see support for idmapped mounts on erofs as I've acked a patch to this effect a little while ago" * tag 'fs.idmapped.v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: hold writers when changing mount's idmapping docs: Add small intro to idmap examples
2022-05-24Merge tag 'integrity-v5.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull IMA updates from Mimi Zohar: "New is IMA support for including fs-verity file digests and signatures in the IMA measurement list as well as verifying the fs-verity file digest based signatures, both based on policy. In addition, are two bug fixes: - avoid reading UEFI variables, which cause a page fault, on Apple Macs with T2 chips. - remove the original "ima" template Kconfig option to address a boot command line ordering issue. The rest is a mixture of code/documentation cleanup" * tag 'integrity-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: integrity: Fix sparse warnings in keyring_handler evm: Clean up some variables evm: Return INTEGRITY_PASS for enum integrity_status value '0' efi: Do not import certificates from UEFI Secure Boot for T2 Macs fsverity: update the documentation ima: support fs-verity file digest based version 3 signatures ima: permit fsverity's file digests in the IMA measurement list ima: define a new template field named 'd-ngv2' and templates fs-verity: define a function to return the integrity protected file digest ima: use IMA default hash algorithm for integrity violations ima: fix 'd-ng' comments and documentation ima: remove the IMA_TEMPLATE Kconfig option ima: remove redundant initialization of pointer 'file'.
2022-05-24Merge tag 'execve-v5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Fix binfmt_flat GOT handling for riscv (Niklas Cassel) - Remove unused/broken binfmt_flat shared library and coredump code (Eric W. Biederman) * tag 'execve-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_flat: Remove shared library support binfmt_flat: Drop vestiges of coredump support binfmt_flat: do not stop relocating GOT entries prematurely on riscv
2022-05-24ext4: only allow test_dummy_encryption when supportedEric Biggers
Make the test_dummy_encryption mount option require that the encrypt feature flag be already enabled on the filesystem, rather than automatically enabling it. Practically, this means that "-O encrypt" will need to be included in MKFS_OPTIONS when running xfstests with the test_dummy_encryption mount option. (ext4/053 also needs an update.) Moreover, as long as the preconditions for test_dummy_encryption are being tightened anyway, take the opportunity to start rejecting it when !CONFIG_FS_ENCRYPTION rather than ignoring it. The motivation for requiring the encrypt feature flag is that: - Having the filesystem auto-enable feature flags is problematic, as it bypasses the usual sanity checks. The specific issue which came up recently is that in kernel versions where ext4 supports casefold but not encrypt+casefold (v5.1 through v5.10), the kernel will happily add the encrypt flag to a filesystem that has the casefold flag, making it unmountable -- but only for subsequent mounts, not the initial one. This confused the casefold support detection in xfstests, causing generic/556 to fail rather than be skipped. - The xfstests-bld test runners (kvm-xfstests et al.) already use the required mkfs flag, so they will not be affected by this change. Only users of test_dummy_encryption alone will be affected. But, this option has always been for testing only, so it should be fine to require that the few users of this option update their test scripts. - f2fs already requires it (for its equivalent feature flag). Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24ext4: fix bug_on in __es_tree_searchBaokun Li
Hulk Robot reported a BUG_ON: ================================================================== kernel BUG at fs/ext4/extents_status.c:199! [...] RIP: 0010:ext4_es_end fs/ext4/extents_status.c:199 [inline] RIP: 0010:__es_tree_search+0x1e0/0x260 fs/ext4/extents_status.c:217 [...] Call Trace: ext4_es_cache_extent+0x109/0x340 fs/ext4/extents_status.c:766 ext4_cache_extents+0x239/0x2e0 fs/ext4/extents.c:561 ext4_find_extent+0x6b7/0xa20 fs/ext4/extents.c:964 ext4_ext_map_blocks+0x16b/0x4b70 fs/ext4/extents.c:4384 ext4_map_blocks+0xe26/0x19f0 fs/ext4/inode.c:567 ext4_getblk+0x320/0x4c0 fs/ext4/inode.c:980 ext4_bread+0x2d/0x170 fs/ext4/inode.c:1031 ext4_quota_read+0x248/0x320 fs/ext4/super.c:6257 v2_read_header+0x78/0x110 fs/quota/quota_v2.c:63 v2_check_quota_file+0x76/0x230 fs/quota/quota_v2.c:82 vfs_load_quota_inode+0x5d1/0x1530 fs/quota/dquot.c:2368 dquot_enable+0x28a/0x330 fs/quota/dquot.c:2490 ext4_quota_enable fs/ext4/super.c:6137 [inline] ext4_enable_quotas+0x5d7/0x960 fs/ext4/super.c:6163 ext4_fill_super+0xa7c9/0xdc00 fs/ext4/super.c:4754 mount_bdev+0x2e9/0x3b0 fs/super.c:1158 mount_fs+0x4b/0x1e4 fs/super.c:1261 [...] ================================================================== Above issue may happen as follows: ------------------------------------- ext4_fill_super ext4_enable_quotas ext4_quota_enable ext4_iget __ext4_iget ext4_ext_check_inode ext4_ext_check __ext4_ext_check ext4_valid_extent_entries Check for overlapping extents does't take effect dquot_enable vfs_load_quota_inode v2_check_quota_file v2_read_header ext4_quota_read ext4_bread ext4_getblk ext4_map_blocks ext4_ext_map_blocks ext4_find_extent ext4_cache_extents ext4_es_cache_extent ext4_es_cache_extent __es_tree_search ext4_es_end BUG_ON(es->es_lblk + es->es_len < es->es_lblk) The error ext4 extents is as follows: 0af3 0300 0400 0000 00000000 extent_header 00000000 0100 0000 12000000 extent1 00000000 0100 0000 18000000 extent2 02000000 0400 0000 14000000 extent3 In the ext4_valid_extent_entries function, if prev is 0, no error is returned even if lblock<=prev. This was intended to skip the check on the first extent, but in the error image above, prev=0+1-1=0 when checking the second extent, so even though lblock<=prev, the function does not return an error. As a result, bug_ON occurs in __es_tree_search and the system panics. To solve this problem, we only need to check that: 1. The lblock of the first extent is not less than 0. 2. The lblock of the next extent is not less than the next block of the previous extent. The same applies to extent_idx. Cc: stable@kernel.org Fixes: 5946d089379a ("ext4: check for overlapping extents in ext4_valid_extent_entries()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220518120816.1541863-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24ext4: avoid cycles in directory h-treeJan Kara
A maliciously corrupted filesystem can contain cycles in the h-tree stored inside a directory. That can easily lead to the kernel corrupting tree nodes that were already verified under its hands while doing a node split and consequently accessing unallocated memory. Fix the problem by verifying traversed block numbers are unique. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220518093332.13986-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24ext4: verify dir block before splitting itJan Kara
Before splitting a directory block verify its directory entries are sane so that the splitting code does not access memory it should not. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220518093332.13986-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-05-24ext4: filter out EXT4_FC_REPLAY from on-disk superblock field s_stateTheodore Ts'o
The EXT4_FC_REPLAY bit in sbi->s_mount_state is used to indicate that we are in the middle of replay the fast commit journal. This was actually a mistake, since the sbi->s_mount_info is initialized from es->s_state. Arguably s_mount_state is misleadingly named, but the name is historical --- s_mount_state and s_state dates back to ext2. What should have been used is the ext4_{set,clear,test}_mount_flag() inline functions, which sets EXT4_MF_* bits in sbi->s_mount_flags. The problem with using EXT4_FC_REPLAY is that a maliciously corrupted superblock could result in EXT4_FC_REPLAY getting set in s_mount_state. This bypasses some sanity checks, and this can trigger a BUG() in ext4_es_cache_extent(). As a easy-to-backport-fix, filter out the EXT4_FC_REPLAY bit for now. We should eventually transition away from EXT4_FC_REPLAY to something like EXT4_MF_REPLAY. Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20220420192312.1655305-1-phind.uet@gmail.com Link: https://lore.kernel.org/r/20220517174028.942119-1-tytso@mit.edu Reported-by: syzbot+c7358a3cd05ee786eb31@syzkaller.appspotmail.com
2022-05-24gfs2: Convert function bh_get to use iomapBob Peterson
Before this patch, function bh_get used block_map to figure out the block it needed to read in from the quota_change file. This patch changes it to use iomap directly to make it more efficient. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24gfs2: use i_lock spin_lock for inode qadataBob Peterson
Before this patch, functions gfs2_qa_get and _put used the i_rw_mutex to prevent simultaneous access to its i_qadata. But i_rw_mutex is now used for many other things, including iomap_begin and end, which causes a conflict according to lockdep. We cannot just remove the lock since simultaneous opens (gfs2_open -> gfs2_open_common -> gfs2_qa_get) can then stomp on each others values for i_qadata. This patch solves the conflict by using the i_lock spin_lock in the inode to prevent simultaneous access. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24gfs2: Return more useful errors from gfs2_rgrp_send_discards()Andrew Price
The bug that 27ca8273f ("gfs2: Make sure FITRIM minlen is rounded up to fs block size") fixes was a little confusing as the user saw "Input/output error" which masked the -EINVAL that sb_issue_discard() returned. sb_issue_discard() can fail for various reasons, so we should return its return value from gfs2_rgrp_send_discards() to avoid all errors being reported as IO errors. This improves error reporting for FITRIM and makes no difference to the -o discard code path because the return value from gfs2_rgrp_send_discards() gets thrown away in that case (and the option switches off). Presumably that's why it was ok to just return -EIO in the past, before FITRIM was implemented. Tested with xfstests. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24gfs2: Use container_of() for gfs2_glock(aspace)Kees Cook
Clang's structure layout randomization feature gets upset when it sees struct address_space (which is randomized) cast to struct gfs2_glock. This is due to seeing the mapping pointer as being treated as an array of gfs2_glock, rather than "something else, before struct address_space": In file included from fs/gfs2/acl.c:23: fs/gfs2/meta_io.h:44:12: error: casting from randomized structure pointer type 'struct address_space *' to 'struct gfs2_glock *' return (((struct gfs2_glock *)mapping) - 1)->gl_name.ln_sbd; ^ Replace the instances of open-coded pointer math with container_of() usage, and update the allocator to match. Some cleanups and conversion of gfs2_glock_get() and gfs2_glock_dealloc() by Andreas. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/202205041550.naKxwCBj-lkp@intel.com Cc: Bob Peterson <rpeterso@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Bill Wendling <morbo@google.com> Cc: cluster-devel@redhat.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24gfs2: Explain some direct I/O odditiesAndreas Gruenbacher
Add some comments explaining the oddities of partial direct I/O reads and writes. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-24Merge tag 'fsverity-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "A couple small cleanups for fs/verity/" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fs-verity: Use struct_size() helper in enable_verity() fs-verity: remove unused parameter desc_size in fsverity_create_info()
2022-05-24Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Eric Biggers: "Some cleanups for fs/crypto/: - Split up the misleadingly-named FS_CRYPTO_BLOCK_SIZE constant. - Consistently report the encryption implementation that is being used. - Add helper functions for the test_dummy_encryption mount option that work properly with the new mount API. ext4 and f2fs will use these" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: add new helper functions for test_dummy_encryption fscrypt: factor out fscrypt_policy_to_key_spec() fscrypt: log when starting to use inline encryption fscrypt: split up FS_CRYPTO_BLOCK_SIZE
2022-05-24zonefs: Fix zonefs_init_file_inode() return valueDamien Le Moal
Commit 87c9ce3ffec9 ("zonefs: Add active seq file accounting") wrongly changed zonefs_init_file_inode() to always return 0 even if the call to zonefs_zone_mgmt() fails. Fix this by propagating zonefs_zone_mgmt() return value as the return value for zonefs_init_file_inode(). Fixes: 87c9ce3ffec9 ("zonefs: Add active seq file accounting") Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-05-23Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Initial support for the ARMv9 Scalable Matrix Extension (SME). SME takes the approach used for vectors in SVE and extends this to provide architectural support for matrix operations. No KVM support yet, SME is disabled in guests. - Support for crashkernel reservations above ZONE_DMA via the 'crashkernel=X,high' command line option. - btrfs search_ioctl() fix for live-lock with sub-page faults. - arm64 perf updates: support for the Hisilicon "CPA" PMU for monitoring coherent I/O traffic, support for Arm's CMN-650 and CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup. - Kselftest updates for SME, BTI, MTE. - Automatic generation of the system register macros from a 'sysreg' file describing the register bitfields. - Update the type of the function argument holding the ESR_ELx register value to unsigned long to match the architecture register size (originally 32-bit but extended since ARMv8.0). - stacktrace cleanups. - ftrace cleanups. - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(), avoid executable mappings in kexec/hibernate code, drop TLB flushing from get_clear_flush() (and rename it to get_clear_contig()), ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE. * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (145 commits) arm64/sysreg: Generate definitions for FAR_ELx arm64/sysreg: Generate definitions for DACR32_EL2 arm64/sysreg: Generate definitions for CSSELR_EL1 arm64/sysreg: Generate definitions for CPACR_ELx arm64/sysreg: Generate definitions for CONTEXTIDR_ELx arm64/sysreg: Generate definitions for CLIDR_EL1 arm64/sve: Move sve_free() into SVE code section arm64: Kconfig.platforms: Add comments arm64: Kconfig: Fix indentation and add comments arm64: mm: avoid writable executable mappings in kexec/hibernate code arm64: lds: move special code sections out of kernel exec segment arm64/hugetlb: Implement arm64 specific huge_ptep_get() arm64/hugetlb: Use ptep_get() to get the pte value of a huge page arm64: kdump: Do not allocate crash low memory if not needed arm64/sve: Generate ZCR definitions arm64/sme: Generate defintions for SVCR arm64/sme: Generate SMPRI_EL1 definitions arm64/sme: Automatically generate SMPRIMAP_EL2 definitions arm64/sme: Automatically generate SMIDR_EL1 defines arm64/sme: Automatically generate defines for SMCR ...
2022-05-23Merge tag 'x86_cleanups_for_v5.19_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: - Serious sanitization and cleanup of the whole APERF/MPERF and frequency invariance code along with removing the need for unnecessary IPIs - Finally remove a.out support - The usual trivial cleanups and fixes all over x86 * tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86: Remove empty files x86/speculation: Add missing srbds=off to the mitigations= help text x86/prctl: Remove pointless task argument x86/aperfperf: Make it correct on 32bit and UP kernels x86/aperfmperf: Integrate the fallback code from show_cpuinfo() x86/aperfmperf: Replace arch_freq_get_on_cpu() x86/aperfmperf: Replace aperfmperf_get_khz() x86/aperfmperf: Store aperf/mperf data for cpu frequency reads x86/aperfmperf: Make parts of the frequency invariance code unconditional x86/aperfmperf: Restructure arch_scale_freq_tick() x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct x86/aperfmperf: Untangle Intel and AMD frequency invariance init x86/aperfmperf: Separate AP/BP frequency invariance init x86/smp: Move APERF/MPERF code where it belongs x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu() x86/process: Fix kernel-doc warning due to a changed function name x86: Remove a.out support x86/mm: Replace nodes_weight() with nodes_empty() where appropriate x86: Replace cpumask_weight() with cpumask_empty() where appropriate x86/pkeys: Remove __arch_set_user_pkey_access() declaration ...
2022-05-23Merge tag 'zonefs-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs updates from Damien Le Moal: "This improves zonefs open sequential file accounting and adds accounting for active sequential files to allow the user to handle the maximum number of active zones of an NVMe ZNS drive. sysfs attributes for both open and active sequential files are also added to facilitate access to this information from applications without resorting to inspecting the block device limits" * tag 'zonefs-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: documentation: zonefs: Document sysfs attributes documentation: zonefs: Cleanup the mount options section zonefs: Add active seq file accounting zonefs: Export open zone resource information through sysfs zonefs: Always do seq file write open accounting zonefs: Rename super block information fields zonefs: Fix management of open zones zonefs: Clear inode information flags on inode creation
2022-05-23Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...
2022-05-23Merge tag 'for-5.19/writeback-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull writeback fix from Jens Axboe: "A single writeback fix that didn't belong in any other branch, correcting the number of skipped pages" * tag 'for-5.19/writeback-2022-05-22' of git://git.kernel.dk/linux-block: fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
2022-05-23Merge tag 'for-5.19/io_uring-passthrough-2022-05-22' of ↵Linus Torvalds
git://git.kernel.dk/linux-block Pull io_uring NVMe command passthrough from Jens Axboe: "On top of everything else, this adds support for passthrough for io_uring. The initial feature for this is NVMe passthrough support, which allows non-filesystem based IO commands and admin commands. To support this, io_uring grows support for SQE and CQE members that are twice as big, allowing to pass in a full NVMe command without having to copy data around. And to complete with more than just a single 32-bit value as the output" * tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block: (22 commits) io_uring: cleanup handling of the two task_work lists nvme: enable uring-passthrough for admin commands nvme: helper for uring-passthrough checks blk-mq: fix passthrough plugging nvme: add vectored-io support for uring-cmd nvme: wire-up uring-cmd support for io-passthru on char-device. nvme: refactor nvme_submit_user_cmd() block: wire-up support for passthrough plugging fs,io_uring: add infrastructure for uring-cmd io_uring: support CQE32 for nop operation io_uring: enable CQE32 io_uring: support CQE32 in /proc info io_uring: add tracing for additional CQE32 fields io_uring: overflow processing for CQE32 io_uring: flush completions for CQE32 io_uring: modify io_get_cqe for CQE32 io_uring: add CQE32 completion processing io_uring: add CQE32 setup processing io_uring: change ring size calculation for CQE32 io_uring: store add. return values for CQE32 ...
2022-05-23Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring 'more data in socket' support from Jens Axboe: "To be able to fully utilize the 'poll first' support in the core io_uring branch, it's advantageous knowing if the socket was empty after a receive. This adds support for that" * tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: return hint on whether more data is available after receive tcp: pass back data left in socket after receive
2022-05-23Merge tag 'for-5.19/io_uring-socket-2022-05-22' of ↵Linus Torvalds
git://git.kernel.dk/linux-block Pull io_uring socket() support from Jens Axboe: "This adds support for socket(2) for io_uring. This is handy when using direct / registered file descriptors with io_uring. Outside of those two patches, a small series from Dylan on top that improves the tracing by providing a text representation of the opcode rather than needing to decode this by reading the header file every time. That sits in this branch as it was the last opcode added (until it wasn't...)" * tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: use the text representation of ops in trace io_uring: rename op -> opcode io_uring: add io_uring_get_opcode io_uring: add type to op enum io_uring: add socket(2) support net: add __sys_socket_file()
2022-05-23Merge tag 'for-5.19/io_uring-xattr-2022-05-22' of ↵Linus Torvalds
git://git.kernel.dk/linux-block Pull io_uring xattr support from Jens Axboe: "Support for the xattr variants" * tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: cleanup error-handling around io_req_complete io_uring: fix trace for reduced sqe padding io_uring: add fgetxattr and getxattr support io_uring: add fsetxattr and setxattr support fs: split off do_getxattr from getxattr fs: split off setxattr_copy and do_setxattr function from setxattr
2022-05-23Merge tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Here are the main io_uring changes for 5.19. This contains: - Fixes for sparse type warnings (Christoph, Vasily) - Support for multi-shot accept (Hao) - Support for io_uring managed fixed files, rather than always needing the applicationt o manage the indices (me) - Fix for a spurious poll wakeup (Dylan) - CQE overflow fixes (Dylan) - Support more types of cancelations (me) - Support for co-operative task_work signaling, rather than always forcing an IPI (me) - Support for doing poll first when appropriate, rather than always attempting a transfer first (me) - Provided buffer cleanups and support for mapped buffers (me) - Improve how io_uring handles inflight SCM files (Pavel) - Speedups for registered files (Pavel, me) - Organize the completion data in a struct in io_kiocb rather than keep it in separate spots (Pavel) - task_work improvements (Pavel) - Cleanup and optimize the submission path, in general and for handling links (Pavel) - Speedups for registered resource handling (Pavel) - Support sparse buffers and file maps (Pavel, me) - Various fixes and cleanups (Almog, Pavel, me)" * tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block: (111 commits) io_uring: fix incorrect __kernel_rwf_t cast io_uring: disallow mixed provided buffer group registrations io_uring: initialize io_buffer_list head when shared ring is unregistered io_uring: add fully sparse buffer registration io_uring: use rcu_dereference in io_close io_uring: consistently use the EPOLL* defines io_uring: make apoll_events a __poll_t io_uring: drop a spurious inline on a forward declaration io_uring: don't use ERR_PTR for user pointers io_uring: use a rwf_t for io_rw.flags io_uring: add support for ring mapped supplied buffers io_uring: add io_pin_pages() helper io_uring: add buffer selection support to IORING_OP_NOP io_uring: fix locking state for empty buffer group io_uring: implement multishot mode for accept io_uring: let fast poll support multishot io_uring: add REQ_F_APOLL_MULTISHOT for requests io_uring: add IORING_ACCEPT_MULTISHOT for accept io_uring: only wake when the correct events are set io_uring: avoid io-wq -EAGAIN looping for !IOPOLL ...
2022-05-23writeback: fix typo in commentJulia Lawall
Spelling mistake (triple letters) in comment. Detected with the help of Coccinelle. Link: https://lore.kernel.org/r/20220521111145.81697-32-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jan Kara <jack@suse.cz>
2022-05-23fanotify: fix incorrect fmode_t castsVasily Averin
Fixes sparce warnings: fs/notify/fanotify/fanotify_user.c:267:63: sparse: warning: restricted fmode_t degrades to integer fs/notify/fanotify/fanotify_user.c:1351:28: sparse: warning: restricted fmode_t degrades to integer FMODE_NONTIFY have bitwise fmode_t type and requires __force attribute for any casts. Signed-off-by: Vasily Averin <vvs@openvz.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/9adfd6ac-1b89-791e-796b-49ada3293985@openvz.org
2022-05-23exfat: check if cluster num is validTadeusz Struk
Syzbot reported slab-out-of-bounds read in exfat_clear_bitmap. This was triggered by reproducer calling truncute with size 0, which causes the following trace: BUG: KASAN: slab-out-of-bounds in exfat_clear_bitmap+0x147/0x490 fs/exfat/balloc.c:174 Read of size 8 at addr ffff888115aa9508 by task syz-executor251/365 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack_lvl+0x1e2/0x24b lib/dump_stack.c:118 print_address_description+0x81/0x3c0 mm/kasan/report.c:233 __kasan_report mm/kasan/report.c:419 [inline] kasan_report+0x1a4/0x1f0 mm/kasan/report.c:436 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:309 exfat_clear_bitmap+0x147/0x490 fs/exfat/balloc.c:174 exfat_free_cluster+0x25a/0x4a0 fs/exfat/fatent.c:181 __exfat_truncate+0x99e/0xe00 fs/exfat/file.c:217 exfat_truncate+0x11b/0x4f0 fs/exfat/file.c:243 exfat_setattr+0xa03/0xd40 fs/exfat/file.c:339 notify_change+0xb76/0xe10 fs/attr.c:336 do_truncate+0x1ea/0x2d0 fs/open.c:65 Move the is_valid_cluster() helper from fatent.c to a common header to make it reusable in other *.c files. And add is_valid_cluster() to validate if cluster number is within valid range in exfat_clear_bitmap() and exfat_set_bitmap(). Link: https://syzkaller.appspot.com/bug?id=50381fc73821ecae743b8cf24b4c9a04776f767c Reported-by: syzbot+a4087e40b9c13aad7892@syzkaller.appspotmail.com Fixes: 1e49a94cf707 ("exfat: add bitmap operations") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23exfat: reduce block requests when zeroing a clusterYuezhang Mo
If 'dirsync' is enabled, when zeroing a cluster, submitting sector by sector will generate many block requests, will cause the block device to not fully perform its performance. This commit makes the sectors in a cluster to be submitted in once, it will reduce the number of block requests. This will make the block device to give full play to its performance. Test create 1000 directories on SD card with: $ time (for ((i=0;i<1000;i++)); do mkdir dir${i}; done) Performance has been improved by more than 73% on imx6q-sabrelite. Cluster size Before After Improvement 64 KBytes 3m34.036s 0m56.052s 73.8% 128 KBytes 6m2.644s 1m13.354s 79.8% 256 KBytes 11m22.202s 1m39.451s 85.4% imx6q-sabrelite: - CPU: 792 MHz x4 - Memory: 1GB DDR3 - SD Card: SanDisk 8GB Class 4 Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Andy Wu <Andy.Wu@sony.com> Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23exfat: introduce mount option 'sys_tz'Chung-Chiang Cheng
EXFAT_TZ_VALID bit in {create,modify,access}_tz is corresponding to OffsetValid field in exfat specification [1]. When this bit isn't set, timestamps should be treated as having the same UTC offset as the current local time. Currently, there is an option 'time_offset' for users to specify the UTC offset for this issue. This patch introduces a new mount option 'sys_tz' to use system timezone as time offset. Link: [1] https://docs.microsoft.com/en-us/windows/win32/fileio/exfat-specification#74102-offsetvalid-field Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-23exfat: fix referencing wrong parent directory information after renamingYuezhang Mo
During renaming, the parent directory information maybe updated. But the file/directory still references to the old parent directory information. This bug will cause 2 problems. (1) The renamed file can not be written. [10768.175172] exFAT-fs (sda1): error, failed to bmap (inode : 7afd50e4 iblock : 0, err : -5) [10768.184285] exFAT-fs (sda1): Filesystem has been set read-only ash: write error: Input/output error (2) Some dentries of the renamed file/directory are not set to deleted after removing the file/directory. exfat_update_parent_info() is a workaround for the wrong parent directory information being used after renaming. Now that bug is fixed, this is no longer needed, so remove it. Fixes: 5f2aa075070c ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Andy Wu <Andy.Wu@sony.com> Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com> Reviewed-by: Daniel Palmer <daniel.palmer@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-05-22afs: Adjust ACK interpretation to try and cope with NATDavid Howells
If a client's address changes, say if it is NAT'd, this can disrupt an in progress operation. For most operations, this is not much of a problem, but StoreData can be different as some servers modify the target file as the data comes in, so if a store request is disrupted, the file can get corrupted on the server. The problem is that the server doesn't recognise packets that come after the change of address as belonging to the original client and will bounce them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if the packet number falls within the initial sequence number window of a call or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting it. In both cases, firstPacket will be 1 and previousPacket will be 0 in the ACK information. Fix this by the following means: (1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1 and previousPacket as 0, assume this indicates that the server saw the incoming packets from a different peer and thus as a different call. Fail the call with error -ENETRESET. (2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the first packet has been hard-ACK'd. If it hasn't been hard-ACK'd, the ACK packet will cause it to get retransmitted, so the call will just be repeated. (3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of the operation. (4) Prioritise the error code over things like -ECONNRESET as the server did actually respond. (5) Make writeback treat -ENETRESET as a retryable error and make it redirty all the pages involved in a write so that the VM will retry. Note that there is still a circumstance that I can't easily deal with: if the operation is fully received and processed by the server, but the reply is lost due to address change. There's no way to know if the op happened. We can examine the server, but a conflicting change could have been made by a third party - and we can't tell the difference. In such a case, a message like: kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a) will be logged to dmesg on the next op to touch the file and the client will reset the inode state, including invalidating clean parts of the pagecache. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc, afs: Fix selection of abort codesDavid Howells
The RX_USER_ABORT code should really only be used to indicate that the user of the rxrpc service (ie. userspace) implicitly caused a call to be aborted - for instance if the AF_RXRPC socket is closed whilst the call was in progress. (The user may also explicitly abort a call and specify the abort code to use). Change some of the points of generation to use other abort codes instead: (1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see ENOMEM and EFAULT during received data delivery and abort with RX_CALL_DEAD in the default case. (2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a reply. (3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't. (4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not completed successfully or been aborted. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix locking issueDavid Howells
There's a locking issue with the per-netns list of calls in rxrpc. The pieces of code that add and remove a call from the list use write_lock() and the calls procfile uses read_lock() to access it. However, the timer callback function may trigger a removal by trying to queue a call for processing and finding that it's already queued - at which point it has a spare refcount that it has to do something with. Unfortunately, if it puts the call and this reduces the refcount to 0, the call will be removed from the list. Unfortunately, since the _bh variants of the locking functions aren't used, this can deadlock. ================================ WARNING: inconsistent lock state 5.18.0-rc3-build4+ #10 Not tainted -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes: ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b {SOFTIRQ-ON-W} state was registered at: ... Possible unsafe locking scenario: CPU0 ---- lock(&rxnet->call_lock); <Interrupt> lock(&rxnet->call_lock); *** DEADLOCK *** 1 lock held by ksoftirqd/2/25: #0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d Changes ======= ver #2) - Changed to using list_next_rcu() rather than rcu_dereference() directly. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22afs: Fix afs_getattr() to refetch file status if callback break occurredDavid Howells
If a callback break occurs (change notification), afs_getattr() needs to issue an FS.FetchStatus RPC operation to update the status of the file being examined by the stat-family of system calls. Fix afs_getattr() to do this if AFS_VNODE_CB_PROMISED has been cleared on a vnode by a callback break. Skip this if AT_STATX_DONT_SYNC is set. This can be tested by appending to a file on one AFS client and then using "stat -L" to examine its length on a machine running kafs. This can also be watched through tracing on the kafs machine. The callback break is seen: kworker/1:1-46 [001] ..... 978.910812: afs_cb_call: c=0000005f YFSCB.CallBack kworker/1:1-46 [001] ...1. 978.910829: afs_cb_break: 100058:23b4c:242d2c2 b=2 s=1 break-cb kworker/1:1-46 [001] ..... 978.911062: afs_call_done: c=0000005f ret=0 ab=0 [0000000082994ead] And then the stat command generated no traffic if unpatched, but with this change a call to fetch the status can be observed: stat-4471 [000] ..... 986.744122: afs_make_fs_call: c=000000ab 100058:023b4c:242d2c2 YFS.FetchStatus stat-4471 [000] ..... 986.745578: afs_call_done: c=000000ab ret=0 ab=0 [0000000087fc8c84] Fixes: 08e0e7c82eea ("[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC.") Reported-by: Markus Suvanto <markus.suvanto@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Tested-by: Markus Suvanto <markus.suvanto@gmail.com> Tested-by: kafs-testing+fedora34_64checkkafs-build-496@auristor.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=216010 Link: https://lore.kernel.org/r/165308359800.162686.14122417881564420962.stgit@warthog.procyon.org.uk/ # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>