Age | Commit message | Author
2025-03-04 | usb: dwc3: Set SUSPENDENABLE soon after phy init | Thinh Nguyen
After phy initialization, some phy operations can only be executed while in lower P states. Ensure GUSB3PIPECTL.SUSPENDENABLE and GUSB2PHYCFG.SUSPHY are set soon after initialization to avoid blocking phy ops. Previously the SUSPENDENABLE bits were only set after the controller initialization, which may not happen right away if there's no gadget driver or xhci driver bound. Revise this to clear SUSPENDENABLE bits only when there's mode switching (change in GCTL.PRTCAPDIR). Fixes: 6d735722063a ("usb: dwc3: core: Prevent phy suspend during init") Cc: stable <stable@kernel.org> Signed-off-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com> Link: https://lore.kernel.org/r/633aef0afee7d56d2316f7cc3e1b2a6d518a8cc9.1738280911.git.Thinh.Nguyen@synopsys.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-03 | sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl() | Andrea Righi
If a BPF scheduler provides an invalid CPU (outside the nr_cpu_ids range) as prev_cpu to scx_bpf_select_cpu_dfl() it can cause a kernel crash. To prevent this, validate prev_cpu in scx_bpf_select_cpu_dfl() and trigger an scx error if an invalid CPU is specified. Fixes: f0e1a0643a59b ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
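The range check described above can be sketched in plain C. This is only an illustration under stated assumptions: `DEMO_NR_CPU_IDS`, `demo_cpu_valid()` and `demo_select_cpu()` are hypothetical names, not kernel symbols, and the sketch returns -1 where the commit message says the real code triggers an scx error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's nr_cpu_ids. */
#define DEMO_NR_CPU_IDS 8

/* True when cpu is a usable index in [0, DEMO_NR_CPU_IDS). */
bool demo_cpu_valid(int32_t cpu)
{
    return cpu >= 0 && cpu < DEMO_NR_CPU_IDS;
}

/*
 * Hypothetical selection helper: reject a bad prev_cpu before it is
 * used as an index. Returns the chosen CPU, or -1 on invalid input.
 */
int32_t demo_select_cpu(int32_t prev_cpu)
{
    if (!demo_cpu_valid(prev_cpu))
        return -1; /* assumption: stands in for raising an scx error */

    /* Invariant: prev_cpu is a safe index from here on. */
    assert(prev_cpu >= 0 && prev_cpu < DEMO_NR_CPU_IDS);
    return prev_cpu; /* trivial policy: stay on the previous CPU */
}
```

With these definitions, `demo_select_cpu(3)` returns 3, while `demo_select_cpu(8)` and `demo_select_cpu(-1)` are both rejected.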
2025-03-03 | Merge tag 'affs-6.14-rc5-tag' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull affs fixes from David Sterba: "Two fixes from Simon Tatham. They're real bugfixes for problems with OFS floppy disks created on linux and then read in the emulated Workbench environment" * tag 'affs-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: don't write overlarge OFS data block size fields affs: generate OFS sequence numbers starting at 1
2025-03-03 | Merge tag 'xfs-fixes-6.14-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux | Linus Torvalds
Pull xfs cleanups from Carlos Maiolino: "Just a few cleanups" * tag 'xfs-fixes-6.14-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove the XBF_STALE check from xfs_buf_rele_cached xfs: remove most in-flight buffer accounting xfs: decouple buffer readahead from the normal buffer read path xfs: reduce context switches for synchronous buffered I/O
2025-03-03 | Merge tag 'probes-fixes-v6.14-rc4' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe events fixes from Masami Hiramatsu: - probe-events: Remove unused MAX_ARG_BUF_LEN macro - it is not used - fprobe-events: Log error for exceeding the number of entry args. Since the max number of entry args is limited, it should be checked and rejected when the parser detects it. - tprobe-events: Reject invalid tracepoint name If a user specifies an invalid tracepoint name (e.g. including '/') then the new event is not defined correctly in the eventfs. - tprobe-events: Fix a memory leak when tprobe defined with $retval There is a memory leak if tprobe is defined with $retval. * tag 'probes-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro tracing: fprobe-events: Log error for exceeding the number of entry args tracing: tprobe-events: Reject invalid tracepoint name tracing: tprobe-events: Fix a memory leak when tprobe with $retval
2025-03-03 | KVM: selftests: Fix printf() format goof in SEV smoke test | Sean Christopherson
Print out the index of mismatching XSAVE bytes using unsigned decimal format. Some versions of clang complain about trying to print an integer as an unsigned char. x86/sev_smoke_test.c:55:51: error: format specifies type 'unsigned char' but the argument has type 'int' [-Werror,-Wformat] Fixes: 8c53183dbaa2 ("selftests: kvm: add test for transferring FPU state into VMSA") Link: https://lore.kernel.org/r/20250228233852.3855676-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
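As a rough illustration of the format fix, the sketch below prints an `int` index with `%u` and an explicit cast instead of a `char`-sized conversion. The function name and message layout are hypothetical and are not taken from the selftest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a byte-mismatch report into buf.
 * The index is an int, so it is printed as unsigned decimal via %u
 * with an explicit cast, which keeps -Wformat quiet.
 */
int demo_format_mismatch(char *buf, size_t len, int index,
                         uint8_t want, uint8_t got)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "byte %u: want 0x%02x, got 0x%02x",
                    (unsigned int)index, want, got);
}
```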
2025-03-03 | KVM: selftests: Ensure all vCPUs hit -EFAULT during initial RO stage | Sean Christopherson
During the initial mprotect(RO) stage of mmu_stress_test, keep vCPUs spinning until all vCPUs have hit -EFAULT, i.e. until all vCPUs have tried to write to a read-only page. If a vCPU manages to complete an entire iteration of the loop without hitting a read-only page, *and* the vCPU observes mprotect_ro_done before starting a second iteration, then the vCPU will prematurely fall through to GUEST_SYNC(3) (on x86 and arm64) and get out of sequence. Replace the "do-while (!r)" loop around the associated _vcpu_run() with a single invocation, as barring a KVM bug, the vCPU is guaranteed to hit -EFAULT, and retrying on success is super confusing, hides KVM bugs, and complicates this fix. The do-while loop was semi-unintentionally added specifically to fudge around a KVM x86 bug, and said bug is unhittable without modifying the test to force x86 down the !(x86||arm64) path. On x86, if forced emulation is enabled, vcpu_arch_put_guest() may trigger emulation of the store to memory. Due to a (very, very) longstanding bug in KVM x86's emulator, emulated writes to guest memory that fail during __kvm_write_guest_page() unconditionally return KVM_EXIT_MMIO. While that is desirable in the !memslot case, it's wrong in this case as the failure happens due to __copy_to_user() hitting a read-only page, not an emulated MMIO region. But as above, x86 only uses vcpu_arch_put_guest() if the __x86_64__ guards are clobbered to force x86 down the common path, and of course the unexpected MMIO is a KVM bug, i.e. *should* cause a test failure. 
Fixes: b6c304aec648 ("KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)") Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/20250208105318.16861-1-yan.y.zhao@intel.com Debugged-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250228230804.3845860-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 | KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3 | Sean Christopherson
Never rely on the CPU to restore/load host DR0..DR3 values, even if the CPU supports DebugSwap, as there are no guarantees that SNP guests will actually enable DebugSwap on APs. E.g. if KVM were to rely on the CPU to load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would run with clobbered-to-zero DRs if an SNP guest created APs without DebugSwap enabled. Update the comment to explain the dangers, and hopefully prevent breaking KVM in the future. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 | KVM: SVM: Save host DR masks on CPUs with DebugSwap | Sean Christopherson
When running SEV-SNP guests on a CPU that supports DebugSwap, always save the host's DR0..DR3 mask MSR values irrespective of whether or not DebugSwap is enabled, to ensure the host values aren't clobbered by the CPU. And for now, also save DR0..DR3, even though doing so isn't necessary (see below). SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to create a VMSA with guest-controlled SEV_FEATURES. A well behaved guest can inform the hypervisor, i.e. KVM, of its "requested" features, but on CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from lying about which SEV features are being enabled (or not!). If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the MSRs with '0' if KVM doesn't save its desired value. Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM restores all DRs in common x86 code as needed via hw_breakpoint_restore(). I.e. there is no risk of host DR0..DR3 being clobbered (when it matters). However, there is a flaw in the opposite direction; because the guest can lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's knowledge, KVM must not rely on the CPU to restore DRs. Defer fixing that wart, as it's more of a documentation issue than a bug in the code. Note, KVM added support for DebugSwap on commit d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes, as the underlying flaw exists in hardware, not in KVM. I.e. all kernels that support SEV-SNP need to be patched, not just kernels with KVM's full support for DebugSwap (ignoring that DebugSwap support landed first). Opportunistically fix an incorrect statement in the comment; on CPUs without DebugSwap, the CPU does NOT save or load debug registers, i.e. 
Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Cc: Naveen N Rao <naveen@kernel.org> Cc: Kim Phillips <kim.phillips@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 | xfs: export max_open_zones in sysfs | Christoph Hellwig
Add a zoned group with an attribute for the maximum number of open zones. This allows querying the open zones for data placement tests, or also for placement aware applications that are in control of the entire file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: contain more sysfs code in xfs_sysfs.c | Christoph Hellwig
Extend the error sysfs initialization helper to include the neighbouring attributes as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: export zone stats in /proc/*/mountstats | Hans Holmberg
Add the per-zone life time hint and the used block distribution for fully written zones, grouping reclaimable zones in fixed-percentage buckets spanning 0..9%, 10..19% and full zones as 100% used as well as a few statistics about the zone allocator and open and reclaimable zones in /proc/*/mountstats. This gives good insight into data fragmentation and data placement success rate. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
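The bucketing described above (0..9%, 10..19%, and so on, with full zones counted separately as 100%) can be sketched as follows. `demo_used_bucket()` and `DEMO_NR_BUCKETS` are hypothetical names, and returning the value 10 for a full zone is an assumption made for this illustration, not the XFS implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Ten buckets of 10% each; a full zone gets its own slot (assumption). */
#define DEMO_NR_BUCKETS 10

/*
 * Map a zone's used/total block counts to a bucket index:
 * 0 for 0..9% used, 1 for 10..19%, ... 9 for 90..99%, and
 * DEMO_NR_BUCKETS for a completely full zone.
 * Note: used * DEMO_NR_BUCKETS is assumed not to overflow 64 bits.
 */
int demo_used_bucket(uint64_t used, uint64_t total)
{
    assert(total > 0 && used <= total);
    if (used == total)
        return DEMO_NR_BUCKETS;
    return (int)(used * DEMO_NR_BUCKETS / total);
}
```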
2025-03-03 | xfs: wire up the show_stats super operation | Christoph Hellwig
The show_stats option allows a file system to dump plain text statistics on a per-mount basis into /proc/*/mountstats. Wire up a no-op version which will grow useful information for zoned file systems later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: support write life time based data placement | Hans Holmberg
Add a file write life time data placement allocation scheme that aims to minimize fragmentation and thereby to do two things: a) separate file data to different zones when possible. b) colocate file data of similar life times when feasible. To get best results, average file sizes should align with the zone capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl. This improvement in data placement efficiency reduces the number of blocks requiring relocation by GC, and thus decreases overall write amplification. The impact on performance varies depending on how full the file system is. For RocksDB using leveled compaction, the lifetime hints can improve throughput for overwrite workloads at 80% file system utilization by ~10%, but for lower file system utilization there won't be as much benefit in application performance as there is less need for garbage collection to start with. Lifetime hints can be disabled using the nolifetime mount option. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: add a max_open_zones mount option | Christoph Hellwig
Allow limiting the number of open zones used below that exported by the device. This is required to tune the number of write streams when zoned RT devices are used on conventional devices, and can be useful on zoned devices that support a very large number of open zones. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: support zone gaps | Christoph Hellwig
Zoned devices can have gaps between the end of a zone's usable capacity and the end of the zone in the LBA/daddr address space. In other words, the hardware equivalent to the RT groups already takes care of the power of 2 alignment for us. In this case the sparse FSB/RTB address space maps 1:1 to the device address space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: enable the zoned RT device feature | Christoph Hellwig
Enable the zoned RT device directory feature. With this feature, RT groups are written sequentially and always emptied before rewriting the blocks. This perfectly maps to zoned devices, but can also be used on conventional block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: disable rt quotas for zoned file systems | Christoph Hellwig
They'll need a little more work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: disable reflink for zoned file systems | Christoph Hellwig
While the zoned on-disk format supports reflinks, the GC code currently always unshares reflinks when moving blocks to new zones, thus making the feature unusable. Disable reflinks until the GC code is refcount aware. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: enable fsmap reporting for internal RT devices | Christoph Hellwig
File systems with internal RT devices are a bit odd in that we need to report AGs and RGs. To make this happen use separate synthetic fmr_device values for the different sections instead of the dev_t mapping used by other XFS configurations. The data device is reported as file system metadata before the start of the RGs for the synthetic RT fmr_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: support xrep_require_rtext_inuse on zoned file systems | Christoph Hellwig
Space usage is tracked by the rmap, which already is separately cross-referenced. But on top of that we have the write pointer and can do a basic sanity check here that the block is not beyond the write pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: support xchk_xref_is_used_rt_space on zoned file systems | Christoph Hellwig
Space usage is tracked by the rmap, which already is separately cross-referenced. But on top of that we have the write pointer and can do a basic sanity check here that the block is not beyond the write pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: allow COW forks on zoned file systems in xchk_bmap | Christoph Hellwig
Zoned file systems can have COW forks even without reflinks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: support growfs on zoned file systems | Christoph Hellwig
Replace the inner loop growing one RT bitmap block at a time with one just modifying the superblock counters for growing an entire zone (aka RTG). The big restriction is that, just like at mkfs time, only an RT extent size of a single FSB is allowed, and the file system capacity needs to be aligned to the zone size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: hide reserved RT blocks from statfs | Christoph Hellwig
File systems with a zoned RT device have a large number of reserved blocks that are required for garbage collection, and which can't be filled with user data. Exclude them from the available blocks reported through stat(v)fs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item | Christoph Hellwig
Make xfs_rtextent_free_finish_item call into the zoned allocator to free blocks on zoned RT devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: implement direct writes to zoned RT devices | Christoph Hellwig
Direct writes to zoned RT devices are extremely simple. After taking the block reservation before acquiring the iolock, the iomap direct I/O calls into ->iomap_begin which will return a "fake" iomap for the entire requested range. The actual block allocation is then done from the submit_io handler using code shared with the buffered I/O path. The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize the embedded ioend, which allows reusing the existing ioend based buffered I/O completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: implement buffered writes to zoned RT devices | Christoph Hellwig
Implement buffered writes including page faults and block zeroing for zoned RT devices. Buffered writes to zoned RT devices are split into three phases: 1) a reservation for the worst case data block usage is taken before acquiring the iolock. When there are enough free blocks but not enough available ones, garbage collection is kicked off to free the space before continuing with the write. If there isn't enough freeable space, the block reservation is reduced and a short write will happen as expected by normal Linux write semantics. 2) with the iolock held, the generic iomap buffered write code is called, which through the iomap_begin operation usually just inserts delalloc extents for the range in a single iteration. Only for overwrites of existing data that are not block aligned, or for zeroing operations, is the existing extent mapping read to fill out the srcmap and to figure out if zeroing is required. 3) the ->map_blocks callback to the generic iomap writeback code calls into the zoned space allocator to actually allocate on-disk space for the range before kicking off the writeback. Note that because all writes are out of place, truncate or hole punches that are not aligned to block size boundaries need to allocate space. For block zeroing from truncate, ->setattr is called with the iolock (aka i_rwsem) already held, so a hacky deviation from the above scheme is needed. In this case the space reservation is taken with the iolock held, but is required not to block and can dip into the reserved block pool. This can lead to -ENOSPC when truncating a file, which is unfortunate. But fixing the calling conventions in the VFS is probably much easier with code requiring it already in mainline. Similarly because all writes are out of place, the zoned allocator can't support unwritten extents and thus the FALLOC_FL_ALLOCATE_RANGE range mode of fallocate. 
Other fallocate modes that would reserve space but don't need to in order to provide proper semantics do work but do not reserve space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: implement zoned garbage collection | Christoph Hellwig
RT groups on a zoned file system need to be completely empty before their space can be reused. This means that partially empty groups need to be emptied entirely to free up space if no entirely free groups are available. Add a garbage collection thread that moves all data out of the least used zone when not enough free zones are available, and which resets all zones that have been emptied. To find zones to empty, a simple set of 10 buckets based on the amount of space used in the zone is used. To empty zones, the rmap is walked to find the owners and the data is read and then written to the new place. To automatically defragment files the rmap records are sorted by inode and logical offset. This means defragmentation of parallel writes into a single zone happens automatically when performing garbage collection. Because holding the iolock over the entire GC cycle would inject very noticeable latency for other accesses to the inodes, the iolock is not taken while performing I/O. Instead the I/O completion handler checks that the mapping hasn't changed from the one recorded at the start of the GC cycle and doesn't update the mapping if it changed. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
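Two ideas from this message lend themselves to a small sketch: picking the least used zone as the GC victim, and sorting reverse-mapping records by inode and logical offset so that relocated data ends up defragmented. Everything below is hypothetical: `struct demo_rmap_rec` is a simplified stand-in, and the victim search is a plain linear scan rather than the 10-bucket scheme the commit describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified reverse-mapping record (not the XFS structure). */
struct demo_rmap_rec {
    uint64_t owner;  /* inode number that owns the extent */
    uint64_t offset; /* logical file offset of the extent */
    uint64_t len;    /* extent length in blocks */
};

/* Order records by owner first, then by logical offset within the owner. */
static int demo_rmap_cmp(const void *a, const void *b)
{
    const struct demo_rmap_rec *ra = a;
    const struct demo_rmap_rec *rb = b;

    if (ra->owner != rb->owner)
        return ra->owner < rb->owner ? -1 : 1;
    if (ra->offset != rb->offset)
        return ra->offset < rb->offset ? -1 : 1;
    return 0;
}

/* Sort the records so each file's extents are relocated in file order. */
void demo_sort_for_gc(struct demo_rmap_rec *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    if (n > 1)
        qsort(recs, n, sizeof(*recs), demo_rmap_cmp);
}

/*
 * Return the index of the least used zone that still holds data, or -1
 * if there is none. Zones with no used blocks are skipped because they
 * only need a reset, not data movement.
 */
int demo_pick_victim(const uint64_t *used, size_t nr_zones)
{
    int victim = -1;

    for (size_t i = 0; i < nr_zones; i++) {
        if (used[i] == 0)
            continue;
        if (victim < 0 || used[i] < used[(size_t)victim])
            victim = (int)i;
    }
    return victim;
}
```

A bucketed lookup trades the exact minimum for a cheaper search; the linear scan is used here only to keep the sketch short.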
2025-03-03 | xfs: add support for zoned space reservations | Christoph Hellwig
For zoned file systems garbage collection (GC) has to take the iolock and mmaplock after moving data to a new place to synchronize with readers. This means waiting for garbage collection with the iolock can deadlock. To avoid this, the worst case required blocks have to be reserved before taking the iolock, which is done using a new RTAVAILABLE counter that tracks blocks that are free to write into and don't require garbage collection. The new helpers try to take these available blocks, and if there aren't enough available it wakes and waits for GC. This is done using a list of on-stack reservations to ensure fairness. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
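The "reserve before taking the lock" idea can be illustrated with a single-threaded sketch. The names below are hypothetical; the real code uses a per-mount counter, sleeps while waiting for GC, and keeps a list of on-stack reservations for fairness, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter of blocks that can be written without GC. */
struct demo_space {
    uint64_t available;
};

/*
 * Try to take `want` blocks from the available pool. Returns true on
 * success. On failure nothing is taken, and the caller is expected to
 * wake GC and retry, all before acquiring the iolock.
 */
bool demo_try_reserve(struct demo_space *sp, uint64_t want)
{
    assert(sp != NULL);
    if (want > sp->available)
        return false;
    sp->available -= want;
    return true;
}

/* Called when GC has emptied zones and made `freed` blocks writable. */
void demo_gc_freed(struct demo_space *sp, uint64_t freed)
{
    assert(sp != NULL);
    sp->available += freed;
}
```

Starting from 10 available blocks, reserving 4 succeeds and leaves 6; a following request for 7 fails until `demo_gc_freed()` adds more space.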
2025-03-03 | xfs: add the zoned space allocator | Christoph Hellwig
For zoned RT devices space is always allocated at the write pointer, that is right after the last written block and only recorded on I/O completion. That makes the actual allocation algorithm very simple: it just involves picking a good zone, preferably the one used for the last write to the inode. As the number of zones that can be written at the same time is usually limited by the hardware, selecting a zone is done as late as possible from the iomap dio and buffered writeback bio submission helpers just before submitting the bio. Given that the writers already took a reservation before acquiring the iolock, space will always be readily available if an open zone slot is available. A new structure is used to track these open zones, and pointed to by the xfs_rtgroup. Because zoned file systems don't have a rsum cache the space for that pointer can be reused. Allocations are only recorded at I/O completion time. The scheme used for that is very similar to the reflink COW end I/O path. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
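Allocation at the write pointer can be sketched with a hypothetical `struct demo_zone`. Unlike the scheme described above, this sketch advances the pointer immediately rather than recording the allocation at I/O completion, and it ignores open-zone limits and zone selection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical zone model: blocks [0, write_pointer) have been written. */
struct demo_zone {
    uint64_t capacity;      /* usable blocks in the zone */
    uint64_t write_pointer; /* next block to be written */
};

/*
 * Allocate up to `count` blocks at the write pointer. Stores the first
 * allocated block in *start and returns how many blocks were granted,
 * which is less than `count` when the zone fills up.
 */
uint64_t demo_zone_alloc(struct demo_zone *z, uint64_t count, uint64_t *start)
{
    assert(z->write_pointer <= z->capacity);

    uint64_t left = z->capacity - z->write_pointer;
    uint64_t got = count < left ? count : left;

    *start = z->write_pointer;
    z->write_pointer += got;
    return got;
}

/* Sanity check: a block in use must lie below the write pointer. */
bool demo_block_written(const struct demo_zone *z, uint64_t blk)
{
    return blk < z->write_pointer;
}
```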
2025-03-03 | xfs: parse and validate hardware zone information | Christoph Hellwig
Add support to validate and parse reported hardware zone state. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file | Christoph Hellwig
The zoned allocator never performs speculative preallocations, so don't bother queueing up zoned inodes here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes | Christoph Hellwig
Zoned file systems require out of place writes and thus can't support post-EOF speculative preallocations. Avoid the pointless ilock critical section to find out that none can be freed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: disable FITRIM for zoned RT devices | Christoph Hellwig
The zoned allocator unconditionally issues zone resets or discards after emptying an entire zone, so supporting FITRIM for a zoned RT device is not useful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: disable sb_frextents for zoned file systems | Christoph Hellwig
Zoned file systems not only don't use the global frextents counter, but for them the in-memory percpu counter also includes reservations taken before even allocating delalloc extent records, so it will never match the per-zone used information. Disable all updates and verification of the sb counter for zoned file systems as it isn't useful for them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: export zoned geometry via XFS_FSOP_GEOM | Christoph Hellwig
Export the zoned geometry information so that userspace can query it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: allow internal RT devices for zoned mode | Christoph Hellwig
Allow creating an RT subvolume on the same device as the main data device. This is mostly used for SMR HDDs where the conventional zones are used for the data device and the sequential write required zones for the zoned RT section. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: define the zoned on-disk format | Christoph Hellwig
Zoned file systems reuse the basic RT group enabled XFS file system structure to support a mode where each RT group is always written from start to end and then reset for reuse (after moving out any remaining data). There are a few minor but important changes, which are indicated by a new incompat flag: 1) there are no bitmap and summary inodes, thus the /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the sb_rbmblocks superblock field must be cleared to zero. 2) there is a new superblock field that specifies the start of an internal RT section. This allows supporting SMR HDDs that have random writable space at the beginning which is used for the XFS data device (which really is the metadata device for this configuration), directly followed by a RT device on the same block device. While something similar could be achieved using dm-linear just having a single device directly consumed by XFS makes handling the file systems a lot easier. 3) Another superblock field that tracks the amount of reserved space (or overprovisioning) that is never used for user capacity, but allows GC to run more smoothly. 4) an overlay of the cowextsize field for the rtrmap inode so that we can persistently track the total amount of rtblocks currently used in a RT group. There is no data structure other than the rmap that tracks used space in an RT group, and this counter is used to decide when a RT group has been entirely emptied, and to select one that is relatively empty if garbage collection needs to be performed. While this counter could be tracked entirely in memory and rebuilt from the rmap at mount time, that would lead to very long mount times with the large number of RT groups implied by the number of hardware zones especially on SMR hard drives with 256MB zone sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: add a xfs_rtrmap_highest_rgbno helper | Christoph Hellwig
Add a helper to find the last offset mapped in the rtrmap. This will be used by the zoned code to find out where to start writing again on conventional devices without hardware zone support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay | Christoph Hellwig
The zone allocator wants to be able to remove a delalloc mapping in the COW fork while keeping the block reservation. To support that pass the flags argument down to xfs_bmap_del_extent_delay and support the XFS_BMAPI_REMAP flag to keep the reservation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write | Christoph Hellwig
For always COW inodes we also must check the alignment of each individual iovec segment, as they could end up with different I/Os due to the way bio_iov_iter_get_pages works, and we'd then overwrite an already written block. The existing always_cow sysctl based code doesn't catch this because nothing enforces that blocks aren't rewritten, but for zoned XFS on sequential write required zones this is a hard error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
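A userspace sketch of the per-segment check might look like the following. `demo_dio_unaligned()` is a hypothetical helper that tests only the file position and each segment length against the block size. The point it shows is that a total length that is block aligned does not imply that each segment is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical check: return true if a direct write must be treated as
 * unaligned. blocksize is assumed to be a power of two. Every segment
 * length is checked individually, not just the total.
 */
bool demo_dio_unaligned(const struct iovec *iov, int iovcnt,
                        uint64_t pos, size_t blocksize)
{
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

    size_t mask = blocksize - 1;

    if (pos & mask)
        return true;
    for (int i = 0; i < iovcnt; i++) {
        if (iov[i].iov_len & mask)
            return true;
    }
    return false;
}
```

Two segments of 512 and 3584 bytes add up to one 4096-byte block, yet the helper reports them as unaligned because each segment could be submitted as a separate I/O.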
2025-03-03 | xfs: skip always_cow inodes in xfs_reflink_trim_around_shared | Christoph Hellwig
xfs_reflink_trim_around_shared tries to find shared blocks in the refcount btree. Always_cow inodes don't have that tree, so don't bother. For the existing always_cow code this is a minor optimization. For the upcoming zoned code that can do COW without the rtreflink code it avoids triggering a NULL pointer dereference. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c | Christoph Hellwig
Delalloc reservations are not supported in userspace, and thus it doesn't make sense to share this helper with xfsprogs. Move it to xfs_iomap.c next to its two callers. Note that the rest of the delalloc handling should probably eventually also move out of xfs_bmap.c, but that will require a bit more surgery. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: add a rtg_blocks helper | Christoph Hellwig
Shortcut dereferencing the xg_block_count field in the generic group structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: factor out a xfs_rt_check_size helper | Christoph Hellwig
Add a helper to check that the last block of a RT device is readable to share the code between mount and growfs. This also adds the mount time overflow check to growfs and improves the error messages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: reduce metafile reservations | Christoph Hellwig
There is no point in reserving more space than actually available on the data device for the worst case scenario that is unlikely to happen. Reserve at most 1/4th of the data device blocks, which is still a heuristic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
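The cap amounts to a simple clamp, sketched here with a hypothetical helper name and plain block counts; the real reservation code works with more state than this.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: limit a worst-case metafile reservation to at
 * most a quarter of the data device blocks.
 */
uint64_t demo_cap_metafile_resv(uint64_t worst_case, uint64_t dblocks)
{
    uint64_t cap = dblocks / 4;

    assert(cap <= dblocks);
    return worst_case < cap ? worst_case : cap;
}
```

For a data device of 1000 blocks, a worst case of 100 blocks is kept as is, while a worst case of 500 blocks is reduced to 250.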
2025-03-03 | xfs: make metabtree reservations global | Christoph Hellwig
Currently each metabtree inode has its own space reservation to ensure it can be expanded to the maximum size, mirroring what is done for the AG-based btrees. But unlike the AG-based btrees the metabtree inodes aren't restricted to allocate from a single AG but can use free space from the entire file system. And unlike AG-based btrees where the required reservation shrinks with the available free space due to this, the metabtree reservations for the rtrmap and rtreflink trees are not bound in any way by the data device free space as they track RT extent allocations. This is not very efficient as it requires a large number of blocks to be set aside that can't be used at all by other btrees. Switch to a model that uses a global pool instead in preparation for reducing the amount of reserved space, which now also removes the overloading of the i_nblocks field for metabtree inodes, which would create problems if metabtree inodes ever had a big enough xattr fork to require xattr blocks outside the inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: fixup the metabtree reservation in xrep_reap_metadir_fsblocks | Christoph Hellwig
All callers of xrep_reap_metadir_fsblocks need to fix up the metabtree reservation, otherwise they'd leave the reservations in an incoherent state. Move the call to xrep_reset_metafile_resv into xrep_reap_metadir_fsblocks so it always is taken care of, and remove now superfluous helper functions in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03 | xfs: trace in-memory freecounter reservations | Christoph Hellwig
Add two tracepoints when the freecounter dips into the reserved pool and when it is entirely out of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>