summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-05-31bcachefs: bch2_dev_journal_bucket_delete()Kent Overstreet
Recover from "journal and btree in same bucket". Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31bcachefs: Runtime self healing for keys for deleted snapshotsKent Overstreet
If snapshot deletion incorrectly missing some keys and leaves keys for deleted snapshots, that causes a bit of a problem for data move - we can't move an extent for a nonexistent snapshot, because the extent might have to be fragmented, and maintaining correct visibility in child snapshots doesn't work if it doesn't have a snapshot. Previously we'd just skip these keys, but it turns out that causes copygc to spin. So we need runtime self healing, i.e. calling check_key_has_snapshot() from the data move path. Snapshot deletion v2 included sentinal values for deleted snapshot nodes, so this is quite safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31bcachefs: Don't unlock trans before data_update_init()Kent Overstreet
data_update_init() does need to do btree operations, delay doing the unlock-before-io. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31bcachefs: Use bch2_err_matches() for BCH_ERR_fsck_(fix|ignore)Kent Overstreet
We'll be adding subtypes of these errors, and new error code tracing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-31Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-30Merge tag 'pull-automount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull automount updates from Al Viro: "Automount wart removal A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attched mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount. Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances. vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives" * tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill vfs_submount() saner calling conventions for ->d_automount()
2025-05-30Merge tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull UFS updates from Al Viro: "The bulk of this is Eric's conversion of UFS to new mount API, with a bit of cleanups from me. I hoped to get stricter sanity checks on superblock flags into that pile, but... next cycle, hopefully" * tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ufs: convert ufs to the new mount API ufs: reject multiple conflicting -o ufstype=... on mount ufs: split ->s_mount_opt - don't mix flavour and on-error
2025-05-30Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull mount propagation fix from Al Viro: "6.15 allowed mount propagation to destinations in detached trees; unfortunately, that breaks existing userland, so the old behaviour needs to be restored. It's not exactly a revert - the original behaviour had a bug, where existence of detached tree might disrupt propagation between locations not in detached trees. Thankfully, userland did not depend upon that bug, so we want to keep the fix" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't propagate mounts into detached trees
2025-05-30gfs2: Don't clear sb->s_fs_info in gfs2_sys_fs_addAndrew Price
When gfs2_sys_fs_add() fails, it sets sb->s_fs_info to NULL on its error path (see commit 0d515210b696 ("GFS2: Add kobject release method")). The intention seems to be to prevent dereferencing sb->s_fs_info once the object pointed to has been deallocated, but that would be better achieved by setting the pointer to NULL in free_sbd(). As a consequence, when the call to gfs2_sys_fs_add() fails in gfs2_fill_super(), sdp = GFS2_SB(inode) will evaluate to NULL in iput() -> gfs2_drop_inode(), and accessing sdp->sd_flags will be a NULL pointer dereference. Fix that by only setting sb->s_fs_info to NULL when actually freeing the object pointed to in free_sbd(). Fixes: ae9f3bd8259a ("gfs2: replace sd_aspace with sd_inode") Reported-by: syzbot+b12826218502df019f9d@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-05-30Merge tag 'f2fs-for-6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, Matthew converted most of page operations to using folio. Beyond the work, we've applied some performance tunings such as GC and linear lookup, in addition to enhancing fault injection and sanity checks. Enhancements: - large number of folio conversions - add a control to turn on/off the linear lookup for performance - tune GC logics for zoned block device - improve fault injection and sanity checks Bug fixes: - handle error cases of memory donation - fix to correct check conditions in f2fs_cross_rename - fix to skip f2fs_balance_fs() if checkpoint is disabled - don't over-report free space or inodes in statvfs - prevent the current section from being selected as a victim during GC - fix to calculate first_zoned_segno correctly - fix to avoid inconsistence between SIT and SSA for zoned block device As usual, there are several debugging patches and clean-ups as well" * tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (195 commits) f2fs: fix to correct check conditions in f2fs_cross_rename f2fs: use d_inode(dentry) cleanup dentry->d_inode f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabled f2fs: clean up to check bi_status w/ BLK_STS_OK f2fs: introduce is_{meta,node}_folio f2fs: add ckpt_valid_blocks to the section entry f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode. f2fs: introduce FAULT_VMALLOC f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctx f2fs: add f2fs_bug_on() in f2fs_quota_read() f2fs: add f2fs_bug_on() to detect potential bug f2fs: remove unused sbi argument from checksum functions f2fs: fix 32-bits hexademical number in fault injection doc f2fs: don't over-report free space or inodes in statvfs f2fs: return bool from __write_node_folio f2fs: simplify return value handling in f2fs_fsync_node_pages f2fs: always unlock the page in f2fs_write_single_data_page f2fs: remove wbc->for_reclaim handling f2fs: return bool from __f2fs_write_meta_folio f2fs: fix to return correct error number in f2fs_sync_node_pages() ...
2025-05-30bcachefs: Mark bch_errcode helpers __attribute__((const))Kent Overstreet
These don't access global memory or defer pointer arguments - this enables CSE optimizations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Add missing printbuf_reset() in bch2_check_dirent_inode_dirent()Kent Overstreet
We were accidentally including the contents from the previous fsck_err(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: sysfs/errorsKent Overstreet
Make the superblock error counters available in sysfs; the only other way they can be seen is 'show-super', but we don't write the superblock every time the error count gets incremented. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: bch2_check_fix_ptrs() can now repair btree rootsKent Overstreet
This is straightforward enough: check_fix_ptrs() currently only runs before we go RW, so updating the btree root pointer in c->btree_roots suffices - it'll be written out in the first journal write we do. For that, do_bch2_trans_commit_to_journal_replay() now handles JSET_ENTRY_btree_root entries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Include b->ob.nr in cached_btree_node_to_text()Kent Overstreet
We have a bug report that looks like we might be leaking open buckets - let's check if they got left attached to the cached btree node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Move devs_sorted to alloc_requestKent Overstreet
More stack usage work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: reduce stack usage in alloc_sectors_start()Kent Overstreet
with typical config options, variables in different inline functions aren't sharing stack space - and these are slowpaths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: bch2_alloc_v4_to_text()Kent Overstreet
Specialize the .to_text() for alloc_v4, to avoid the temporary on the stack for conversion from old versions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Tweak bch2_data_update_init() for stack usageKent Overstreet
- Separate out a slowpath for bkey_nocow_lock() - Don't call bch2_bkey_ptrs_c() or loop over pointers more than necessary Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: kill replicas_sectors arg to __trigger_extent()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Don't stack allocate bch_writepage_stateKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: factor out break_cycle_fail()Kent Overstreet
More stack usage work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: btree_node_missing_err()Kent Overstreet
Factor out an error path for a small stack usage improvement. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Kill bkey_buf in btree_path_down()Kent Overstreet
Allocate some (smaller) temporary storage in btree_trans for this - btree_path_down() is in our max-stack call stack. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Add missing error logging in delete_dead_inodes()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Fix misaligned bucket check in journal space calculationsKent Overstreet
Fix an assertion pop in the tiering_misaligned test: rounding down to bucket size at the end of the journal space calculations leaves cur_entry_sectors == 0, which is incorrect with !cur_entry_err. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Fix incorrect multiple dev check in journal write pathKent Overstreet
It's uncomon to have multiple devices with journalling only on a subset, but can be specified with the 'data_allowed' option. We need to know if we're doing data/metadata writes to multiple devices, as that requires issuing flushes before the journal writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Catch data_update_done events in trace_io_move_start_failKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: io_move_evacuate_bucket tracepoint, counterKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: trace_io_move_predKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Fix infinite loop in journal_entry_btree_keys_to_text()Kent Overstreet
Fix an infinite loop when bkey_i->k.u64s is 0. This only happens in userspace, where 'bcachefs list_journal' can print the entire contents of the journal, and non-dirty entries aren't validated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Journal read error message improvementsKent Overstreet
- Don't print a checksum error when we first read a journal entry: we print a checksum error later if we'll be using the journal entry. - Continuing with the theme of of improving error messages and grouping errors into a single log message per error, print a single 'checksum error' message per journal entry, and use bch2_journal_ptr_to_text() to print out where on the device it was. - Factor out checksum error messages and checking for missing journal entries into helpers, bch2_journal_read() has gotten obnoxiously big. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30fs/dax: Fix "don't skip locked entries when scanning entries"Alistair Popple
Commit 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries") introduced a new function, wait_entry_unlocked_exclusive(), which waits for the current entry to become unlocked without advancing the XArray iterator state. Waiting for the entry to become unlocked requires dropping the XArray lock. This requires calling xas_pause() prior to dropping the lock which leaves the xas in a suitable state for the next iteration. However this has the side-effect of advancing the xas state to the next index. Normally this isn't an issue because xas_for_each() contains code to detect this state and thus avoid advancing the index a second time on the next loop iteration. However both callers of and wait_entry_unlocked_exclusive() itself subsequently use the xas state to reload the entry. As xas_pause() updated the state to the next index this will cause the current entry which is being waited on to be skipped. This caused the following warning to fire intermittently when running xftest generic/068 on an XFS filesystem with FS DAX enabled: [ 35.067397] ------------[ cut here ]------------ [ 35.068229] WARNING: CPU: 21 PID: 1640 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xd8/0x1e0 [ 35.069717] Modules linked in: nd_pmem dax_pmem nd_btt nd_e820 libnvdimm [ 35.071006] CPU: 21 UID: 0 PID: 1640 Comm: fstest Not tainted 6.15.0-rc7+ #77 PREEMPT(voluntary) [ 35.072613] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 35.074845] RIP: 0010:truncate_folio_batch_exceptionals+0xd8/0x1e0 [ 35.075962] Code: a1 00 00 00 f6 47 0d 20 0f 84 97 00 00 00 4c 63 e8 41 39 c4 7f 0b eb 61 49 83 c5 01 45 39 ec 7e 58 42 f68 [ 35.079522] RSP: 0018:ffffb04e426c7850 EFLAGS: 00010202 [ 35.080359] RAX: 0000000000000000 RBX: ffff9d21e3481908 RCX: ffffb04e426c77f4 [ 35.081477] RDX: ffffb04e426c79e8 RSI: ffffb04e426c79e0 RDI: ffff9d21e34816e8 [ 35.082590] RBP: ffffb04e426c79e0 R08: 0000000000000001 R09: 0000000000000003 [ 35.083733] R10: 0000000000000000 R11: 822b53c0f7a49868 R12: 000000000000001f [ 35.084850] R13: 0000000000000000 R14: ffffb04e426c78e8 R15: fffffffffffffffe [ 35.085953] FS: 00007f9134c87740(0000) GS:ffff9d22abba0000(0000) knlGS:0000000000000000 [ 35.087346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 35.088244] CR2: 00007f9134c86000 CR3: 000000040afff000 CR4: 00000000000006f0 [ 35.089354] Call Trace: [ 35.089749] <TASK> [ 35.090168] truncate_inode_pages_range+0xfc/0x4d0 [ 35.091078] truncate_pagecache+0x47/0x60 [ 35.091735] xfs_setattr_size+0xc7/0x3e0 [ 35.092648] xfs_vn_setattr+0x1ea/0x270 [ 35.093437] notify_change+0x1f4/0x510 [ 35.094219] ? do_truncate+0x97/0xe0 [ 35.094879] do_truncate+0x97/0xe0 [ 35.095640] path_openat+0xabd/0xca0 [ 35.096278] do_filp_open+0xd7/0x190 [ 35.096860] do_sys_openat2+0x8a/0xe0 [ 35.097459] __x64_sys_openat+0x6d/0xa0 [ 35.098076] do_syscall_64+0xbb/0x1d0 [ 35.098647] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 35.099444] RIP: 0033:0x7f9134d81fc1 [ 35.100033] Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 2a 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff5 [ 35.102993] RSP: 002b:00007ffcd41e0d10 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 [ 35.104263] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007f9134d81fc1 [ 35.105452] RDX: 0000000000000242 RSI: 00007ffcd41e1200 RDI: 00000000ffffff9c [ 35.106663] RBP: 00007ffcd41e1200 R08: 0000000000000000 R09: 0000000000000064 [ 35.107923] R10: 00000000000001a4 R11: 0000000000000202 R12: 0000000000000066 [ 35.109112] R13: 0000000000100000 R14: 0000000000100000 R15: 0000000000000400 [ 35.110357] </TASK> [ 35.110769] irq event stamp: 8415587 [ 35.111486] hardirqs last enabled at (8415599): [<ffffffff8d74b562>] __up_console_sem+0x52/0x60 [ 35.113067] hardirqs last disabled at (8415610): [<ffffffff8d74b547>] __up_console_sem+0x37/0x60 [ 35.114575] softirqs last enabled at (8415300): [<ffffffff8d6ac625>] handle_softirqs+0x315/0x3f0 [ 35.115933] softirqs last disabled at (8415291): [<ffffffff8d6ac811>] __irq_exit_rcu+0xa1/0xc0 [ 35.117316] ---[ end trace 0000000000000000 ]--- Fix this by using xas_reset() instead, which is equivalent in implementation to xas_pause() but does not advance the XArray state. Fixes: 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries") Signed-off-by: Alistair Popple <apopple@nvidia.com> Link: https://lore.kernel.org/20250523043749.1460780-1-apopple@nvidia.com Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: "Matthew Wilcow (Oracle)" <willy@infradead.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-29Merge tag 'fs_for_v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 and isofs updates from Jan Kara: - isofs fix of handling of particularly formatted Rock Ridge timestamps - Add deprecation notice about support of DAX in ext2 filesystem driver * tag 'fs_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Deprecate DAX isofs: fix Y2038 and Y2156 issues in Rock Ridge TF entry
2025-05-29Merge tag 'fsnotify_for_v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Two fanotify cleanups and support for watching namespace-owned filesystems by namespace admins (most useful for being able to watch for new mounts / unmounts happening within a user namespace)" * tag 'fsnotify_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: support watching filesystems and mounts inside userns fanotify: remove redundant permission checks fanotify: Drop use of flex array in fanotify_fh
2025-05-29Merge tag 'driver-core-6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Greg KH: "Here are the driver core / kernfs changes for 6.16-rc1. Not a huge number of changes this development cycle, here's the summary of what is included in here: - kernfs locking tweaks, pushing some global locks down into a per-fs image lock - rust driver core and pci device bindings added for new features. - sysfs const work for bin_attributes. The final churn of switching away from and removing the transitional struct members, "read_new", "write_new" and "bin_attrs_new" will come after the merge window to avoid unnecesary merge conflicts. - auxbus device creation helpers added - fauxbus fix for creating sysfs files after the probe completed properly - other tiny updates for driver core things. All of these have been in linux-next for over a week with no reported issues" * tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: kernfs: Relax constraint in draining guard Documentation: embargoed-hardware-issues.rst: Remove myself drivers: hv: fix up const issue with vmbus_chan_bin_attrs firmware_loader: use SHA-256 library API instead of crypto_shash API docs: debugfs: do not recommend debugfs_remove_recursive PM: wakeup: Do not expose 4 device wakeup source APIs kernfs: switch global kernfs_rename_lock to per-fs lock kernfs: switch global kernfs_idr_lock to per-fs lock driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create() sysfs: constify attribute_group::bin_attrs sysfs: constify bin_attribute argument of bin_attribute::read/write() software node: Correct a OOB check in software_node_get_reference_args() devres: simplify devm_kstrdup() using devm_kmemdup() platform: replace magic number with macro PLATFORM_DEVID_NONE component: do not try to unbind unbound components driver core: auxiliary bus: add device creation helpers driver core: faux: Add sysfs groups after probing
2025-05-29fuse: increase readdir buffer sizeMiklos Szeredi
Increase the buffer size to the count requested by userspace. This improves performance. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
2025-05-29readdir: supply dir_context.count as readdir buffer size hintMiklos Szeredi
This is a preparation for large readdir buffers in fuse. Simply setting the fuse buffer size to the userspace buffer size should work, the record sizes are similar (fuse's is slightly larger than libc's, so no overflow should ever happen). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
2025-05-29fuse: don't allow signals to interrupt getdents copyingMiklos Szeredi
When getting the directory contents, the entries are first fetched to a kernel buffer, then they are copied to userspace with dir_emit(). This second phase is non-blocking as long as the userspace buffer is not paged out, making it interruptible makes zero sense. Overload d_type as flags, since it only uses 4 bits from 32. Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for writebackJoanne Koong
Add support for folios larger than one page size for writeback. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for readaheadJoanne Koong
Add support for folios larger than one page size for readahead. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for queued writesJoanne Koong
Add support for folios larger than one page size for queued writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for storesJoanne Koong
Add support for folios larger than one page size for stores. Also change variable naming from "this_num" to "nr_bytes". Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for symlinksJoanne Koong
Support large folios for symlinks and change the name from fuse_getlink_page() to fuse_getlink_folio(). Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for folio readsJoanne Koong
Add support for folios larger than one page size for folio reads into the page cache. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for writethrough writesJoanne Koong
Add support for folios larger than one page size for writethrough writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: refactor fuse_fill_write_pages()Joanne Koong
Refactor the logic in fuse_fill_write_pages() for copying out write data. This will make the future change for supporting large folios for writes easier. No functional changes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for retrievesJoanne Koong
Add support for folios larger than one page size for retrieves. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support copying large foliosJoanne Koong
Currently, all folios associated with fuse are one page size. As part of the work to enable large folios, this commit adds support for copying to/from folios larger than one page size. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...