summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-10-19getname_maybe_null() - the third variant of pathname copy-inAl Viro
Semantics used by statx(2) (and later *xattrat(2)): without AT_EMPTY_PATH it's standard getname() (i.e. ERR_PTR(-ENOENT) on empty string, ERR_PTR(-EFAULT) on NULL), with AT_EMPTY_PATH both empty string and NULL are accepted. Calling conventions: getname_maybe_null(user_pointer, flags) returns * pointer to struct filename when non-empty string had been successfully read * ERR_PTR(...) on error * NULL if an empty string or NULL pointer had been given with AT_EMPTY_PATH in the flags argument. It tries to avoid allocation in the last case; it's not always able to do so, in which case the temporary struct filename instance is freed and NULL returned anyway. Fast path is inlined. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-19teach filename_lookup() to treat NULL filename as ""Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-19fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()John Garry
Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev can atomic write and the file is open for direct IO. This does not work if the file is not opened for direct IO, yet fcntl(O_DIRECT) is used on the fd later. Change to check for direct IO on a per-IO basis in generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for non-direct IO for an atomic write, change to return an error code. Relocate the block fops atomic write checks to the common write path, as to catch non-direct IO. Fixes: c34fc6f26ab8 ("fs: Initial atomic write support") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19block/fs: Pass an iocb to generic_atomic_write_valid()John Garry
Darrick and Hannes both thought it better that generic_atomic_write_valid() should be passed a struct iocb, and not just the member of that struct which is referenced; see [0] and [1]. I think that makes a more generic and clean API, so make that change. [0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/ [1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/ Fixes: c34fc6f26ab8 ("fs: Initial atomic write support") Suggested-by: Darrick J. Wong <djwong@kernel.org> Suggested-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19Merge tag '9p-for-6.12-rc4' of https://github.com/martinetd/linuxLinus Torvalds
Pull 9p fixes from Dominique Martinet: "Mashed-up update that I sat on too long: - fix for multiple slabs created with the same name - enable multipage folios - theorical fix to also look for opened fids by inode if none was found by dentry" [ Enabling multi-page folios should have been done during the merge window, but it's a one-liner, and the actual meat of the enablement is in netfs and already in use for other filesystems... - Linus ] * tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux: 9p: Avoid creating multiple slab caches with the same name 9p: Enable multipage folios 9p: v9fs_fid_find: also lookup by inode if not found dentry
2024-10-19fs: add file_refChristian Brauner
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has O(N^2) behaviour under contention with N concurrent operations and it is in a hot path in __fget_files_rcu(). The rcuref infrastructures remedies this problem by using an unconditional increment relying on safe- and dead zones to make this work and requiring rcu protection for the data structure in question. This not just scales better it also introduces overflow protection. However, in contrast to generic rcuref, files require a memory barrier and thus cannot rely on *_relaxed() atomic operations and also require to be built on atomic_long_t as having massive amounts of reference isn't unheard of even if it is just an attack. As suggested by Linus, add a file specific variant instead of making this a generic library. Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu protection. In short, freeing of files isn't delayed until a grace period has elapsed. Instead, they are freed immediately and thus can be reused (multiple times) within the same grace period. So when picking a file from the file descriptor table via its file descriptor number it is thus possible to see an elevated reference count on file->f_count even though the file has already been recycled possibly multiple times by another task. To guard against this the vfs will pick the file from the file descriptor table twice. Once before the refcount increment and once after to compare the pointers (grossly simplified). If they match then the file is still valid. If not the caller needs to fput() it. The unconditional increment makes the following race possible as illustrated by rcuref: > Deconstruction race > =================== > > The release operation must be protected by prohibiting a grace period in > order to prevent a possible use after free: > > T1 T2 > put() get() > // ref->refcnt = ONEREF > if (!atomic_add_negative(-1, &ref->refcnt)) > return false; <- Not taken > > // ref->refcnt == NOREF > --> preemption > // Elevates ref->refcnt to ONEREF > if (!atomic_add_negative(1, &ref->refcnt)) > return true; <- taken > > if (put(&p->ref)) { <-- Succeeds > remove_pointer(p); > kfree_rcu(p, rcu); > } > > RCU grace period ends, object is freed > > atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF > > [...] it prevents the grace period which keeps the object alive until > all put() operations complete. Having files by SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for this deconstruction race. Afaict, the only interesting case would be someone freeing the file and someone immediately recycling it within the same grace period and reinitializing file->f_count to ONEREF while a concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as in the race above. But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should be safe from rcuref's perspective. T1 T2 T3 fput() fget() // f_count->refcnt = ONEREF if (!atomic_add_negative(-1, &f_count->refcnt)) return false; <- Not taken // f_count->refcnt == NOREF --> preemption // Elevates f_count->refcnt to ONEREF if (!atomic_add_negative(1, &f_count->refcnt)) return true; <- taken if (put(&f_count)) { <-- Succeeds remove_pointer(p); /* * Cache is SLAB_TYPESAFE_BY_RCU * so this is freed without a grace period. */ kmem_cache_free(p); } kmem_cache_alloc() init_file() { // Sets f_count->refcnt to ONEREF rcuref_long_init(&f->f_count, 1); } Object has been reused within the same grace period via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU. /* * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and * it would work correctly because the atomic_cmpxchg() * will fail because the refcount has been reset to ONEREF by T3. */ atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF However, there are other cases to consider: (1) Benign race due to multiple atomic_long_read() CPU1 CPU2 file_ref_put() // last reference // => count goes negative/FILE_REF_NOREF atomic_long_add_negative_release(-1, &ref->refcnt) -> __file_ref_put() file_ref_get() // goes back from negative/FILE_REF_NOREF to 0 // and file_ref_get() succeeds atomic_long_add_negative(1, &ref->refcnt) // This is immediately followed by file_ref_put() // managing to set FILE_REF_DEAD file_ref_put() // __file_ref_put() continues and sees // cnt > FILE_REF_RELEASED // and splats with // "imbalanced put on file reference count" cnt = atomic_long_read(&ref->refcnt); The race however is benign and the problem is the atomic_long_read(). Instead of performing a separate read this uses atomic_long_dec_return() and pass the value to __file_ref_put(). Thanks to Linus for pointing out that braino. (2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead When a file is recycled the following race exists: CPU1 CPU2 // @file is already dead and thus // cnt >= FILE_REF_RELEASED. file_ref_get(file) atomic_long_add_negative(1, &ref->refcnt) // We thus call into __file_ref_get() -> __file_ref_get() // which sees cnt >= FILE_REF_RELEASED cnt = atomic_long_read(&ref->refcnt); // In the meantime @file gets freed kmem_cache_free() // and is immediately recycled file = kmem_cache_zalloc() // and the reference count is reinitialized // and the file alive again in someone // else's file descriptor table file_ref_init(&ref->refcnt, 1); // the __file_ref_get() slowpath now continues // and as it saw earlier that cnt >= FILE_REF_RELEASED // it wants to ensure that we're staying in the middle // of the deadzone and unconditionally sets // FILE_REF_DEAD. // This marks @file dead for CPU2... atomic_long_set(&ref->refcnt, FILE_REF_DEAD); // Caller issues a close() system call to close @file close(fd) file = file_close_fd_locked() filp_flush() // The caller sees that cnt >= FILE_REF_RELEASED // and warns the first time... CHECK_DATA_CORRUPTION(file_count(file) == 0) // and then splats a second time because // __file_ref_put() sees cnt >= FILE_REF_RELEASED file_ref_put(&ref->refcnt); -> __file_ref_put() My initial inclination was to replace the unconditional atomic_long_set() with an atomic_long_try_cmpxchg() but Linus pointed out that: > I think we should just make file_ref_get() do a simple > > return !atomic_long_add_negative(1, &ref->refcnt)); > > and nothing else. Yes, multiple CPU's can race, and you can increment > more than once, but the gap - even on 32-bit - between DEAD and > becoming close to REF_RELEASED is so big that we simply don't care. > That's the point of having a gap. I've been testing this with will-it-scale using fstat() on a machine that Jens gave me access (thank you very much!): processor : 511 vendor_id : AuthenticAMD cpu family : 25 model : 160 model name : AMD EPYC 9754 128-Core Processor and I consistently get a 3-5% improvement on 256+ threads. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202410151043.5d224a27-oliver.sang@intel.com Closes: https://lore.kernel.org/all/202410151611.f4cd71f2-oliver.sang@intel.com Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-2-387e24dc9163@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-18ufs: Convert ufs_change_blocknr() to take a folioMatthew Wilcox (Oracle)
Now that ufs_new_fragments() has a folio, pass it to ufs_change_blocknr() as a folio instead of converting it from folio to page to folio. This removes the last use of struct page in UFS. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: Pass a folio to ufs_new_fragments()Matthew Wilcox (Oracle)
All callers now have a folio, pass it to ufs_new_fragments() instead of converting back to a page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: Convert ufs_inode_getfrag() to take a folioMatthew Wilcox (Oracle)
Pass bh->b_folio instead of bh->b_page. They're in a union, so no code change expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: Convert ufs_extend_tail() to take a folioMatthew Wilcox (Oracle)
Pass bh->b_folio instead of bh->b_page. They're in a union, so no code change expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: Convert ufs_inode_getblock() to take a folioMatthew Wilcox (Oracle)
Pass bh->b_folio instead of bh->b_page. They're in a union, so no code change expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: take the handling of free block counters into a helperAl Viro
There are 3 places where those counters (many and varied...) are adjusted - when we are freeing fragments and get an entire block freed, when we are freeing blocks and (in opposite direction) when we are grabbing a block. The logics is identical (modulo the sign of adjustment) in all three; better take it into a helper - less duplication and less clutter in the callers that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18clean ufs_trunc_direct() up a bit...Al Viro
For short files (== no indirect blocks needed) UFS allows the last block to be a partial one. That creates some complications for truncation down to "short file" lengths. ufs_trunc_direct() is called when we'd already made sure that new EOF is not in a hole; nothing needs to be done if we are extending the file and in case we are shrinking the file it needs to * shrink or free the old final block. * free all full direct blocks between the new and old EOF. * possibly shrink the new final block. The logics is needlessly complicated by trying to keep all cases handled by the same sequence of operations. if not shrinking nothing to do else if number of full blocks unchanged free the tail of possibly partial last block else free the tail of (currently full) new last block free all present (full) blocks in between free the (possibly partial) old last block is easier to follow than the result of trying to unify these cases. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: get rid of ubh_{ubhcpymem,memcpyubh}()Al Viro
used only in ufs_read_cylinder_structures()/ufs_put_super_internal() and there we can just as well avoid bothering with ufs_buffer_head and just deal with it fragment-by-fragment. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs_inode_getfrag(): remove junk commentAl Viro
It used to be a stubbed out beginning of ufs2 support, which had been implemented differently quite a while ago. Remove the commented-out (pseudo-)code. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs_free_fragments(): fix the braino in sanity checkAl Viro
The function expects that all fragments it's been asked to free will be within the same block. And it even has a sanity check verifying that - it takes the fragment number modulo the number of fragments per block, adds the count and checks if that's too high. Unfortunately, it misspells the upper limit - instead of ->s_fpb (fragments per block) it says ->s_fpg (fragments per cylinder group). So "too high" ends up being insanely lenient. Had been that way since 2.1.112, when UFS write support had been added. 27 years to spot a typo... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs_clusteracct(): switch to passing fragment numberAl Viro
Currently all callers pass it a block number. All of them have it derived from a fragment number (both fragment and block numbers are within a cylinder group, and thus 32bit). Pass it the fragment number instead; none of the callers has other uses for the block number, so that ends up with cleaner code. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: untangle ubh_...block...(), part 3Al Viro
Pass fragment number instead of a block one. It's available in all callers and it makes the logics inside those helpers much simpler. The bitmap they operate upon is with bit per fragment, block being an aligned group of 1, 2, 4 or 8 adjacent fragments. We still need a switch by the number of fragments in block (== number of bits to check/set/clear), but finding the byte we need to work with becomes uniform and that makes the things easier to follow. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: untangle ubh_...block...(), part 2Al Viro
pass cylinder group descriptor instead of its buffer head (ubh, always UCPI_UBH(ucpi)) and its ->c_freeoff. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: untangle ubh_...block...() macros, part 1Al Viro
passing implicit argument to a macro by having it in a variable with special name is Not Nice(tm); just pass it explicitly. kill an unused macro, while we are at it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: fix ufs_read_cylinder() failure handlingAl Viro
1) ufs_load_cylinder() should return NULL on ufs_read_cylinder() failures. ufs_error() is not enough. As it is, IO failure on attempt to read a part of cylinder group metadata is likely to end up with an oops. 2) we drop the wrong buffer heads when undoing sb_bread() on IO failure halfway through the read - we need to brelse() what we've got from sb_bread(), TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: missing ->splice_write()Al Viro
normal ->write_iter()-based ->splice_write() works here just fine... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18ufs: fix handling of delete_entry and set_link failuresAl Viro
similar to minixfs series - make ufs_set_link() report failures, lift folio_release_kmap() into the callers of ufs_set_link() and ufs_delete_entry(), make ufs_rename() handle failures in both. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18nfsd: fix race between laundromat and free_stateidOlga Kornievskaia
There is a race between laundromat handling of revoked delegations and a client sending free_stateid operation. Laundromat thread finds that delegation has expired and needs to be revoked so it marks the delegation stid revoked and it puts it on a reaper list but then it unlock the state lock and the actual delegation revocation happens without the lock. Once the stid is marked revoked a racing free_stateid processing thread does the following (1) it calls list_del_init() which removes it from the reaper list and (2) frees the delegation stid structure. The laundromat thread ends up not calling the revoke_delegation() function for this particular delegation but that means it will no release the lock lease that exists on the file. Now, a new open for this file comes in and ends up finding that lease list isn't empty and calls nfsd_breaker_owns_lease() which ends up trying to derefence a freed delegation stateid. Leading to the followint use-after-free KASAN warning: kernel: ================================================================== kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd] kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205 kernel: kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9 kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024 kernel: Call trace: kernel: dump_backtrace+0x98/0x120 kernel: show_stack+0x1c/0x30 kernel: dump_stack_lvl+0x80/0xe8 kernel: print_address_description.constprop.0+0x84/0x390 kernel: print_report+0xa4/0x268 kernel: kasan_report+0xb4/0xf8 kernel: __asan_report_load8_noabort+0x1c/0x28 kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd] kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd] kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd] kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd] kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd] kernel: nfsd4_open+0xa08/0xe80 [nfsd] kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd] kernel: nfsd_dispatch+0x22c/0x718 [nfsd] kernel: svc_process_common+0x8e8/0x1960 [sunrpc] kernel: svc_process+0x3d4/0x7e0 [sunrpc] kernel: svc_handle_xprt+0x828/0xe10 [sunrpc] kernel: svc_recv+0x2cc/0x6a8 [sunrpc] kernel: nfsd+0x270/0x400 [nfsd] kernel: kthread+0x288/0x310 kernel: ret_from_fork+0x10/0x20 This patch proposes a fixed that's based on adding 2 new additional stid's sc_status values that help coordinate between the laundromat and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()). First to make sure, that once the stid is marked revoked, it is not removed by the nfsd4_free_stateid(), the laundromat take a reference on the stateid. Then, coordinating whether the stid has been put on the cl_revoked list or we are processing FREE_STATEID and need to make sure to remove it from the list, each check that state and act accordingly. If laundromat has added to the cl_revoke list before the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove it from the list. If nfsd4_free_stateid() finds that operations arrived before laundromat has placed it on cl_revoke list, it marks the state freed and then laundromat will no longer add it to the list. Also, for nfsd4_delegreturn() when looking for the specified stid, we need to access stid that are marked removed or freeable, it means the laundromat has started processing it but hasn't finished and this delegreturn needs to return nfserr_deleg_revoked and not nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the lack of it will leave this stid on the cl_revoked list indefinitely. Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock") CC: stable@vger.kernel.org Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-18Merge tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - Fix possible double free setting xattrs - Fix slab out of bounds with large ioctl payload - Remove three unused functions, and an unused variable that could be confusing * tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Remove unused functions smb/client: Fix logically dead code smb: client: fix OOBs when building SMB2_IOCTL request smb: client: fix possible double free in smb2_set_ea()
2024-10-18Merge tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Carlos Maiolino: - Fix integer overflow in xrep_bmap - Fix stale dealloc punching for COW IO * tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: punch delalloc extents from the COW fork for COW writes xfs: set IOMAP_F_SHARED for all COW fork allocations xfs: share more code in xfs_buffered_write_iomap_begin xfs: support the COW fork in xfs_bmap_punch_delalloc_range xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof xfs: factor out a xfs_file_write_zero_eof helper iomap: move locking out of iomap_write_delalloc_release iomap: remove iomap_file_buffered_write_punch_delalloc iomap: factor out a iomap_last_written_block helper xfs: fix integer overflow in xrep_bmap
2024-10-18proc: Fix W=1 build kernel-doc warningThorsten Blum
Building the kernel with W=1 generates the following warning: fs/proc/fd.c:81: warning: This comment starts with '/**', but isn't a kernel-doc comment. Use a normal comment for the helper function proc_fdinfo_permission(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241018102705.92237-2-thorsten.blum@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-18bcachefs: fsck: Improve hash_check_key()Kent Overstreet
hash_check_key() checks and repairs the hash table btrees: dirents and xattrs are open addressing hash tables. We recently had a corruption reported where the hash type on an inode somehow got flipped, which made the existing dirents invisible and allowed new ones to be created with the same name. Now, hash_check_key() can repair duplicates: it will delete one of them, if it has an xattr or dangling dirent, but if it has two valid dirents one of them gets renamed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: bch2_hash_set_or_get_in_snapshot()Kent Overstreet
Add a variant of bch2_hash_set_in_snapshot() that returns the existing key on -EEXIST. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: Repair mismatches in inode hash seed, typeKent Overstreet
Different versions of the same inode (same inode number, different snapshot ID) must have the same hash seed and type - lookups require this, since they see keys from different snapshots simultaneously. To repair we only need to make the inodes consistent, hash_check_key() will do the rest. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: Add hash seed, type to inode_to_text()Kent Overstreet
This helped with discovering some filesystem corruption fsck has having trouble with: the str_hash type had gotten flipped on one snapshot's version of an inode. All versions of a given inode number have the same hash seed and hash type, since lookups will be done with a single hash/seed and type and see dirents/xattrs from multiple snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: INODE_STR_HASH() for bch_inode_unpackedKent Overstreet
Trivial cleanup - add a normal BITMASK() helper for bch_inode_unpacked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: Run in-kernel offline fsck without ratelimit errorsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: skip mount option handle for empty string.Hongbo Li
The options parse in get_tree will split the options buffer, it will get the empty string for last one by strsep(). After commit ea0eeb89b1d5 ("bcachefs: reject unknown mount options") is merged, unknown mount options is not allowed (here is empty string), and this causes this errors. This can be reproduced just by the following steps: bcachefs format /dev/loop mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/ Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix incorrect show_options resultsHongbo Li
When call show_options in bcachefs, the options buffer is appeneded to the seq variable. In fact, it requires an additional comma to be appended first. This will affect the remount process when reading existing mount options. Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: Fix data corruption on -ENOSPC in buffered write pathKent Overstreet
Found by generic/299: When we have to truncate a write due to -ENOSPC, we may have to read in the folio we're writing to if we're now no longer doing a complete write to a !uptodate folio. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: bch2_folio_reservation_get_partial() is now better behavedKent Overstreet
bch2_folio_reservation_get_partial(), on partial success, will now return a reservation that's aligned to the filesystem blocksize. This is a partial fix for fstests generic/299 - fio verify is badly behaved in the presence of short writes that aren't aligned to its blocksize. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()Kent Overstreet
bch2_disk_reservation_put() zeroes out the reservation - oops. This fixes a disk reservation leak when getting a quota reservation returned an error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefS: ec: fix data type on stripe deletionKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: Don't use commit_do() unnecessarilyKent Overstreet
Using commit_do() to call alloc_sectors_start_trans() breaks when we're randomly injecting transaction restarts - the restart in the commit causes us to leak the lock that alloc_sectorS_start_trans() takes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: handle restarts in bch2_bucket_io_time_reset()Kent Overstreet
bch2_bucket_io_time_reset() doesn't need to succeed, which is why it didn't previously retry on transaction restart - but we're now treating these as errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix restart handling in bch2_alloc_write_key()Kent Overstreet
This is ugly: We may discover in alloc_write_key that the data type we calculated is wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the disk accounting counters we calculated need to be updated. But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe w.r.t. transaction restarts, so we need to propagate the fixup back to our gc state in case we take a transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix restart handling in bch2_do_invalidates_work()Kent Overstreet
this one is fairly harmless since the invalidate worker will just run again later if it needs to, but still worth fixing Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix missing restart handling in bch2_read_retry_nodecode()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix restart handling in bch2_fiemap()Kent Overstreet
We were leaking transaction restart errors to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix bch2_hash_delete() error pathKent Overstreet
we were exiting an iterator that hadn't been initialized Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18bcachefs: fix restart handling in bch2_rename2()Kent Overstreet
This should be impossible to hit in practice; the first lookup within a transaction won't return a restart due to lock ordering, but we're adding fault injection for transaction restarts and shaking out bugs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-17Merge tag 'mm-hotfixes-stable-2024-10-17-16-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "28 hotfixes. 13 are cc:stable. 23 are MM. It is the usual shower of unrelated singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits) maple_tree: add regression test for spanning store bug maple_tree: correct tree corruption on spanning store mm/mglru: only clear kswapd_failures if reclaimable mm/swapfile: skip HugeTLB pages for unuse_vma selftests: mm: fix the incorrect usage() info of khugepaged MAINTAINERS: add Jann as memory mapping/VMA reviewer mm: swap: prevent possible data-race in __try_to_reclaim_swap mm: khugepaged: fix the incorrect statistics when collapsing large file folios MAINTAINERS: kasan, kcov: add bugzilla links mm: don't install PMD mappings when THPs are disabled by the hw/process/vma mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs Docs/damon/maintainer-profile: add missing '_' suffixes for external web links maple_tree: check for MA_STATE_BULK on setting wr_rebalance mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets() mm: remove unused stub for can_swapin_thp() mailmap: add an entry for Andy Chiu MAINTAINERS: add memory mapping/VMA co-maintainers fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization ...
2024-10-17binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4Mark Brown
AT_HWCAP3 and AT_HWCAP4 were recently defined for use on PowerPC in commit 3281366a8e79 ("uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries"). Since we want to start using AT_HWCAP3 on arm64 add support for exposing both these new hwcaps via binfmt_elf. Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Kees Cook <kees@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20241004-arm64-elf-hwcap3-v2-1-799d1daad8b0@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>