Age | Commit message | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- subpage mode fixes:
- access correct object (folio) when looking up bit offset
- fix assertion condition for number of blocks per folio
- fix upper boundary of locking range in hole punch
- zoned fixes:
- fix potential deadlock caught by lockdep when zone reporting and
device freeze run in parallel
- fix zone write pointer mismatch and NULL pointer dereference when
metadata are converted from DUP to RAID1
- fix error handling when reloc inode creation fails
- in tree-checker, unify error code for header level check
- block layer: add helpers to read zone capacity
* tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: skip reporting zone for new block group
block: introduce zone capacity helper
btrfs: tree-checker: adjust error code for header level check
btrfs: fix invalid inode pointer after failure to create reloc inode
btrfs: zoned: return EIO on RAID1 block group write pointer mismatch
btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()
btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()
|
|
This reduces the slowdown in the face of multiple callers issuing close on
what turns out to not be the last reference.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250418125756.59677-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504171513.6d6f8a16-lkp@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The large folio + buffer head noref migration scenarios are
being naughty and blocking while holding a spinlock.
As a consequence of the pagecache lookup path taking the
folio lock this serializes against migration paths, so
they can wait for each other. For the private_lock
atomic case, a new BH_Migrate flag is introduced which
enables the lookup to bail.
This allows the critical region of the private_lock on
the migration path to be reduced to the way it was before
ebdf4de5642fb6 ("mm: migrate: fix reference check race
between __find_get_block() and migration"), that is covering
the count checks.
The scope is always noref migration.
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com
Fixes: 3c20917120ce61 ("block/bdev: enable large folio support for large logical block sizes")
Reviewed-by: Jan Kara <jack@suse.cz>
Co-developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-8-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev # [0] [1]
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
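The bail-out for atomic lookups described above can be modeled in userspace. This is a minimal sketch and not the kernel code: `struct buf_model`, its `migrating` field, and `lookup_atomic()` are hypothetical names, and a C11 atomic flag stands in for the BH_Migrate bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head; "migrating" models BH_Migrate. */
struct buf_model {
	atomic_bool migrating;
	int blocknr;
};

/*
 * Model of a lookup from atomic context: it cannot sleep until migration
 * finishes, so it reports a miss while the flag is set.
 */
static struct buf_model *lookup_atomic(struct buf_model *b)
{
	assert(b != NULL);
	if (atomic_load_explicit(&b->migrating, memory_order_acquire))
		return NULL;	/* migration in flight: bail instead of waiting */
	return b;
}
```

A caller that is allowed to sleep would not need this flag in the model, since it could simply wait on a sleeping lock instead.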
|
|
Enable ext4_free_blocks() to use it, which has a cond_resched to begin
with. Convert to the new nonatomic flavor to benefit from potential
performance benefits and adapt in the future vs migration such that
semantics are kept.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-7-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert to the new nonatomic flavor to benefit from potential
performance benefits and adapt in the future vs migration such
that semantics are kept.
- jbd2_journal_revoke(): can sleep (has might_sleep() in the beginning)
- jbd2_journal_cancel_revoke(): only used from do_get_write_access() and
do_get_create_access() which do sleep. So can sleep.
- jbd2_clear_buffer_revoked_flags(): only called from journal commit code
which sleeps. So can sleep.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-6-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This is a path that allows for blocking as it does IO. Convert
to the new nonatomic flavor to benefit from potential performance
benefits and adapt in the future vs migration such that semantics
are kept.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-5-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert to the new nonatomic flavor to benefit from potential performance
benefits and adapt in the future vs migration such that semantics
are kept.
Convert write_boundary_block() which already takes the buffer
lock as well as bdev_getblk() depending on the respective gfp flags.
There are no changes in semantics.
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-4-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev # [0] [1]
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add __find_get_block_nonatomic() and sb_find_get_block_nonatomic()
calls for which users will be converted where safe. These versions
will take the folio lock instead of the mapping's private_lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-3-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Callers of __find_get_block() may or may not allow for blocking
semantics, and it is currently assumed that they do not. Lay out
two paths based on this. The private_lock scheme will
continue to be used for atomic contexts. Otherwise take the
folio lock instead, which protects the buffers, such as
vs migration and try_to_free_buffers().
Per the "hack idea", the latter can alleviate contention on
the private_lock for bdev mappings. For reasons of determinism
and to avoid making bugs hard to reproduce, trylocking is not
attempted.
No change in semantics. All lookup users still take the spinlock.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Link: https://kdevops.org/ext4/v6.15-rc2.html # [0]
Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1]
Link: https://lore.kernel.org/20250418015921.132400-2-dave@stgolabs.net
Tested-by: kdevops@lists.linux.dev
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
xfs_zoned_need_gc makes use of mult_frac() to calculate the threshold
for triggering the zoned garbage collector, but it turns out mult_frac()
doesn't work properly with 64-bit data types, and this caused build
failures on some 32-bit architectures.
Fix this by essentially open coding mult_frac() in a 64-bit friendly
way.
Note that we don't need to bother with counter underflow here because
xfs_estimate_freecounter() will always return a positive value, as it
leverages percpu_counter_read_positive to read such counters.
Fixes: 845abeb1f06a ("xfs: add tunable threshold parameter for triggering zone GC")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202504181233.F7D9Atra-lkp@intel.com/
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
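The idea of open coding a fraction multiply in a 64-bit friendly way can be sketched as follows. This is a userspace illustration under stated assumptions, not the code from the patch: `scale_u64()` is a hypothetical name, and plain `/` and `%` are used here, whereas kernel code on 32-bit architectures would go through the division helpers in `linux/math64.h`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute x * numer / denom without overflowing the intermediate product:
 * split x into quotient and remainder by denom first.
 * Assumes numer <= denom, so quot * numer cannot exceed x.
 */
static uint64_t scale_u64(uint64_t x, uint32_t numer, uint32_t denom)
{
	assert(denom != 0);
	uint64_t quot = x / denom;
	uint64_t rem = x % denom;

	/* rem < denom < 2^32 and numer < 2^32, so rem * numer fits in 64 bits. */
	return quot * numer + (rem * numer) / denom;
}
```

With this split, even `scale_u64(UINT64_MAX, 50, 100)` stays within range, whereas a naive `x * numer` would wrap.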
|
|
inode_operations.fileattr_(get|set) didn't exist when the various flag
ioctls were implemented - but they do now, which means we can delete a
bunch of ioctl code in favor of standard VFS level wrappers.
Closes: https://lore.kernel.org/linux-bcachefs/7ltgrgqgfummyrlvw7hnfhnu42rfiamoq3lpcvrjnlyytldmzp@yazbhusnztqn/
Cc: Petr Vorel <pvorel@suse.cz>
Cc: Andrea Cervesato <andrea.cervesato@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We had a buggy release of bcachefs-tools that wasn't properly aligning
bucket sizes.
We can't ask users to reformat - and it's easy to teach the allocator to
make sure writes are properly aligned.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't check for the GIF_ALLOC_FAILED flag in gfs2_ea_dealloc() and pass
that information explicitly instead. This allows for a cleaner
follow-up patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Move gfs2_dinode_dealloc() and its helper gfs2_final_release_pages()
from super.c to inode.c.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_create_inode(), we initialize the inode from scratch and then we
write the result to disk. Clear the GLF_INSTANTIATE_NEEDED glock flag
to indicate that the inode is up to date. Otherwise, the next time the
inode glock is acquired, gfs2_instantiate() would reread the inode from
disk, which isn't necessary.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When gfs2_create_inode() finds a directory, make sure to return -EISDIR.
Fixes: 571a4b57975a ("GFS2: bugger off early if O_CREAT open finds a directory")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
free_percpu() checks for NULL pointers internally.
Remove unneeded NULL check here.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
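The same cleanup pattern exists in userspace C, where `free(NULL)` is defined to do nothing. The sketch below is an analogy only: `struct stats_model` and `stats_destroy()` are hypothetical names, and `free()` stands in for `free_percpu()`.

```c
#include <assert.h>
#include <stdlib.h>

struct stats_model {
	long *counters;	/* may be NULL if allocation never happened */
};

static void stats_destroy(struct stats_model *s)
{
	assert(s != NULL);
	free(s->counters);	/* no "if (s->counters)" needed: free(NULL) is a no-op */
	s->counters = NULL;
}
```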
|
|
Check the return value of sb_min_blocksize(): it will be 0 when the
requested block size is invalid.
In addition, check the return value of sb_set_blocksize() as well.
Reported-by: syzbot+b0018b7468b2af33b4d5@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
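The check described above follows a common pattern: a helper signals failure by returning 0, and the caller must not carry on with that value. A minimal userspace sketch, assuming hypothetical names (`min_blocksize_model()`, `fill_super_model()`), an illustrative validity rule (power of two between 512 and 65536), and `-EINVAL` as the error chosen for illustration:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for sb_min_blocksize(): returns the block size to
 * use, or 0 when the resulting size is invalid.
 */
static int min_blocksize_model(int requested, int device_min)
{
	assert(device_min > 0);
	int size = requested < device_min ? device_min : requested;

	if (size < 512 || size > 65536 || (size & (size - 1)) != 0)
		return 0;
	return size;
}

/* Caller side: propagate the failure instead of continuing with size 0. */
static int fill_super_model(int requested, int device_min)
{
	int size = min_blocksize_model(requested, device_min);

	if (!size)
		return -EINVAL;
	return size;
}
```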
|
|
Currently, sdp->sd_aspace and the per-inode metadata address spaces use
sb->s_bdev->bd_mapping->host as their ->host; folios in those address
spaces will thus appear to be on bdev rather than on gfs2 filesystems.
This is a problem because gfs2 doesn't support cgroup writeback
(SB_I_CGROUPWB), but bdev does.
Fix that by using a "dummy" gfs2 inode as ->host in those address
spaces. When coming from a folio, folio->mapping->host->i_sb will then
be a gfs2 super block and the SB_I_CGROUPWB flag will not be set in
sb->s_iflags.
Based on a previous version from Bob Peterson from several years ago.
Thanks to Tetsuo Handa, Jan Kara, and Rafael Aquini for helping figure
this out.
Fixes: aaa2cacf8184 ("writeback: add lockdep annotation to inode_to_wb()")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Currently, gfs2 always sets the DLM_LKF_VALBLK flag to enable lvb
handling even when sb_lvbptr is NULL. This currently causes no problems
because DLM ignores the DLM_LKF_VALBLK flag when sb_lvbptr is NULL, but
it does violate the DLM API. Fix that by only setting DLM_LKF_VALBLK
when sb_lvbptr is not NULL.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
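The rule "only set the flag when a value block buffer exists" can be sketched like this. The names `lksb_model`, `make_lock_flags()`, and `LKF_VALBLK_MODEL` are hypothetical, and the bit value is a placeholder rather than the real DLM_LKF_VALBLK constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LKF_VALBLK_MODEL 0x1u	/* placeholder bit, not the real flag value */

struct lksb_model {
	char *sb_lvbptr;	/* lock value block buffer, may be NULL */
};

/* Request lvb handling only when there is a buffer to carry the value. */
static uint32_t make_lock_flags(const struct lksb_model *lksb, uint32_t base)
{
	assert(lksb != NULL);
	if (lksb->sb_lvbptr)
		base |= LKF_VALBLK_MODEL;
	return base;
}
```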
|
|
This patch moves the msleep_interruptible() out of the non-sleepable
context by moving the ls->ls_recover_spin spinlock around so
msleep_interruptible() will be called in a sleepable context.
Cc: stable@vger.kernel.org
Fixes: 4a7727725dc7 ("GFS2: Fix recovery issues for spectators")
Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Previously, copygc and rebalance weren't started until the very end of
mounting, after all recovery passes have finished.
But copygc really should be started earlier, since it may be needed for
allocations to make forward progress. Additionally, we've been seeing
occasional bug reports where starting the kthread fails due to a pending
signal - i.e. we're getting timed out by systemd (during a version
upgrade), but we're not seeing the signal until mount is about to
complete.
Additionally, we now have copygc/rebalance explicitly wait for
check_snapshots to complete (if being run); they require that for
snapshot_is_ancestor() in the data move path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't use a continue; this simplifies the next patch where
run_recovery_passes() will be responsible for waking up copygc and
rebalance at the appropriate time.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This makes it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead().
Readahead on anonymous inodes didn't work because they didn't have a
proper mode. Now that they do, we need to keep returning EINVAL;
otherwise LTP will fail.
We also need to ensure that ioctls aren't simply fired like they are for
regular files so things like inotify inodes continue to correctly call
their own ioctl handlers as in [1].
Reported-by: Xilin Wu <sophon@radxa.com>
Link: https://lore.kernel.org/3A9139D5CD543962+89831381-31b9-4392-87ec-a84a5b3507d8@radxa.com [1]
Link: https://lore.kernel.org/7a1a7076-ff6b-4cb0-94e7-7218a0a44028@sirena.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The routine only encounters errors when people try to access things they
can't, which is a negligible amount of calls.
The only questionable bit might be the pre-existing predict around
MAY_WRITE. Currently the routine is predominantly used for MAY_EXEC, so
this makes some sense.
I verified this straightens out the asm.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250416221626.2710239-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This system call has been deprecated for quite a while now.
Let's try and remove it from the kernel completely.
Link: https://lore.kernel.org/20250415-kanufahren-besten-02ac00e6becd@brauner
Acked-by: Kees Cook <kees@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove validate_constant_table() since:
- It has no caller.
- It has the 3 bugs below for a good constant table array[], which must
end with an empty entry. Take the invocation below as an example:
validate_constant_table(array, ARRAY_SIZE(array), ...)
- Always return wrong value due to the last empty entry.
- Imprecise error message for missorted case.
- Potential NULL pointer dereference since the last pr_err() may use
@tbl[i].name NULL pointer to print the last empty entry's name.
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://lore.kernel.org/20250415-fix_fs-v4-1-5d575124a3ff@quicinc.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
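Why the terminating entry matters can be seen in a small model. This is a sketch with assumed names (`struct const_entry`, `bool_names`, `count_entries()`); the layout of a name pointer plus a value is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct const_entry {
	const char *name;	/* NULL in the terminating entry */
	int value;
};

static const struct const_entry bool_names[] = {
	{ "off", 0 },
	{ "on", 1 },
	{ NULL, 0 },	/* empty terminator */
};

/* Count real entries by stopping at the terminator. */
static size_t count_entries(const struct const_entry *tbl)
{
	size_t n = 0;

	assert(tbl != NULL);
	while (tbl[n].name)
		n++;
	return n;
}
```

Here `ARRAY_SIZE(bool_names)` is 3 while only 2 entries carry a name, so a validator that walks all `ARRAY_SIZE()` slots ends up looking at the entry whose `name` is NULL.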
|
|
Looking at the asm produced by gcc 13.3 for x86-64:
1. may_lookup() usage was not optimized for succeeding, despite the
routine being inlined and rightfully starting with likely(!err)
2. the compiler assumed the path will have an indefinite amount of
slashes to skip, after which the result will be an empty name
As such:
1. predict may_lookup() succeeding
2. check for one slash, no explicit predicts. do roll forward with
skipping more slashes while predicting there is only one
3. predict the path to find was not a mere slash
This also has a side effect of shrinking the file:
add/remove: 1/1 grow/shrink: 0/3 up/down: 934/-1012 (-78)
Function old new delta
link_path_walk - 934 +934
path_parentat 138 112 -26
path_openat 4864 4823 -41
path_lookupat 418 374 -44
link_path_walk.part.constprop 901 - -901
Total: Before=46639, After=46561, chg -0.17%
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/20250412110935.2267703-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
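The "check for one slash, then roll forward while predicting there is only one" step can be sketched in userspace. This is an illustration, not the code from fs/namei.c: `skip_slashes()` and `remainder_is_empty()` are hypothetical names, and `__builtin_expect` is assumed to be available (GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Skip the separator after a path component; extra slashes are rare. */
static const char *skip_slashes(const char *name)
{
	assert(name != NULL);
	if (*name == '/') {
		do {
			name++;
		} while (unlikely(*name == '/'));
	}
	return name;
}

/* Predict that something is left to look up after the slashes. */
static int remainder_is_empty(const char *name)
{
	return unlikely(*skip_slashes(name) == '\0');
}
```

The size table in the message is self-consistent: 934 - 1012 = -78 bytes, and 46639 - 78 = 46561, a change of about -0.17%.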
|
|
Make file-nr output the total number of allocated file handles rather than
the per-cpu cache number: it's more precise, and this is not a hot path.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/20250410112117.2851-1-lirongqing@baidu.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Adding an unlikely() hint on the n < 0 comparison return path improves
run-time performance of the select() system call, the negative
value of n is very uncommon in normal select usage.
Benchmarking on a Debian-based Intel(R) Core(TM) Ultra 9 285K with
a 6.15-rc1 kernel built with 14.2.0 using a select of 1000 file
descriptors with zero timeout shows a consistent call reduction from
258 ns down to 254 ns, which is a ~1.5% performance improvement.
Results based on running 25 tests with turbo disabled (to reduce clock
freq turbo changes), with 30 second run per test and comparing the number
of select() calls per second. The % standard deviation of the 25 tests
was 0.24%, so results are reliable.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/20250414092426.53529-1-colin.i.king@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
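A branch hint of this kind looks roughly as follows in plain C. This is a minimal sketch, assuming a compiler with `__builtin_expect` (GCC or Clang); `check_nfds()` is a hypothetical helper and not the function touched by the patch.

```c
#include <assert.h>
#include <errno.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Validate the descriptor count; a negative n is the rare error path. */
static int check_nfds(int n, int max_fds)
{
	assert(max_fds >= 0);
	if (unlikely(n < 0))
		return -EINVAL;
	return n > max_fds ? max_fds : n;	/* clamp to the table size */
}
```

The quoted improvement follows from the two timings: (258 - 254) / 258 is about 1.55%.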
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
found with the new enumerated_ref code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've got some reports of this happening in the wild, and need a bit
more info to debug it:
https://github.com/koverstreet/bcachefs/issues/854
https://www.reddit.com/r/bcachefs/comments/1k28kjm/surprise_soft_lockup/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't require that bucket size is block size aligned (although it
should be!) - so we need to handle this in the journal code.
This fixes an assertion pop in journal_entry_close(), where the journal
entry overruns available space - after rounding it up to block size.
Fixes: https://github.com/koverstreet/bcachefs/issues/854
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
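The overrun is simple arithmetic: rounding an entry up to the block size can exceed the space left in a bucket whose size is not a multiple of the block size. The sketch below only illustrates that arithmetic with hypothetical helpers (`round_up_u64()`, `usable_sectors()`); it does not reproduce how the journal code handles it.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to a multiple of align (align > 0). */
static uint64_t round_up_u64(uint64_t v, uint64_t align)
{
	assert(align != 0);
	return (v + align - 1) / align * align;
}

/* Space usable for block-sized writes in a bucket that may be unaligned. */
static uint64_t usable_sectors(uint64_t bucket_sectors, uint64_t block_sectors)
{
	assert(block_sectors != 0);
	return bucket_sectors - bucket_sectors % block_sectors;
}
```

For a 1020-sector bucket and 8-sector blocks, an entry of 1018 sectors fits before rounding but becomes 1024 after it; limiting entries to `usable_sectors(1020, 8)`, which is 1016, keeps the rounded size inside the bucket.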
|
|
Syzbot managed to come up with a filesystem where check/repair got
rather confused at finding a reflink pointer in the inodes btree.
Currently, the "key allowed in this btree" checks only apply at commit
time, not read time - for forwards compatibility. It seems this is too
loose.
Now, strict key type allowed checks apply:
- at commit time (no forward compatibility issues)
- for btree node pointers
- if it's a known btree, known key type, and the key type has the
"BKEY_TYPE_strict_btree_checks" flag.
This means we still have the option of using generic key types - e.g.
KEY_TYPE_error, KEY_TYPE_set - on more existing btrees in the future,
while most key types that are intended for only a specific btree get
stricter checks.
Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We now more often do repair automatically, without the user invoking
fsck - and sometimes that can involve fixing lots of errors, so let's
avoid flooding the dmesg log.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+baee8591f336cab0958b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't set JOURNAL_running until we're also calling
journal_space_available() for the first time.
If JOURNAL_running is set, shutdown will write an empty journal entry -
but this will hit an assert in journal_entry_open() if we've never
called journal_space_available().
Reported-by: syzbot+53bb24d476ef8368a7f0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
All of these cases are perfectly valid and good traditional C, but hit
by the "you're not NUL-terminating your byte array" warning.
And none of the cases want any terminating NUL character.
Mark them __nonstring to shut up gcc-15 (and in the case of the ak8974
magnetometer driver, I just removed the explicit array size and let gcc
expand the 3-byte and 6-byte arrays by one extra byte, because it was
the simpler change).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
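Both approaches mentioned above can be shown in a few lines. This is a sketch under assumptions: `NONSTRING_MODEL` is a local macro defined here (the kernel spells its own macro `__nonstring`), it expands to nothing on compilers without the attribute, and `magic`, `magic_nul`, and `has_magic()` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#if defined(__has_attribute)
# if __has_attribute(nonstring)
#  define NONSTRING_MODEL __attribute__((nonstring))
# endif
#endif
#ifndef NONSTRING_MODEL
# define NONSTRING_MODEL	/* attribute unavailable: expand to nothing */
#endif

/* Exactly 4 bytes, deliberately without a terminating NUL. */
static const char magic[4] NONSTRING_MODEL = "ABCD";

/* Alternative: drop the explicit size and let the compiler append the NUL. */
static const char magic_nul[] = "ABCD";

/* Compare raw bytes; no string function ever sees the unterminated array. */
static int has_magic(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	return len >= sizeof(magic) && memcmp(buf, magic, sizeof(magic)) == 0;
}
```

The first array stays at 4 bytes, while the second grows to 5, which is the one-extra-byte trade-off described for the magnetometer driver.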
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"16 hotfixes. 2 are cc:stable and the remainder address post-6.14
issues or aren't considered necessary for -stable kernels.
All patches are basically for MM although five are alterations to
MAINTAINERS"
[ Basic counting skills are clearly not a strictly necessary requirement
for kernel maintainers. - Linus ]
* tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add section for locking of mm's and VMAs
mm: vmscan: fix kswapd exit condition in defrag_mode
mm: vmscan: restore high-cpu watermark safety in kswapd
MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section
mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
mm, hugetlb: increment the number of pages to be reset on HVO
writeback: fix false warning in inode_to_wb()
docs: ABI: replace mcroce@microsoft.com with new Meta address
mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()
MAINTAINERS: add memory advice section
MAINTAINERS: add mmap trace events to MEMORY MAPPING
mm: memcontrol: fix swap counter leak from offline cgroup
MAINTAINERS: add MM subsection for the page allocator
MAINTAINERS: update SLAB ALLOCATOR maintainers
fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Revert the hfs{plus} deprecation warning that's also included in this
pull request. The commit introducing the deprecation warning resides
rather early in this branch. So simply dropping it would've rebased
all other commits which I decided to avoid. Hence the revert in the
same branch
[ Background - the deprecation warning discussion resulted in people
stepping up, and so hfs{plus} will have a maintainer taking care of
it after all.. - Linus ]
- Switch CONFIG_SYSFS_SYSCALL default to n and decouple from
CONFIG_EXPERT
- Fix an audit bug caused by changes to our kernel path lookup helpers
this cycle. Audit needs the parent path even if the dentry it tried
to look up is negative
- Ensure that the kernel path lookup helpers leave the passed in path
argument clean when they return an error. This is consistent with all
our other helpers
- Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
information is available to kernel consumers as well
- Don't set a timer and call schedule() if the timer will expire
immediately in epoll
- Mark netfs lookup tables with __nonstring
* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "hfs{plus}: add deprecation warning"
fs: move the bdex_statx call to vfs_getattr_nosec
netfs: Mark __nonstring lookup tables
eventpoll: Set epoll timeout if it's in the future
fs: ensure that *path_locked*() helpers leave passed path pristine
fs: add kern_path_locked_negative()
hfs{plus}: add deprecation warning
Kconfig: switch CONFIG_SYSFS_SYCALL default to n
|
|
This reverts commit ddee68c499f76ae47c011549df5be53db0057402.
There's ongoing discussion about better maintenance of at least hfsplus.
Revert the deprecation warning for now.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- v6.15 libcrc clean-up makes invalid configurations possible
- Fix a potential deadlock introduced during the v6.15 merge window
* tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: decrease sc_count directly if fail to queue dl_recall
nfs: add missing selections of CONFIG_CRC32
|
|
Pull smb client fixes from Steve French:
- Fix hard link lease key problem when close is deferred
- Revert the socket lockdep/refcount workarounds done in cifs.ko now
that it is fixed at the socket layer
* tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
Revert "smb: client: fix TCP timers deadlock after rmmod"
Revert "smb: client: Fix netns refcount imbalance causing leaks and use-after-free"
smb3 client: fix open hardlink on deferred close file error
|
|
Pull smb server fixes from Steve French:
- Fix integer overflow in server disconnect deadtime calculation
- Three fixes for potential use-after-frees: one for oplocks, one
for leases, and one for Kerberos authentication
- Fix to prevent attempted write to directory
- Fix locking warning for durable scavenger thread
* tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Prevent integer overflow in calculation of deadtime
ksmbd: fix the warning from __kernel_write_iter
ksmbd: fix use-after-free in smb_break_all_levII_oplock()
ksmbd: fix use-after-free in __smb2_lease_break_noti()
ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"
ksmbd: Fix dangling pointer in krb_authenticate
|
|
Alison reports an issue with fsdax when large extents end up using large
ZONE_DEVICE folios:
[ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[ 417.796982] #PF: supervisor read access in kernel mode
[ 417.797540] #PF: error_code(0x0000) - not-present page
[ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
[ 417.800150] Tainted: [O]=OOT_MODULE
[ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[ 417.801948] Code: ...
[ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 417.811353] Call Trace:
[ 417.811709] <TASK>
[ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
[ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
[ 417.813132] insert_page+0x78/0xf0
[ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
[ 417.814088] dax_fault_iter+0x484/0x7b0
[ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
[ 417.815055] dax_iomap_fault+0x39/0x40
[ 417.815499] __xfs_write_fault+0x139/0x380
[ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
[ 417.816483] xfs_write_fault+0x41/0x50
[ 417.816966] xfs_filemap_fault+0x3b/0xe0
[ 417.817424] __do_fault+0x31/0x180
[ 417.817859] __handle_mm_fault+0xee1/0x1a60
[ 417.818325] ? debug_smp_processor_id+0x17/0x20
[ 417.818844] handle_mm_fault+0xe1/0x2b0
[...]
The issue is that when we split a large ZONE_DEVICE folio to order-0 ones,
we don't reset the order/_nr_pages. As folio->_nr_pages overlays
page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
has folio->memcg_data set. And we never manually initialize
folio->memcg_data in fsdax code, because we never expect it to be set at
all.
When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries
to use folio->memcg_data (because it's non-NULL) but it does not actually
point at a memcg, resulting in the problem.
Alison also observed that these folios sometimes have "locked" set, which
is rather concerning (folios locked from the beginning ...). The reason
is that the order for large folios is stored in page[1]->flags, which
become the folio->flags of a new small folio.
Let's fix it by adding a folio helper to clear order/_nr_pages for
splitting purposes.
Maybe we should reinitialize other large folio flags / folio members as
well when splitting, because they might similarly cause harm once page[1]
becomes a folio? At least other flags in PAGE_FLAGS_SECOND should not be
set for fsdax, so at least page[1]->flags might be as expected with this
fix.
From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
re-initialize most things from a previous page[1] used by large folios
that fsdax cares about. For example folio->private might not get
reinitialized, but maybe that's not relevant -- no traces of its use in
fsdax code. Needs a closer look.
Another thing that should be considered in the future is performing
similar checks as we perform in free_tail_page_prepare()
-- checking pincount etc.
-- when freeing a large fsdax folio.
Link: https://lkml.kernel.org/r/20250410091020.119116-1-david@redhat.com
Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Alison Schofield <alison.schofield@intel.com>
Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@aschofie-mobl2.lan
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: "Darrick J. Wong" <djwong@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
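The stale-field problem can be modeled with a union. This is a heavily simplified sketch: `struct page_model` and the helper names are hypothetical, the real structures have a different layout, and the model ignores the order bits kept in page[1]->flags.

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified model: in the first tail page, nr_pages shares storage with
 * the field that reads as memcg_data once that page is its own folio.
 */
struct page_model {
	unsigned long flags;
	union {
		unsigned long memcg_data;	/* meaning for a folio head */
		unsigned long nr_pages;		/* meaning for page[1] of a large folio */
	} u;
};

static void make_large(struct page_model *pages, unsigned long nr)
{
	assert(nr >= 2);
	pages[1].u.nr_pages = nr;
}

/* Split without cleanup: page[1] keeps the stale value. */
static void split_without_reset(struct page_model *pages)
{
	(void)pages;
}

/* Split with cleanup: clear the overlaid field before page[1] is reused. */
static void split_with_reset(struct page_model *pages)
{
	pages[1].u.nr_pages = 0;
}
```

In this model, a split that skips the reset leaves `pages[1].u.memcg_data` non-zero even though nobody ever assigned a memcg, which mirrors the situation the message describes.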
|
|
Pull bcachefs fixes from Kent Overstreet:
"Usual set of small fixes/logging improvements.
One bigger user reported fix, for inode <-> dirent inconsistencies
reported in fsck, after moving a subvolume that had been snapshotted"
* tag 'bcachefs-2025-04-17' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix snapshotting a subvolume, then renaming it
bcachefs: Add missing READ_ONCE() for metadata replicas
bcachefs: snapshot_node_missing is now autofix
bcachefs: Log message when incompat version requested but not enabled
bcachefs: Print version_incompat_allowed on startup
bcachefs: Silence extent_poisoned error messages
bcachefs: btree_root_unreadable_and_scan_found_nothing now AUTOFIX
bcachefs: fix bch2_dev_usage_full_read_fast()
bcachefs: Don't print data read retry success on non-errors
bcachefs: Add missing error handling
bcachefs: Prevent granting write refs when filesystem is read-only
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc3).
No conflicts. Adjacent changes:
tools/net/ynl/pyynl/ynl_gen_c.py
4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place")
7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Subvolume roots and the dirents that point to them are special; they
don't obey the normal snapshot versioning rules because they cross
snapshot boundaries.
We don't keep around older versions of subvolume dirents on rename - we
don't need to, because subvolume dirents are only visible in the parent
subvolume, and we wouldn't be able to match up the different dirent and
inode versions due to crossing the snapshot ID boundary.
That means that when we rename a subvolume that's been snapshotted, the
older version of the subvolume root will become dangling - it won't have
a dirent that points to it.
That's expected, we just need to tell fsck that this is ok.
Fixes: https://github.com/koverstreet/bcachefs/issues/856
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull XFS fixes from Carlos Maiolino:
"This mostly includes fixes and documentation for the zoned allocator
feature merged during previous merge window, but it also adds a sysfs
tunable for the zone garbage collector.
There is also a fix for a regression to the RT device that we'd like
to fix ASAP now that we're getting more users on the RT zoned
allocator"
* tag 'xfs-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: document zoned rt specifics in admin-guide
xfs: fix fsmap for internal zoned devices
xfs: Fix spelling mistake "drity" -> "dirty"
xfs: compute buffer address correctly in xmbuf_map_backing_mem
xfs: add tunable threshold parameter for triggering zone GC
xfs: mark xfs_buf_free as might_sleep()
xfs: remove the leftover xfs_{set,clear}_li_failed infrastructure
|