Age | Commit message (Collapse) | Author |
|
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.
Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
should give us better semantics now.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-9-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
New percpu counters for counting various stats around multigrain
timestamp events, and a new debugfs file for displaying them when
CONFIG_DEBUG_FS is enabled:
- number of attempted ctime updates
- number of successful i_ctime_nsec swaps
- number of fine-grained timestamp fetches
- number of floor value swap events
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-7-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add some tracepoints around various multigrain timestamp events.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-6-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
An update to the inode ctime typically requires the latest clock
value possible. The exception to this rule is when there is a nfsd write
delegation and the server is proxying timestamps from the client.
When nfsd gets a CB_GETATTR response, update the timestamp value in the
inode to the values that the client is tracking. The client doesn't send
a ctime value (since that's always determined by the exported
filesystem), but it can send a mtime value. In the case where it does,
update the ctime to a value commensurate with that instead of the
current time.
If ATTR_DELEG is set, then use ia_ctime value instead of setting the
timestamp to the current time.
With the addition of delegated timestamps, the server may receive a
request to update only the atime, which doesn't involve a ctime update.
Trust the ATTR_CTIME flag in the update and only update the ctime when
it's set.
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-5-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There is racy issue between smb2 session log off and smb2 session setup.
It will cause user-after-free from session log off.
This add session_lock when setting SMB2_SESSION_EXPIRED and referece
count to session struct not to free session while it is being used.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25282
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes, 5 of which are c:stable. All singletons, about half of
which are MM"
* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
CREDITS: sort alphabetically by name
secretmem: disable memfd_secret() if arch cannot set direct map
.mailmap: update Fangrui's email
mm/huge_memory: check pmd_special() only after pmd_present()
resource, kunit: fix user-after-free in resource_test_region_intersects()
fs/proc/kcore.c: allow translation of physical memory addresses
selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
device-dax: correct pgoff align in dax_set_mapping()
kthread: unpark only parked kthread
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
bcachefs: do not use PF_MEMALLOC_NORECLAIM
|
|
inode_bit_waitqueue() is changing - this update clears the way for
sched changes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Like how we already do when the allocator seems to be stuck, check if
we're waiting too long for a journal reservation and print some debug
info.
This is specifically to track down
https://github.com/koverstreet/bcachefs/issues/656
which is showing up in userspace where we don't have sysfs/debugfs to
get the journal debug info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We increased write ref, if the fs went to RO, that would lead to
a deadlock, it actually happens:
00171 ========= TEST generic/279
00171
00172 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=nocow
00172 bcachefs (vdb): recovering from clean shutdown, journal seq 35
00172 bcachefs (vdb): accounting_read... done
00172 bcachefs (vdb): alloc_read... done
00172 bcachefs (vdb): stripes_read... done
00172 bcachefs (vdb): snapshots_read... done
00172 bcachefs (vdb): journal_replay... done
00172 bcachefs (vdb): resume_logged_ops... done
00172 bcachefs (vdb): going read-write
00172 bcachefs (vdb): done starting filesystem
00172 FSTYP -- bcachefs
00172 PLATFORM -- Linux/aarch64 farm3-kvm 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 SMP Tue Oct 8 14:15:12 UTC 2024
00172 MKFS_OPTIONS -- --nocow /dev/vdc
00172 MOUNT_OPTIONS -- /dev/vdc /mnt/scratch
00172
00172 bcachefs (vdc): starting version 1.12: rebalance_work_acct_fix opts=nocow
00172 bcachefs (vdc): initializing new filesystem
00172 bcachefs (vdc): going read-write
00172 bcachefs (vdc): marking superblocks
00172 bcachefs (vdc): initializing freespace
00172 bcachefs (vdc): done initializing freespace
00172 bcachefs (vdc): reading snapshots table
00172 bcachefs (vdc): reading snapshots done
00172 bcachefs (vdc): done starting filesystem
00173 bcachefs (vdc): shutting down
00173 bcachefs (vdc): going read-only
00173 bcachefs (vdc): finished waiting for writes to stop
00173 bcachefs (vdc): flushing journal and stopping allocators, journal seq 4
00173 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 6
00173 bcachefs (vdc): shutdown complete, journal seq 7
00173 bcachefs (vdc): marking filesystem clean
00173 bcachefs (vdc): shutdown complete
00173 bcachefs (vdb): shutting down
00173 bcachefs (vdb): going read-only
00361 INFO: task umount:6180 blocked for more than 122 seconds.
00361 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030
00361 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00361 task:umount state:D stack:0 pid:6180 tgid:6180 ppid:6176 flags:0x00000004
00361 Call trace:
00362 __switch_to (arch/arm64/kernel/process.c:556)
00362 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529)
00363 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621)
00365 bch2_fs_read_only (fs/bcachefs/super.c:346 (discriminator 41))
00367 __bch2_fs_stop (fs/bcachefs/super.c:620)
00368 bch2_put_super (fs/bcachefs/fs.c:1942)
00369 generic_shutdown_super (include/linux/list.h:373 (discriminator 2) fs/super.c:650 (discriminator 2))
00371 bch2_kill_sb (fs/bcachefs/fs.c:2170)
00372 deactivate_locked_super (fs/super.c:434 fs/super.c:475)
00373 deactivate_super (fs/super.c:508)
00374 cleanup_mnt (fs/namespace.c:250 fs/namespace.c:1374)
00376 __cleanup_mnt (fs/namespace.c:1381)
00376 task_work_run (include/linux/sched.h:2024 kernel/task_work.c:224)
00377 do_notify_resume (include/linux/resume_user_mode.h:50 arch/arm64/kernel/entry-common.c:151)
00377 el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:171 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713)
00377 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731)
00378 el0t_64_sync (arch/arm64/kernel/entry.S:598)
00378 INFO: task tee:6182 blocked for more than 122 seconds.
00378 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030
00378 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00378 task:tee state:D stack:0 pid:6182 tgid:6182 ppid:533 flags:0x00000004
00378 Call trace:
00378 __switch_to (arch/arm64/kernel/process.c:556)
00378 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529)
00378 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621)
00378 schedule_preempt_disabled (kernel/sched/core.c:6680)
00379 rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 1))
00379 down_read (kernel/locking/rwsem.c:1529)
00381 bch2_gc_gens (fs/bcachefs/sb-members.h:77 fs/bcachefs/sb-members.h:88 fs/bcachefs/sb-members.h:128 fs/bcachefs/btree_gc.c:1240)
00383 bch2_fs_store_inner (fs/bcachefs/sysfs.c:473)
00385 bch2_fs_internal_store (fs/bcachefs/sysfs.c:417 fs/bcachefs/sysfs.c:580 fs/bcachefs/sysfs.c:576)
00386 sysfs_kf_write (fs/sysfs/file.c:137)
00387 kernfs_fop_write_iter (fs/kernfs/file.c:334)
00389 vfs_write (fs/read_write.c:497 fs/read_write.c:590)
00390 ksys_write (fs/read_write.c:643)
00391 __arm64_sys_write (fs/read_write.c:652)
00391 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
00392 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2))
00392 el0_svc (arch/arm64/include/asm/irqflags.h:55 arch/arm64/include/asm/irqflags.h:76 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713)
00392 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731)
00392 el0t_64_sync (arch/arm64/kernel/entry.S:598)
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch adds a bounds check to the bch2_opt_to_text function to prevent
NULL pointer dereferences when accessing the opt->choices array. This
ensures that the index used is within valid bounds before dereferencing.
The new version enhances the readability.
Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We will get this if we wake up first:
Kernel panic - not syncing: btree_node_write_done leaked btree_trans
since there are still transactions waiting for cycle detectors after
BTREE_NODE_write_in_flight is cleared.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add check for read node's btree_id against BTREE_ID_NR_MAX in
try_read_btree_node to prevent triggering EBUG_ON condition in
bch2_btree_id_root[1].
[1] https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00
Reported-by: syzbot+cf7b2215b5d70600ec00@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00
Fixes: 4409b8081d16 ("bcachefs: Repair pass for scanning for btree nodes")
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- Fix failure to validate that accounting replicas entries point to
valid devices: this wasn't a real bug since they'd be cleaned up by
GC, but is still something we should know about
- Fix failure to validate that dev_data_type entries point to valid
devices: this does fix a real bug, since bch2_accounting_read() would
then try to copy the counters to that device and pop an inconsistent
error when the device didn't exist
- Remove accounting entries that are zeroed or invalid: if we're not
validating them we need to get rid of them: they might not exist in
the superblock, so we need the to trigger the superblock mark path
when they're readded.
This fixes the replication.ktest rereplicate test, which was failing
with "superblock not marked for replicas..."
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fsck can now correctly check if inodes in interior snapshot nodes are
open/in use.
- Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed,
meaning inums in different subvolumes will hash to the same slot. Note
that this is a hack, and will cause problems if anyone ever has the
same file in many different snapshots open all at the same time.
- Then check if any of those subvolumes is a descendent of the snapshot
ID being checked
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Dead code now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
There's an inherent race in taking a snapshot while an unlinked file is
open, and then reattaching it in the child snapshot.
In the interior snapshot node the file will appear unlinked, as though
it should be deleted - it's not referenced by anything in that snapshot
- but we can't delete it, because the file data is referenced by the
child snapshot.
This was being handled incorrectly with
propagate_key_to_snapshot_leaves() - but that doesn't resolve the
fundamental inconsistency of "this file looks like it should be deleted
according to normal rules, but - ".
To fix this, we need to fix the rule for when an inode is deleted. The
previous rule, ignoring snapshots (there was no well-defined rule
for with snapshots) was:
Unlinked, non open files are deleted, either at recovery time or
during online fsck
The new rule is:
Unlinked, non open files, that do not exist in child snapshots, are
deleted.
To make this work transactionally, we add a new inode flag,
BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when
considering whether to delete an inode, or put it on the deleted list.
For transactional consistency, clearing it handled by the inode trigger:
when deleting an inode we check if there are parent inodes which can now
have the BCH_INODE_has_child_snapshot flag cleared.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When /proc/kcore is read an attempt to read the first two pages results in
HW-specific page swap on s390 and another (so called prefix) pages are
accessed instead. That leads to a wrong read.
Allow architecture-specific translation of memory addresses using
kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily
to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That
way an architecture can deal with specific physical memory ranges.
Re-use the existing /dev/mem callback implementation on s390, which
handles the described prefix pages swapping correctly.
For other architectures the default callback is basically NOP. It is
expected the condition (vaddr == __va(__pa(vaddr))) always holds true for
KCORE_RAM memory type.
Link: https://lkml.kernel.org/r/20240930122119.1651546-1-agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "remove PF_MEMALLOC_NORECLAIM" v3.
This patch (of 2):
bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.
We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the function down the chain needed
GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.
While this is not the case in this particular case using the scoped gfp
semantic is not really needed bacause we can easily pus the allocation
context down the chain without too much clutter.
[akpm@linux-foundation.org: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: James Morris <jmorris@namei.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After the delegation is returned to the NFS server remove it
from the server's delegations list to reduce the time it takes
to scan this list.
Network trace captured while running the below script shows the
time taken to service the CB_RECALL increases gradually due to
the overhead of traversing the delegation list in
nfs_delegation_find_inode_server.
The NFS server in this test is a Solaris server which issues
CB_RECALL when receiving the all-zero stateid in the SETATTR.
mount=/mnt/data
for i in $(seq 1 20)
do
echo $i
mkdir $mount/testtarfile$i
time tar -C $mount/testtarfile$i -xf 5000_files.tar
done
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode fix from Gabriel Krisman Bertazi:
- Handle code-points with the Ignorable property as regular character
instead of treating them as an empty string (me)
* tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
unicode: Don't special case ignorable code points
|
|
We don't need to handle them separately. Instead, just let them
decompose/casefold to themselves.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
|
|
All callers treat 0 and 1 returned by expand_files() in the same way
now since the call in alloc_fd() had been made conditional. Just make
it return 0 on success and be done with it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
->close_on_exec[] state is maintained only for opened descriptors;
as the result, anything that marks a descriptor opened has to
set its cloexec state explicitly.
As the result, all calls of __set_open_fd() are followed by
__set_close_on_exec(); might as well fold it into __set_open_fd()
so that cloexec state is defined as soon as the descriptor is
marked opened.
[braino fix folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Now that the sysctl core can handle registration of
"const struct ctl_table" constify the sysctl internal tables.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Putting structure, especially those containing function pointers,
into read-only memory makes the safer and easier to reason about.
Change the sysctl registration APIs to allow registration of
"const struct ctl_table".
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org> # security/*
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
As a preparation to make all the core sysctl code work with const struct
ctl_table switch over the internal function to use the const variant.
Some pointers to "struct ctl_table" need to stay non-const as they are
newly allocated and modified before registration.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
This commit is a replay of commit 6252690f7e1b ("btrfs: fix invalid
mapping of extent xarray state"). We need to call
btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that
xarray DIRTY tag is cleared.
With a refactoring commit 8189197425e7 ("btrfs: refactor
__extent_writepage_io() to do sector-by-sector submission"), it screwed
up and the order is reversed and causing the same hang. Fix the ordering
now in submit_one_sector().
Fixes: 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_load_zone_info() we have an error path that is dereferencing
the name of a device which is a RCU string but we are not holding a RCU
read lock, which is incorrect.
Fix this by using btrfs_err_in_rcu() instead of btrfs_err().
The problem is there since commit 08e11a3db098 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc84 ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").
Fixes: 08e11a3db098 ("btrfs: zoned: load zone's allocation offset")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Fix a typo in comments.
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
fallocate unshare mode explicitly breaks extent sharing. When a
command completes, it checks the data fork for any remaining shared
extents to determine whether the reflink inode flag and COW fork
preallocation can be removed. This logic doesn't consider in-core
pagecache and I/O state, however, which means we can unsafely remove
COW fork blocks that are still needed under certain conditions.
For example, consider the following command sequence:
xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
-c "pwrite 0 32k" -c "funshare 0 1k" <file>
This allocates a data block at offset 0, shares it, and then
overwrites it with a larger buffered write. The overwrite triggers
COW fork preallocation, 32 blocks by default, which maps the entire
32k write to delalloc in the COW fork. All but the shared block at
offset 0 remains hole mapped in the data fork. The unshare command
redirties and flushes the folio at offset 0, removing the only
shared extent from the inode. Since the inode no longer maps shared
extents, unshare purges the COW fork before the remaining 28k may
have written back.
This leaves dirty pagecache backed by holes, which writeback quietly
skips, thus leaving clean, non-zeroed pagecache over holes in the
file. To verify, fiemap shows holes in the first 32k of the file and
reads return different data across a remount:
$ xfs_io -c "fiemap -v" <file>
<file>:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
...
1: [8..511]: hole 504
...
$ xfs_io -c "pread -v 4k 8" <file>
00001000: cd cd cd cd cd cd cd cd ........
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 4k 8" <file>
00001000: 00 00 00 00 00 00 00 00 ........
To avoid this problem, make unshare follow the same rules used for
background cowblock scanning and never purge the COW fork for inodes
with dirty pagecache or in-flight I/O.
Fixes: 46afb0628b86347 ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The following recent commit made DEFINE_WAIT_BIT() type requirements stricter:
2382d68d7d43 ("sched: change wake_up_bit() and related function to expect unsigned long *")
.. which results in a build failure:
> fs/bcachefs/fs.c: In function '__wait_on_freeing_inode':
> fs/bcachefs/fs.c:281:31: error: initialization of 'long unsigned int *' from incompatible pointer type 'u32 *' {aka 'unsigned int *'} [-Wincompatible-pointer-types]
> 281 | DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
Since this code relies on the waitqueue initialization within
inode_bit_waitqueue() anyway, the DEFINE_WAIT_BIT() initialization
is unnecessary - we can just declare a waitqueue entry.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"New:
- implement fallocate for compressed files
- add support for the compression attribute
- optimize large writes to sparse files
Fixes:
- fix several potential deadlock scenarios
- fix various internal bugs detected by syzbot
- add checks before accessing NTFS structures during parsing
- correct the format of output messages
Refactoring:
- replace fsparam_flag_no with fsparam_flag in options parser
- remove unused functions and macros"
* tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3: (25 commits)
fs/ntfs3: Format output messages like others fs in kernel
fs/ntfs3: Additional check in ntfs_file_release
fs/ntfs3: Fix general protection fault in run_is_mapped_full
fs/ntfs3: Sequential field availability check in mi_enum_attr()
fs/ntfs3: Additional check in ni_clear()
fs/ntfs3: Fix possible deadlock in mi_read
ntfs3: Change to non-blocking allocation in ntfs_d_hash
fs/ntfs3: Remove unused al_delete_le
fs/ntfs3: Rename ntfs3_setattr into ntfs_setattr
fs/ntfs3: Replace fsparam_flag_no -> fsparam_flag
fs/ntfs3: Add support for the compression attribute
fs/ntfs3: Implement fallocate for compressed files
fs/ntfs3: Make checks in run_unpack more clear
fs/ntfs3: Add rough attr alloc_size check
fs/ntfs3: Stale inode instead of bad
fs/ntfs3: Refactor enum_rstbl to suppress static checker
fs/ntfs3: Fix sparse warning in ni_fiemap
fs/ntfs3: Fix warning possible deadlock in ntfs_set_state
fs/ntfs3: Fix sparse warning for bigendian
fs/ntfs3: Separete common code for file_read/write iter/splice
...
|
|
Convert the ubifs filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
[sandeen: forward-port old patch, use dedicated context struct]]
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/20241002165834.868046-1-sandeen@redhat.com
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
cc: Richard Weinberger <richard@nod.at>
cc: Zhihao Cheng <chengzhihao1@huawei.com>
cc: linux-mtd@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert the hpfs filesystem to use the new mount API.
Tested by comparing random mount & remount options before and after
the change.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Link: https://lore.kernel.org/r/0a066bbb-59ad-17b0-e413-190569f2fea9@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert the jfs filesystem to use the new mount API.
Tested by comparing random mount & remount options before and after
the change.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/20240926171947.682881-1-sandeen@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Eric Sandeen <sandeen@redhat.com> says:
These were all tested against images I created or obtained, using a
script to test random combinations of valid and invalid mount and
remount options, and comparing the results before and after the changes.
AFAICT, all parsing works as expected and behavior is unchanged.
* patches from https://lore.kernel.org/r/20240916172735.866916-1-sandeen@redhat.com:
hfsplus: convert hfsplus to use the new mount api
hfs: convert hfs to use the new mount api
befs: convert befs to use the new mount api
affs: convert affs to use the new mount api
adfs: convert adfs to use the new mount api
Link: https://lore.kernel.org/r/20240916172735.866916-1-sandeen@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert the hfsplus filesystem to use the new mount API.
Tested by comparing random mount & remount options before and after
the change.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/20240916172735.866916-6-sandeen@redhat.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert the hfs filesystem to use the new mount API.
Tested by comparing random mount & remount options before and after
the change.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/20240916172735.866916-5-sandeen@redhat.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Alice Ryhl <aliceryhl@google.com> says:
This patchset contains the file abstractions needed by the Rust
implementation of the Binder driver.
Please see the Rust Binder RFC for usage examples:
https://lore.kernel.org/rust-for-linux/20231101-rust-binder-v1-0-08ba9197f637@google.com
Users of "rust: types: add `NotThreadSafe`":
[PATCH 5/9] rust: file: add `FileDescriptorReservation`
Users of "rust: task: add `Task::current_raw`":
[PATCH 7/9] rust: file: add `Kuid` wrapper
[PATCH 8/9] rust: file: add `DeferredFdCloser`
Users of "rust: file: add Rust abstraction for `struct file`":
[PATCH RFC 02/20] rust_binder: add binderfs support to Rust binder
[PATCH RFC 03/20] rust_binder: add threading support
Users of "rust: cred: add Rust abstraction for `struct cred`":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions
[PATCH RFC 11/20] rust_binder: send nodes in transaction
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support
Users of "rust: security: add abstraction for secctx":
[PATCH RFC 06/20] rust_binder: add oneway transactions
Users of "rust: file: add `FileDescriptorReservation`":
[PATCH RFC 13/20] rust_binder: add BINDER_TYPE_FD support
[PATCH RFC 14/20] rust_binder: add BINDER_TYPE_FDA support
Users of "rust: file: add `Kuid` wrapper":
[PATCH RFC 05/20] rust_binder: add nodes and context managers
[PATCH RFC 06/20] rust_binder: add oneway transactions
Users of "rust: file: add abstraction for `poll_table`":
[PATCH RFC 07/20] rust_binder: add epoll support
This patchset has some uses of read_volatile in place of READ_ONCE.
Please see the following rfc for context on this:
https://lore.kernel.org/all/20231025195339.1431894-1-boqun.feng@gmail.com/
* patches from https://lore.kernel.org/r/20240915-alice-file-v10-0-88484f7a3dcf@google.com:
rust: file: add abstraction for `poll_table`
rust: file: add `Kuid` wrapper
rust: file: add `FileDescriptorReservation`
rust: security: add abstraction for secctx
rust: cred: add Rust abstraction for `struct cred`
rust: file: add Rust abstraction for `struct file`
rust: task: add `Task::current_raw`
rust: types: add `NotThreadSafe`
Link: https://lore.kernel.org/r/20240915-alice-file-v10-0-88484f7a3dcf@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently backing files are not under any form of rcu protection.
Switching to file_ref requires rcu protection and so does the
speculative vma lookup. Switch backing files to the same rcu slab type
as regular files. There should be no additional magic required as the
lifetime of a backing file is always tied to a regular file.
Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-1-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Darrick J. Wong <djwong@kernel.org> says:
This patchset fixes multiple data corruption bugs in the fallocate
unshare range implementation for fsdax.
* patches from https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs:
fsdax: dax_unshare_iter needs to copy entire blocks
fsdax: remove zeroing code from dax_unshare_iter
iomap: share iomap_unshare_iter predicate code with fsdax
xfs: don't allocate COW extents when unsharing a hole
Link: https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
if we fail to insert the qgroup record we don't error out, we ignore it.
In fact we treat it as if there was no error and there was already an
existing record - we don't distinguish between the cases where
btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
existed and we can free the given record, and the case where it returns
a negative error value, meaning the insertion into the xarray that is
used to track records failed.
Effectively we end up ignoring that we are lacking qgroup record in the
dirty extents xarray, resulting in incorrect qgroup accounting.
Fix this by checking for errors and return them to the callers.
Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are reports that system cannot suspend due to running trim because
the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).
Since trime isn't a critical call it can be interrupted at any time,
in such cases we stop the trim, report the amount of discarded bytes and
return an error.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a large delay.
Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
preparation of introduction of cancellation points to trim. The value
of the chunk size is arbitrary, it can be higher or derived from actual
device capabilities but we can't easily read that using
bio_discard_limit().
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
they are always go in pairs; seeing that they are inlined, might
as well make that a single inline function taking a boolean
argument ("do we want close_on_exec set for that descriptor")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
First of all, tell it how many slots do we want, not which slot
is wanted. It makes one caller (dup_fd()) more straightforward
and doesn't harm another (expand_fdtable()).
Furthermore, make it return ERR_PTR() on failure rather than
returning NULL. Simplifies the callers.
Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG. What the
rules boil down to is
* use the smallest power of two large enough to give us
that many slots
* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less. 128 slots means
1Kb fd array, again.
* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
word contains next_fd, as:
(1) next_fd indicates the lower bound for the first free fd.
(2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
searching.
(3) After fdt is expanded (the bitmap size doubled for each time of expansion),
it would never be shrunk. The search size increases but there are few open fds
available here.
This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
versions. And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-4-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
64 bits in open_fds are mapped to a common bit in full_fds_bits. It is very
likely that a bit in full_fds_bits has been cleared before in
__clear_open_fds()'s operation. Check the clear bit in full_fds_bits before
clearing to avoid unnecessary write and cache bouncing. See commit fc90888d07b8
("vfs: conditionally clear close-on-exec flag") for a similar optimization.
take stock kernel with patch 1 as baseline, it improves pts/blogbench-1.1.0
read for 13%, and write for 5% on Intel ICX 160 cores configuration with
v6.10-rc7.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-3-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
alloc_fd() has a sanity check inside to make sure the struct file mapping to the
allocated fd is NULL. Remove this sanity check since it can be assured by
exisitng zero initilization and NULL set when recycling fd. Meanwhile, add
likely/unlikely and expand_file() call avoidance to reduce the work under
file_lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
Link: https://lore.kernel.org/r/20240717145018.3972922-2-yu.ma@intel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We never had callers for __close_range() except for close_range(2)
itself. Nothing of that sort has appeared in four years and if any users
do show up, we can always separate those suckers again.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|