ext4_mb_mark_bb() currently wrongly calculates cluster len (clen) and
flex_group->free_clusters. This patch fixes that.
Identified based on code review of ext4_mb_mark_bb() function.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a0b035d536bafa88110b74456853774b64c8ac40.1644992609.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
CONFIG_JBD2_DEBUG and the jbd2_journal_enable_debug knob were added
around update_t_max_wait(), since it used to take a spinlock to update
t_max_wait, which could become a bottleneck while starting a
txn (start_this_handle()).
Since the previous patch killed t_handle_lock completely, we can get
rid of this debug config and knob and let t_max_wait be updated by
default again.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ad7319a601fd501079310747ce87d908e0944763.1644992076.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch kills the t_handle_lock transaction spinlock completely from
jbd2.
To explain the reasoning, there are currently three sites at which this
spinlock is used.
1. jbd2_journal_wait_updates()
a. Careful code review shows that we don't need this lock here, since
we wait for any currently ongoing updates based on the atomic variable
t_updates, and we don't take t_handle_lock in stop_this_handle()
anyway.
i.e.
write_lock(&journal->j_state_lock)
jbd2_journal_wait_updates()                  stop_this_handle()
  while (atomic_read(txn->t_updates) {               |
     DEFINE_WAIT(wait);                              |
     prepare_to_wait();                              |
     if (atomic_read(txn->t_updates)          if (atomic_dec_and_test(txn->t_updates))
         write_unlock(&journal->j_state_lock);
         schedule();                              wake_up()
         write_lock(&journal->j_state_lock);
     finish_wait();
  }
  txn->t_state = T_COMMIT
write_unlock(&journal->j_state_lock);
b. Also note that between atomic_inc(&txn->t_updates) in
start_this_handle() and jbd2_journal_wait_updates(), the
synchronization happens via read_lock(journal->j_state_lock) in
start_this_handle();
2. jbd2_journal_extend()
a. jbd2_journal_extend() is called with the handle of each process from
task_struct, so no lock is required when updating member fields of
handle_t.
b. For member fields of h_transaction, all updates happen only via
atomic APIs (which are also within read_lock()).
So there is no need for this transaction spinlock here either.
3. update_t_max_wait()
Based on Jan's suggestion, this can be carefully removed using the
atomic cmpxchg API.
Note that there can be several processes waiting for a new transaction
to be allocated and started. Only one of them will succeed in taking
the write_lock() and allocating a new txn. After that, all of the
processes will update t_max_wait (max transaction wait time). This can
be done with the method below, without taking any locks, using atomic
cmpxchg.
For more details, refer to [1].
new = get_new_val();
old = READ_ONCE(ptr->max_val);
while (old < new)
old = cmpxchg(&ptr->max_val, old, new);
[1]: https://lwn.net/Articles/849237/
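As a minimal, runnable userspace sketch of this lockless max-update
pattern, using C11 atomics in place of the kernel's cmpxchg() (the
variable and function names are illustrative, not the jbd2 ones):
	#include <stdatomic.h>
	#include <stdio.h>

	static _Atomic unsigned long t_max_wait;

	static void update_max_wait(unsigned long new)
	{
		unsigned long old = atomic_load_explicit(&t_max_wait,
							 memory_order_relaxed);

		/* Retry while our sample is larger; on failure the CAS
		 * reloads 'old' with the current value, just like the
		 * cmpxchg() loop above. */
		while (old < new &&
		       !atomic_compare_exchange_weak(&t_max_wait, &old, new))
			;
	}

	int main(void)
	{
		update_max_wait(10);
		update_max_wait(7);	/* smaller sample: no update */
		update_max_wait(42);
		printf("t_max_wait = %lu\n", atomic_load(&t_max_wait)); /* 42 */
		return 0;
	}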
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/d89e599658b4a1f3893a48c6feded200073037fc.1644992076.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
jbd2_journal_wait_updates() is called with j_state_lock held. But if
there is a commit in progress, then this transaction might get committed
and freed via jbd2_journal_commit_transaction() ->
jbd2_journal_free_transaction() while we release j_state_lock.
So check journal->j_running_transaction every time we release and
re-acquire j_state_lock, to avoid a use-after-free.
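A hedged sketch of the resulting wait-loop shape (illustrative of the
recheck pattern, not the exact diff):
	DEFINE_WAIT(wait);

	while (1) {
		/*
		 * The running transaction can be committed and freed once
		 * j_state_lock is dropped, so re-read j_running_transaction
		 * on every iteration instead of caching the pointer.
		 */
		transaction_t *transaction = journal->j_running_transaction;

		if (!transaction)
			break;

		prepare_to_wait(&journal->j_wait_updates, &wait,
				TASK_UNINTERRUPTIBLE);
		if (!atomic_read(&transaction->t_updates)) {
			finish_wait(&journal->j_wait_updates, &wait);
			break;
		}
		write_unlock(&journal->j_state_lock);
		schedule();
		finish_wait(&journal->j_wait_updates, &wait);
		write_lock(&journal->j_state_lock);
	}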
Link: https://lore.kernel.org/r/948c2fed518ae739db6a8f7f83f1d58b504f87d0.1644497105.git.ritesh.list@gmail.com
Fixes: 4f98186848707f53 ("jbd2: refactor wait logic for transaction updates into a common function")
Cc: stable@kernel.org
Reported-and-tested-by: syzbot+afa2ca5171d93e44b348@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After commit 6e47a3cc68fc ("ext4: get rid of super block and sbi from
handle_mount_ops()") the 'abort' mount option stopped working. This is
because we're using the ctx_set_mount_flags() helper, which expects an
argument with the appropriate bit set, but instead got
EXT4_MF_FS_ABORTED, which is a bit position. ext4_set_mount_flag() uses
set_bit() while ctx_set_mount_flags() was using a bitwise OR.
Create a separate helper ctx_set_mount_flag() to handle setting the
mount_flags correctly.
While we're at it, clean up the EXT4_SET_CTX macros so that we only
create helpers that we actually use, to avoid warnings.
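A standalone illustration of this bug class; the flag value below is
assumed for the example (in ext4 it is an enum bit position):
	#include <assert.h>

	#define EXT4_MF_FS_ABORTED 1	/* a bit *position*, not a mask */

	int main(void)
	{
		unsigned long flags = 0;

		flags |= EXT4_MF_FS_ABORTED;		/* wrong: sets bit 0 */
		assert(flags == 0x1);

		flags = 0;
		flags |= 1UL << EXT4_MF_FS_ABORTED;	/* right: sets bit 1, */
		assert(flags == 0x2);			/* like set_bit() does */
		return 0;
	}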
Fixes: 6e47a3cc68fc ("ext4: get rid of super block and sbi from handle_mount_ops()")
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Ye Bin <yebin10@huawei.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220201131345.77591-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Al Viro brought it to my attention that the dentries may not be filled
when parse_options() is called, causing the call to set_gid() to
possibly crash. It should only be called if parse_options() succeeds
completely anyway.
He suggested that the logical place to do the update is in apply_options().
Link: https://lore.kernel.org/all/20220225165219.737025658@goodmis.org/
Link: https://lkml.kernel.org/r/20220225153426.1c4cab6b@gandalf.local.home
Cc: stable@vger.kernel.org
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 48b27b6b5191 ("tracefs: Set all files to the same group ownership as the mount option")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The use of mapping_set_error() in conjunction with calls to
filemap_check_errors() is problematic because every error gets reported
as either an EIO or an ENOSPC by filemap_check_errors() in functions
such as filemap_write_and_wait() or filemap_write_and_wait_range().
In almost all cases, we prefer to use the more nuanced wb errors.
Fixes: b8946d7bfb94 ("NFS: Revalidate the file mapping on all fatal writeback errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Add a helper for the xattr mask so that we can get rid of the inlined
ifdefs.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We should never expect the 'xattr_cache' to be non-null in that case,
hence nfs_set_cache_invalid() is just going to optimise it away.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Ensure that we always initialise the 'xattr_support' field in struct
nfs_fsinfo, so that nfs_server_set_fsinfo() doesn't declare our NFSv2/v3
client to be capable of supporting the NFSv4.2 xattr protocol by setting
the NFS_CAP_XATTR capability.
This configuration can cause nfs_do_access() to set access mode bits
that are unsupported by the NFSv3 ACCESS call, which may confuse
spec-compliant servers.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: b78ef845c35d ("NFSv4.2: query the server for extended attribute support")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Now that we have more fine-grained attribute revalidation, let's just
get rid of NFS_INO_REVAL_PAGECACHE.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
In order to differentiate client state, assign a random uuid to the
uniquifying portion of the client identifier when a network namespace is
created. Containers may still override this value if they wish to maintain
stable client identifiers by writing to /sys/fs/nfs/net/client/identifier,
either via udev rules or other means.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
In NFSv4.1+, the server is allowed to set the
NFS4_RESULT_PRESERVE_UNLINKED flag in its reply to the OPEN call, which
tells the client that it does not need to do a silly rename of an
opened file when that file is being removed.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
There doesn't seem to be any reason why the copy offload code can't use
GFP_KERNEL. It can't get called by direct reclaim.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Assume that the higher layers will have set memalloc_nofs_save/restore
as appropriate.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Assume that sections that should not re-enter the filesystem are already
protected with memalloc_nofs_save/restore calls, so relax those GFP_NOFS
instances which might be used by other contexts.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We should use either GFP_KERNEL or GFP_NOFS, but not both. Also strip
GFP_KERNEL_ACCOUNT down to GFP_KERNEL. This memory is shrinkable, so
does not need to be limited by kmemcg.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Allow kmemcg to limit the number of NFSv4 delegation, lock and open
state trackers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Allow kmemcg to limit the number of open/lock file contexts, in the same
way that it limits the parent file descriptors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
If memory allocation triggers a direct reclaim from the state recovery
thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure
that doesn't happen.
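The pattern being applied is roughly the following;
memalloc_nofs_save()/memalloc_nofs_restore() are the real
<linux/sched/mm.h> APIs, while the surrounding function and helper are
hypothetical:
	#include <linux/sched/mm.h>

	static void state_manager_step(struct recovery_ctx *ctx)	/* hypothetical */
	{
		unsigned int flags;

		/* Every allocation below now implicitly behaves as GFP_NOFS,
		 * so direct reclaim cannot recurse back into the filesystem. */
		flags = memalloc_nofs_save();
		run_recovery(ctx);			/* hypothetical helper */
		memalloc_nofs_restore(flags);		/* restore previous mask */
	}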
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The reference counting issue happens in two error paths in the
function _nfs42_proc_copy_notify(). In both error paths, the function
simply returns the error code and forgets to balance the refcount of
the object `ctx`, bumped by get_nfs_open_context() earlier, which may
cause refcount leaks.
Fix it by balancing the refcount of the `ctx` object before the
function returns in both error paths.
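The generic shape of such a fix, as a hedged sketch
(get_nfs_open_context()/put_nfs_open_context() are the real helpers;
the intermediate steps are hypothetical):
	struct nfs_open_context *ctx;
	int status;

	ctx = get_nfs_open_context(nfs_file_open_context(src));

	status = setup_copy_notify(ctx);	/* hypothetical step */
	if (status)
		goto out_put;	/* was: return status; which leaked ctx */

	status = send_copy_notify(ctx);		/* hypothetical step */
out_put:
	put_nfs_open_context(ctx);		/* balance the refcount */
	return status;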
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
git://git.infradead.org/users/hch/configfs
Pull configfs fix from Christoph Hellwig:
- fix a race in configfs_{,un}register_subsystem (ChenXiaoSong)
* tag 'configfs-5.17-2022-02-25' of git://git.infradead.org/users/hch/configfs:
configfs: fix a race in configfs_{,un}register_subsystem()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This is a hopefully last batch of fixes for defrag that got broken in
5.16, all stable material.
The remaining reported problem is excessive IO with autodefrag due to
various conditions in the defrag code not met or missing"
* tag 'for-5.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reduce extent threshold for autodefrag
btrfs: autodefrag: only scan one inode once
btrfs: defrag: don't use merged extent map for their generation check
btrfs: defrag: bring back the old file extent search behavior
btrfs: defrag: remove an ambiguous condition for rejection
btrfs: defrag: don't defrag extents which are already at max capacity
btrfs: defrag: don't try to merge regular extents with preallocated extents
btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
btrfs: prevent copying too big compressed lzo segment
|
|
NFS is one of the last two users of the deprecated ->readpages aop.
This conversion looks straightforward, but I have only compile-tested
it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
nfs42_files_from_same_server() is called to check whether freeing
cn_resp is required; just do the free unconditionally.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
cnt should be passed to sb_has_quota_active() instead of type, to check
the active quota properly.
Moreover, when the type is -1, a compiler with enough inline knowledge
can discard the sb_has_quota_active() check altogether, causing a NULL
pointer dereference at the following inode_lock(dqopt->files[cnt]):
[ 2.796010] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0
[ 2.796024] Mem abort info:
[ 2.796025] ESR = 0x96000005
[ 2.796028] EC = 0x25: DABT (current EL), IL = 32 bits
[ 2.796029] SET = 0, FnV = 0
[ 2.796031] EA = 0, S1PTW = 0
[ 2.796032] Data abort info:
[ 2.796034] ISV = 0, ISS = 0x00000005
[ 2.796035] CM = 0, WnR = 0
[ 2.796046] user pgtable: 4k pages, 39-bit VAs, pgdp=00000003370d1000
[ 2.796048] [00000000000000a0] pgd=0000000000000000, pud=0000000000000000
[ 2.796051] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[ 2.796056] CPU: 7 PID: 640 Comm: f2fs_ckpt-259:7 Tainted: G S 5.4.179-arter97-r8-64666-g2f16e087f9d8 #1
[ 2.796057] Hardware name: Qualcomm Technologies, Inc. Lahaina MTP lemonadep (DT)
[ 2.796059] pstate: 80c00005 (Nzcv daif +PAN +UAO)
[ 2.796065] pc : down_write+0x28/0x70
[ 2.796070] lr : f2fs_quota_sync+0x100/0x294
[ 2.796071] sp : ffffffa3f48ffc30
[ 2.796073] x29: ffffffa3f48ffc30 x28: 0000000000000000
[ 2.796075] x27: ffffffa3f6d718b8 x26: ffffffa415fe9d80
[ 2.796077] x25: ffffffa3f7290048 x24: 0000000000000001
[ 2.796078] x23: 0000000000000000 x22: ffffffa3f7290000
[ 2.796080] x21: ffffffa3f72904a0 x20: ffffffa3f7290110
[ 2.796081] x19: ffffffa3f77a9800 x18: ffffffc020aae038
[ 2.796083] x17: ffffffa40e38e040 x16: ffffffa40e38e6d0
[ 2.796085] x15: ffffffa40e38e6cc x14: ffffffa40e38e6d0
[ 2.796086] x13: 00000000000004f6 x12: 00162c44ff493000
[ 2.796088] x11: 0000000000000400 x10: ffffffa40e38c948
[ 2.796090] x9 : 0000000000000000 x8 : 00000000000000a0
[ 2.796091] x7 : 0000000000000000 x6 : 0000d1060f00002a
[ 2.796093] x5 : ffffffa3f48ff718 x4 : 000000000000000d
[ 2.796094] x3 : 00000000060c0000 x2 : 0000000000000001
[ 2.796096] x1 : 0000000000000000 x0 : 00000000000000a0
[ 2.796098] Call trace:
[ 2.796100] down_write+0x28/0x70
[ 2.796102] f2fs_quota_sync+0x100/0x294
[ 2.796104] block_operations+0x120/0x204
[ 2.796106] f2fs_write_checkpoint+0x11c/0x520
[ 2.796107] __checkpoint_and_complete_reqs+0x7c/0xd34
[ 2.796109] issue_checkpoint_thread+0x6c/0xb8
[ 2.796112] kthread+0x138/0x414
[ 2.796114] ret_from_fork+0x10/0x18
[ 2.796117] Code: aa0803e0 aa1f03e1 52800022 aa0103e9 (c8e97d02)
[ 2.796120] ---[ end trace 96e942e8eb6a0b53 ]---
[ 2.800116] Kernel panic - not syncing: Fatal exception
[ 2.800120] SMP: stopping secondary CPUs
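A hedged sketch of the affected loop shape in f2fs_quota_sync(), not
the verbatim code:
	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
		if (type != -1 && cnt != type)
			continue;
		/* Was sb_has_quota_active(sb, type): with type == -1 the
		 * compiler can discard the check entirely, so inactive
		 * cnt slots reach inode_lock() with a NULL inode. */
		if (!sb_has_quota_active(sb, cnt))
			continue;
		inode_lock(dqopt->files[cnt]);
		/* ... sync the quota file ... */
	}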
Fixes: 9de71ede81e6 ("f2fs: quota: fix potential deadlock")
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Lockdep uses lock class keys in its analysis. init_rwsem() instantiates
one lock class key with each init_rwsem() user as follows:
#define init_rwsem(sem) \
do { \
static struct lock_class_key __key; \
\
__init_rwsem((sem), #sem, &__key); \
} while (0)
Commit e4544b63a7ee ("f2fs: move f2fs to use reader-unfair rwsems") reduced
the number of lock class keys from one per init_rwsem() user to one per
file in which init_f2fs_rwsem() is used. This causes the same lock class key
to be associated with multiple f2fs rwsems and also triggers a number of
false positive lockdep deadlock reports. Fix this by again instantiating one
lock class key with each init_f2fs_rwsem() caller.
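Mirroring the init_rwsem() definition quoted above, the fix plausibly
takes this shape (a sketch; __init_f2fs_rwsem is assumed to be the
underlying init helper):
	#define init_f2fs_rwsem(sem)					\
	do {								\
		/* one static key per call site, as lockdep expects */	\
		static struct lock_class_key __key;			\
									\
		__init_f2fs_rwsem((sem), #sem, &__key);			\
	} while (0)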
Cc: Tim Murray <timmurray@google.com>
Reported-by: syzbot+0b9cadf5fc45a98a5083@syzkaller.appspotmail.com
Fixes: e4544b63a7ee ("f2fs: move f2fs to use reader-unfair rwsems")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch fixes an xfstests generic/475 failure.
[ 293.680694] F2FS-fs (dm-1): May loss orphan inode, run fsck to fix.
[ 293.685358] Buffer I/O error on dev dm-1, logical block 8388592, async page read
[ 293.691527] Buffer I/O error on dev dm-1, logical block 8388592, async page read
[ 293.691764] sh (7615): drop_caches: 3
[ 293.691819] sh (7616): drop_caches: 3
[ 293.694017] Buffer I/O error on dev dm-1, logical block 1, async page read
[ 293.695659] sh (7618): drop_caches: 3
[ 293.696979] sh (7617): drop_caches: 3
[ 293.700290] sh (7623): drop_caches: 3
[ 293.708621] sh (7626): drop_caches: 3
[ 293.711386] sh (7628): drop_caches: 3
[ 293.711825] sh (7627): drop_caches: 3
[ 293.716738] sh (7630): drop_caches: 3
[ 293.719613] sh (7632): drop_caches: 3
[ 293.720971] sh (7633): drop_caches: 3
[ 293.727741] sh (7634): drop_caches: 3
[ 293.730783] sh (7636): drop_caches: 3
[ 293.732681] sh (7635): drop_caches: 3
[ 293.732988] sh (7637): drop_caches: 3
[ 293.738836] sh (7639): drop_caches: 3
[ 293.740568] sh (7641): drop_caches: 3
[ 293.743053] sh (7640): drop_caches: 3
[ 293.821889] ------------[ cut here ]------------
[ 293.824654] kernel BUG at fs/f2fs/node.c:3334!
[ 293.826226] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 293.828713] CPU: 0 PID: 7653 Comm: umount Tainted: G OE 5.17.0-rc1-custom #1
[ 293.830946] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 293.832526] RIP: 0010:f2fs_destroy_node_manager+0x33f/0x350 [f2fs]
[ 293.833905] Code: e8 d6 3d f9 f9 48 8b 45 d0 65 48 2b 04 25 28 00 00 00 75 1a 48 81 c4 28 03 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b
[ 293.837783] RSP: 0018:ffffb04ec31e7a20 EFLAGS: 00010202
[ 293.839062] RAX: 0000000000000001 RBX: ffff9df947db2eb8 RCX: 0000000080aa0072
[ 293.840666] RDX: 0000000000000000 RSI: ffffe86c0432a140 RDI: ffffffffc0b72a21
[ 293.842261] RBP: ffffb04ec31e7d70 R08: ffff9df94ca85780 R09: 0000000080aa0072
[ 293.843909] R10: ffff9df94ca85700 R11: ffff9df94e1ccf58 R12: ffff9df947db2e00
[ 293.845594] R13: ffff9df947db2ed0 R14: ffff9df947db2eb8 R15: ffff9df947db2eb8
[ 293.847855] FS: 00007f5a97379800(0000) GS:ffff9dfa77c00000(0000) knlGS:0000000000000000
[ 293.850647] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 293.852940] CR2: 00007f5a97528730 CR3: 000000010bc76005 CR4: 0000000000370ef0
[ 293.854680] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 293.856423] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 293.858380] Call Trace:
[ 293.859302] <TASK>
[ 293.860311] ? ttwu_do_wakeup+0x1c/0x170
[ 293.861800] ? ttwu_do_activate+0x6d/0xb0
[ 293.863057] ? _raw_spin_unlock_irqrestore+0x29/0x40
[ 293.864411] ? try_to_wake_up+0x9d/0x5e0
[ 293.865618] ? debug_smp_processor_id+0x17/0x20
[ 293.866934] ? debug_smp_processor_id+0x17/0x20
[ 293.868223] ? free_unref_page+0xbf/0x120
[ 293.869470] ? __free_slab+0xcb/0x1c0
[ 293.870614] ? preempt_count_add+0x7a/0xc0
[ 293.871811] ? __slab_free+0xa0/0x2d0
[ 293.872918] ? __wake_up_common_lock+0x8a/0xc0
[ 293.874186] ? __slab_free+0xa0/0x2d0
[ 293.875305] ? free_inode_nonrcu+0x20/0x20
[ 293.876466] ? free_inode_nonrcu+0x20/0x20
[ 293.877650] ? debug_smp_processor_id+0x17/0x20
[ 293.878949] ? call_rcu+0x11a/0x240
[ 293.880060] ? f2fs_destroy_stats+0x59/0x60 [f2fs]
[ 293.881437] ? kfree+0x1fe/0x230
[ 293.882674] f2fs_put_super+0x160/0x390 [f2fs]
[ 293.883978] generic_shutdown_super+0x7a/0x120
[ 293.885274] kill_block_super+0x27/0x50
[ 293.886496] kill_f2fs_super+0x7f/0x100 [f2fs]
[ 293.887806] deactivate_locked_super+0x35/0xa0
[ 293.889271] deactivate_super+0x40/0x50
[ 293.890513] cleanup_mnt+0x139/0x190
[ 293.891689] __cleanup_mnt+0x12/0x20
[ 293.892850] task_work_run+0x64/0xa0
[ 293.894035] exit_to_user_mode_prepare+0x1b7/0x1c0
[ 293.895409] syscall_exit_to_user_mode+0x27/0x50
[ 293.896872] do_syscall_64+0x48/0xc0
[ 293.898090] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 293.899517] RIP: 0033:0x7f5a975cd25b
Fixes: 7735730d39d7 ("f2fs: fix to propagate error from __get_meta_page()")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We need to calculate the max file size accurately if the total number
of blocks that can be addressed by the block tree exceeds the
upper_limit. But this check is not correct now: it only computes the
total data blocks, but misses the metadata blocks that are also
needed. So in the case of "data blocks < upper_limit && total blocks >
upper_limit", we will get a wrong result. Fortunately, this case cannot
happen in reality, but it's confusing and better to correct the
computation.
bits  data blocks     metadata blocks  upper_limit
 10   16843020        66051            2147483647
 11   134480396       263171           1073741823
 12   1074791436      1050627          536870911   (*)
 13   8594130956      4198403          268435455   (*)
 14   68736258060     16785411         134217727   (*)
 15   549822930956    67125251         67108863    (*)
 16   4398314962956   268468227        33554431    (*)
[*] Need to calculate in depth.
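The table can be reproduced with a quick userspace check; the sketch
below assumes 4-byte block pointers (so blocksize/4 pointers per
indirect block) and a 32-bit i_blocks counting 512-byte sectors:
	#include <stdio.h>

	int main(void)
	{
		for (int bits = 10; bits <= 16; bits++) {
			/* number of block pointers in one indirect block */
			unsigned long long p = (1ULL << bits) / 4;
			unsigned long long data = 12 + p + p * p + p * p * p;
			unsigned long long meta = 1 + (1 + p) + (1 + p + p * p);
			/* i_blocks is a u32 counting 512-byte sectors */
			unsigned long long upper = ((1ULL << 32) - 1) >> (bits - 9);

			printf("%2d %14llu %10llu %11llu%s\n", bits, data,
			       meta, upper, data + meta > upper ? " (*)" : "");
		}
		return 0;
	}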
Fixes: 1c2d14212b15 ("ext2: Fix underflow in ext2_max_size()")
Link: https://lore.kernel.org/r/20220212050532.179055-1-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
As talked about in commit b792e64021ec ("drm: no need to check return
value of debugfs_create functions"), in many cases we can get away
with totally skipping checking the errors of debugfs functions. Let's
document that so people don't add new code that needlessly checks
these errors.
Probably this note could be added to a boatload of functions, but
that's a lot of duplication. Let's just add it to the two most
frequent ones and hope people will get the idea.
Suggested-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20220222154555.1.I26d364db7a007f8995e8f0dac978673bc8e9f5e2@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
There are no remaining callers of set_fs(), so CONFIG_SET_FS
can be removed globally, along with the thread_info field and
any references to it.
This turns access_ok() into a cheaper check against TASK_SIZE_MAX.
As CONFIG_SET_FS is now gone, drop all remaining references to
set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().
Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Pull io_uring fixes from Jens Axboe:
- Add a conditional schedule point in io_add_buffers() (Eric)
- Fix for a quiesce speedup merged in this release (Dylan)
- Don't convert to jiffies for event timeout waiting, it's way too
coarse when we accept a timespec as input (me)
* tag 'io_uring-5.17-2022-02-23' of git://git.kernel.dk/linux-block:
io_uring: disallow modification of rsrc_data during quiesce
io_uring: don't convert to jiffies for waiting on timeouts
io_uring: add a schedule point in io_add_buffers()
|
|
There is a big gap between inode_should_defrag() and the autodefrag
extent size threshold. inode_should_defrag() has a flexible
@small_write value: for compressed extents it's 16K, and for
non-compressed extents it's 64K.
However, the autodefrag extent size threshold is always fixed to the
default value (256K).
This means the following write sequence will cause autodefrag to
defrag ranges which didn't trigger autodefrag:
pwrite 0 8k
sync
pwrite 8k 128K
sync
The latter 128K write will also be considered a defrag target (if
other conditions are met), even though only the 8K write really
triggered autodefrag.
Such behavior can cause extra IO for autodefrag.
Close the gap by copying the @small_write value into inode_defrag, so
that later autodefrag can use the same @small_write value which
triggered it.
Together with the existing transid value, this allows autodefrag to
really scan only the ranges which triggered it.
Although this change mostly reduces the extent_thresh value for
autodefrag, I believe we should eventually allow users to specify the
autodefrag extent threshold through mount options, but that's another
problem to consider in the future.
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
fsnotify() treats FS_MODIFY events specially - it does not skip them
even if FS_MODIFY does not appear in the object's fsnotify mask. This
is because send_to_group() checks whether FS_MODIFY needs to clear the
ignored mask of marks.
The common case is that an object does not have any mark with ignored
mask and in particular, that it does not have a mark with ignored mask
and without the FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY flag.
Set FS_MODIFY in object's fsnotify mask during fsnotify_recalc_mask()
if object has a mark with an ignored mask and without the
FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY flag and remove the special
treatment of FS_MODIFY in fsnotify(), so that FS_MODIFY events could
be optimized in the common case.
Call fsnotify_recalc_mask() from fanotify after adding or removing an
ignored mask from a mark without FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY
or when adding the FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY flag to a mark
with ignored mask (the flag cannot be removed by fanotify uapi).
Performance results for doing 10000000 write(2)s to tmpfs:

                           vanilla          patched
without notification mark  25.486+-1.054    24.965+-0.244
with notification mark     30.111+-0.139    26.891+-1.355

So we can see that the overhead of the notification subsystem has been
drastically reduced.
Link: https://lore.kernel.org/r/20220223151438.790268-3-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
fsnotify_parent() does not consider the parent's mark at all unless
the parent inode shows interest in events on children and in the
specific event.
So unless the parent added the event to both its mark mask and its
ignored mask, the event will not be ignored.
Fix this by declaring the interest of an object in an event when the
event is in either a mark mask or ignored mask.
Link: https://lore.kernel.org/r/20220223151438.790268-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
still just exhausting all inode_defrag items in the tree.
This means it doesn't make much difference to requeue an inode_defrag,
other than scanning the inode from the beginning till its end.
Change the behaviour to always scan from offset 0 of an inode till the
end.
By this we get the following benefits:
- Straightforward code
- No more requeue-related checks
- Fewer members in inode_defrag
We still keep the same btrfs_get_fs_root() and btrfs_iget() checks for
each loop, and add an extra should_auto_defrag() check per loop.
Note: the patch needs to be backported and is intentionally written
to minimize the diff size, code will be cleaned up later.
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For extent maps, if they are not compressed extents and are adjacent by
logical addresses and file offsets, they can be merged into one larger
extent map.
Such a merged extent map will have the highest generation of all the
original ones.
But this brings a problem for autodefrag, as it relies on an accurate
extent_map::generation to determine if an extent should be defragged.
For merged extent maps, their higher generation can mark some older
extents to be defragged even though the original extent maps don't meet
the minimal generation threshold.
Thus this will cause extra IO.
To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED,
to indicate if the extent map is merged from one or more ems.
And for autodefrag, if we find a merged extent map whose generation
meets the generation requirement, we just don't use it, and go back to
defrag_get_extent() to read extent maps from the subvolume trees.
This could cause more read IO, but should result in less defrag data
being written, so in the long run it should be a win for autodefrag.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For defrag, we don't really want to use btrfs_get_extent() to iterate
all extent maps of an inode.
The reasons are:
- btrfs_get_extent() can merge extent maps
The resulting em has the higher generation of the two, causing defrag
to mark unnecessary parts of such a merged large extent map.
This in fact can result in extra IO for autodefrag in v5.16+ kernels.
However this patch is not going to completely solve the problem, as
one can still use read() to trigger extent map reading and get them
merged.
The complete solution for the extent map merging generation problem
will come as a standalone fix.
- btrfs_get_extent() caches the extent map result
Normally that's fine, but for defrag the target range may not get
another read/write for a long, long time.
Such a cache would only increase memory usage.
- btrfs_get_extent() doesn't skip older extent maps
Unlike the old find_new_extent(), which uses btrfs_search_forward() to
skip older subtrees, it will pick up unnecessary extent maps.
This patch fixes the regression by introducing defrag_get_extent() to
replace the btrfs_get_extent() call.
This helper will:
- Not cache the file extent we found
It will search the file extent and manually convert it to an em.
- Use btrfs_search_forward() to skip entire ranges which were only
modified in the past, before the target generation
This should reduce the IO for autodefrag.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
From the very beginning of btrfs defrag, there has been a check to
reject extents which meet both conditions:
- Physically adjacent
We may in fact want to defrag physically adjacent extents, to reduce
the number of extents or the size of the subvolume tree.
- Larger than 128K
This may have been there for compressed extents, but unfortunately 128K
is exactly the max capacity for compressed extents, and since the check
is > 128K, it never rejects compressed extents.
Furthermore, since the compressed extent capacity bug is fixed by the
previous patch, there is no reason for that check anymore.
The original check covers only a very small range (the target extent
size is > 128K, and the default extent threshold is 256K), and for
compressed extents it doesn't work at all.
So it's better just to remove the rejection, and allow us to defrag
physically adjacent extents.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
For compressed extents, defrag ioctl will always try to defrag any
compressed extents, wasting not only IO but also CPU time to
compress/decompress:
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
sync
xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $MNT/foobar
btrfs filesystem defrag $MNT/foobar
sync
echo "=== after ==="
xfs_io -c "fiemap -v" $MNT/foobar
Then it shows that the 2 128K extents just get COWed for no extra
benefit, with extra IO/CPU spent:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag never checks if an extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means that, by default, the btrfs defrag ioctl will mark every
compressed extent which is not adjacent to a hole/preallocated range
for defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
With newer kernels (v5.16 and newer) btrfs will not defrag
preallocated extents, but it will defrag the extent just before the
preallocated extent, even if it's just a single sector.
This can be exposed by the following small script:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
xfs_io -c "fiemap -v" $mnt/file
btrfs fi defrag $mnt/file
sync
xfs_io -c "fiemap -v" $mnt/file
The output looks like this on older kernels:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..39]: 26664..26703 40 0x1
This defrags the single sector along with the preallocated extent, and
replaces them with a regular extent in a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.
On the other hand, v5.16 is slightly better:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26664..26671 8 0x0
1: [8..39]: 26632..26663 32 0x801
The preallocated range is not defragged, but the sector before it still
gets defragged, which has no need for it.
[CAUSE]
One of the functions reused by the old and new behavior is
defrag_check_next_extent(), which determines whether we should defrag
the current extent by checking the next one.
It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.
On the other hand, outside of that function, both old and new kernels
reject preallocated extents.
This inconsistency causes the behavior above.
[FIX]
- Also check if the next extent is preallocated; if so, don't defrag
the current extent.
- Add comments to each branch explaining why we reject the extent.
This will reduce the IO caused by the defrag ioctl and autodefrag.
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to have struct kernfs_root be part of kernfs.h for
the whole kernel to see and poke around it. Move it internal to kernfs
code and provide a helper function, kernfs_root_to_node(), to handle the
one field that kernfs users were directly accessing from the structure.
Cc: Imran Khan <imran.f.khan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When configfs_register_subsystem() or configfs_unregister_subsystem()
is executing link_group() or unlink_group(), it is possible for two
processes to add or delete list entries concurrently. Some unfortunate
interleavings of them can cause a kernel panic.
One such case is:
A --> B --> C --> D
A <-- B <-- C <-- D
delete list_head *B | delete list_head *C
--------------------------------|-----------------------------------
configfs_unregister_subsystem | configfs_unregister_subsystem
unlink_group | unlink_group
unlink_obj | unlink_obj
list_del_init | list_del_init
__list_del_entry | __list_del_entry
__list_del | __list_del
// next == C |
next->prev = prev |
| next->prev = prev
prev->next = next |
| // prev == B
| prev->next = next
Fix this by adding a mutex around the calls to link_group() and
unlink_group(). Since the parent configfs_subsystem is NULL when the
config_item is the root, introduce a global configfs_subsystem_mutex.
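The shape of the fix, as a hedged sketch (configfs_subsystem_mutex is
the lock named above; the surrounding registration code is elided):
	static DEFINE_MUTEX(configfs_subsystem_mutex);

	/* in configfs_register_subsystem(): */
		mutex_lock(&configfs_subsystem_mutex);
		link_group(to_config_group(sd->s_element), group);
		mutex_unlock(&configfs_subsystem_mutex);

	/* in configfs_unregister_subsystem(): */
		mutex_lock(&configfs_subsystem_mutex);
		unlink_group(group);
		mutex_unlock(&configfs_subsystem_mutex);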
Fixes: 7063fbf22611 ("[PATCH] configfs: User-driven configuration filesystem")
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
io_rsrc_ref_quiesce will unlock the uring while it waits for references
to the io_rsrc_data to be killed.
There are other places that might add references to the data via calls
to io_rsrc_node_switch.
There is a race condition where such a reference can be added after the
completion has been signalled. At this point the io_rsrc_ref_quiesce
call will wake up and relock the uring, assuming the data is unused and
can be freed, although it is actually being used.
To fix this, check in io_rsrc_ref_quiesce whether a resource has been
revived.
Reported-by: syzbot+ca8bf833622a1662745b@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220222161751.995746-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove usage of AOP_FLAG_CONT_EXPAND flag. Reiserfs is the only user of
it and it is easy to avoid.
Link: https://lore.kernel.org/r/20220220232219.1235-1-edward.shishkin@gmail.com
Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Since commit 393c3714081a ("kernfs: switch global kernfs_rwsem lock to
per-fs lock"), the per-fs kernfs_rwsem has replaced the global
kernfs_rwsem.
Remove the redundant declaration of the global kernfs_rwsem.
Fixes: 393c3714081a ("kernfs: switch global kernfs_rwsem lock to per-fs lock")
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Link: https://lore.kernel.org/r/20220218010205.717582-1-imran.f.khan@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
...to help userland apps that need to identify FUSE mounts.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If an application calls io_uring_enter(2) with a timespec passed in,
convert that timespec to ktime_t rather than jiffies. The latter does
not provide the granularity the application may expect, and may in
fact provide different granularity on different systems, depending on
what the HZ value is configured at.
Turn the timespec into an absolute ktime_t, and use that with
schedule_hrtimeout() instead.
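A standalone illustration of why jiffies are too coarse here (the HZ
value and round-up rule are assumptions of the sketch, not io_uring
code):
	#include <stdio.h>

	#define HZ 100	/* e.g. CONFIG_HZ=100: one tick = 10ms */

	static unsigned long long ns_to_jiffies(unsigned long long ns)
	{
		unsigned long long tick_ns = 1000000000ULL / HZ;

		return (ns + tick_ns - 1) / tick_ns;	/* round up to a tick */
	}

	int main(void)
	{
		unsigned long long want = 500000;	/* ask for 500us */
		unsigned long long j = ns_to_jiffies(want);

		/* Prints: 500000ns -> 1 jiffies = 10000000ns, a 20x overshoot */
		printf("%lluns -> %llu jiffies = %lluns\n",
		       want, j, j * (1000000000ULL / HZ));
		return 0;
	}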
Link: https://github.com/axboe/liburing/issues/531
Cc: stable@vger.kernel.org
Reported-by: Bob Chen <chenbo.chen@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull mount_setattr test/doc fixes from Christian Brauner:
"This contains a fix for one of the selftests for the mount_setattr
syscall to create idmapped mounts, an entry for idmapped mounts for
maintainers, and missing kernel documentation for the helper we split
out some time ago to get and yield write access to a mount when
changing mount properties"
* tag 'fs.mount_setattr.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: add kernel doc for mnt_{hold,unhold}_writers()
MAINTAINERS: add entry for idmapped mounts
tests: fix idmapped mount_setattr test
|