This fixes a 'transaction should be locked' error in backpointers fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Closes: https://syzkaller.appspot.com/bug?extid=8996d8f176cf946ef641
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Instead of popping an assert in bch2_write(), WARN and print out some
debugging info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
After commit 230e9fc28604 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9a0 ("kmemcg:
account for certain kmem allocations to memcg").
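A minimal sketch of the change, assuming the inode cache is set up with
the KMEM_CACHE() helper (the slab flag names are real):

    bch2_inode_cache = KMEM_CACHE(bch_inode_info,
                                  SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT);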
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
silly race
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Consolidate them under erofs_map_blocks_* for simplicity since we
have many other ways to know if a given inode is compressed or not.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240710083459.208362-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
minixfs now uses kmap_local_page(), so we can't call kunmap() to
undo it. This one call was missed as part of the commit this fixes.
Fixes: 6628f69ee66a ("minixfs: Use dir_put_page() in minix_unlink() and minix_rename()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240709195841.1986374-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
get_ckpt_valid_blocks() checks valid ckpt blocks in current section.
It counts all vblocks from the first to the last segment in the
large section. However, START_SEGNO() is used to get the first segno
in an SIT block. This patch fixes that to get the correct start segno.
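A hedged sketch of the corrected start-segno computation (macros as in
fs/f2fs/segment.h; the summing loop is illustrative):

    /* START_SEGNO() only rounds down to an SIT-block boundary, so
     * derive the first segment of the section instead: */
    unsigned int start_segno =
        GET_SEG_FROM_SEC(sbi, GET_SEC_FROM_SEG(sbi, segno));
    unsigned int i, blocks = 0;

    for (i = 0; i < sbi->segs_per_sec; i++)
        blocks += get_seg_entry(sbi, start_segno + i)->ckpt_valid_blocks;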
Fixes: 61461fc921b7 ("f2fs: fix to avoid touching checkpointed data in get_victim()")
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Pull smb server fixes from Steve French:
- fix access flags to address fuse incompatibility
- fix device type returned by get filesystem info
* tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: discard write access to the directory open
ksmbd: return FILE_DEVICE_DISK instead of super magic
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-07-08
The following pull-request contains BPF updates for your *net-next* tree.
We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).
The main changes are:
1) Support resilient split BTF which cuts down on duplication and makes BTF
as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.
2) Add support for dumping kfunc prototypes from BTF which enables both detecting
as well as dumping compilable prototypes for kfuncs, from Daniel Xu.
3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
support for BPF exceptions, from Ilya Leoshkevich.
4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.
5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
for skbs and add coverage along with it, from Vadim Fedorenko.
6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.
7) Extend the BPF verifier to track the delta between linked registers in order
to better deal with recent LLVM code optimizations, from Alexei Starovoitov.
8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
have been a pointer to the map value, from Benjamin Tissoires.
9) Extend BPF selftests to add regular expression support for test output matching
and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.
10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
iterates exactly once anyway, from Dan Carpenter.
11) Add the capability to offload the netfilter flowtable in XDP layer through
kfuncs, from Florian Westphal & Lorenzo Bianconi.
12) Various cleanups in networking helpers in BPF selftests to shave off a few
lines of open-coded functions on client/server handling, from Geliang Tang.
13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
that x86 JIT does not need to implement detection, from Leon Hwang.
14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
out-of-bounds memory access for dynpointers, from Matt Bobrowski.
15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
it might lead to problems on 32-bit archs, from Jiri Olsa.
16) Enhance traffic validation and dynamic batch size support in xsk selftests,
from Tushar Vyavahare.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
bpf: helpers: fix bpf_wq_set_callback_impl signature
libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
selftests/bpf: Remove exceptions tests from DENYLIST.s390x
s390/bpf: Implement exceptions
s390/bpf: Change seen_reg to a mask
bpf: Remove unnecessary loop in task_file_seq_get_next()
riscv, bpf: Optimize stack usage of trampoline
bpf, devmap: Add .map_alloc_check
selftests/bpf: Remove arena tests from DENYLIST.s390x
selftests/bpf: Add UAF tests for arena atomics
selftests/bpf: Introduce __arena_global
s390/bpf: Support arena atomics
s390/bpf: Enable arena
s390/bpf: Support address space cast instruction
s390/bpf: Support BPF_PROBE_MEM32
s390/bpf: Land on the next JITed instruction after exception
s390/bpf: Introduce pre- and post- probe functions
s390/bpf: Get rid of get_probe_mem_regno()
...
====================
Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, when the Virtio queue is full, a work item that retries
adding the request to the queue is scheduled to run in 1ms. This is a
large amount of time on the scale at which a virtio-fs device can
operate. When using a DPU the baseline is around 30-40us without going
to a remote server (4k, QD=1).
This patch changes the retrying behavior to immediately refill the
Virtio queue when a completion has been received.
This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
workload IO depth 2x the size of the Virtio queue and a
DPU-powered virtio-fs device (NVIDIA BlueField DPU).
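A minimal sketch of the new flow, assuming the virtio_fs_vq layout of
fs/fuse/virtio_fs.c (dispatch_work becomes a plain work item; the
locking shown is illustrative):

    static void virtio_fs_requests_done_work(struct work_struct *work)
    {
        struct virtio_fs_vq *fsvq =
            container_of(work, struct virtio_fs_vq, done_work);

        /* ... reap completed requests from the vring ... */

        /* Completions just freed descriptors: refill the queue now
         * instead of waiting for a 1ms delayed-work timer. */
        spin_lock(&fsvq->lock);
        if (!list_empty(&fsvq->queued_reqs))
            schedule_work(&fsvq->dispatch_work);
        spin_unlock(&fsvq->lock);
    }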
Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Yoray Zack <yorayz@nvidia.com>
Message-Id: <20240517190435.152096-3-pgootzen@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
|
|
Currently, when the enqueueing of a request or forget operation fails
with -ENOMEM, the enqueueing is retried after a timeout. This patch
removes this behavior and treats -ENOMEM in these scenarios like any
other error: the error is bubbled up to user space in the case of a
request, and the operation is dropped in the case of a forget. This
behavior matches that of the FUSE layer above, and also simplifies the
error handling. The latter will come in handy for upcoming patches that
optimize the retrying of operations in case of -ENOSPC.
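A hedged sketch of the new error path (names follow fs/fuse/virtio_fs.c,
but the exact context is illustrative):

    ret = virtio_fs_enqueue_req(fsvq, req, false);
    if (ret < 0) {
        /* No -ENOMEM requeue anymore: fail like any other error. */
        req->out.h.error = ret;    /* bubbled up to user space */
        spin_lock(&fsvq->lock);
        dec_in_flight_req(fsvq);
        spin_unlock(&fsvq->lock);
        fuse_request_end(req);
    }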
Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Yoray Zack <yorayz@nvidia.com>
Message-Id: <20240517190435.152096-2-pgootzen@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
|
|
Just use a generic helper to prepare buffers for all supported
stream decompressors, eliminating similar logic.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240709094106.3018109-3-hsiangkao@linux.alibaba.com
|
|
Introduce z_erofs_{init,exit}_decompressor() to unexport
z_erofs_{deflate,lzma,zstd}_{init,exit}().
Besides, call them in z_erofs_{init,exit}_subsystem()
for simplicity.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240709094106.3018109-2-hsiangkao@linux.alibaba.com
|
|
Thus *_config() function declarations can be avoided.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240709094106.3018109-1-hsiangkao@linux.alibaba.com
|
|
The logic for determining when to demote a glock in glock_work_func(),
introduced in commit 7cf8dcd3b68a ("GFS2: Automatically adjust glock min
hold time"), doesn't make sense: inode glocks have a minimum hold time
that delays demotion, while all other glocks are expected to be demoted
immediately. However, instead of demoting non-inode glocks immediately,
glock_work_func() merely schedules further glock work to demote them.
Get rid of that unnecessary indirection.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Test case: two threads write short inline data to a file. In
ext4_page_mkwrite the resulting inline data is converted. Handling
ext4_grp_locked_error with description "block bitmap and bg descriptor
inconsistent: X vs Y free clusters" calls ext4_force_shutdown. The
conversion clears EXT4_STATE_MAY_INLINE_DATA, but
ext4_destroy_inline_data_nolock and ext4_mark_iloc_dirty fail due to
ext4_forced_shutdown. The restoration of inline data fails for the same
reason, leaving EXT4_STATE_MAY_INLINE_DATA unset. Without the flag set,
the regular write path in ext4_da_write_end follows and tries to
dereference the folio private pointer, which has never been set. The fix
returns early with -EIO when the private pointer is NULL.
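A minimal sketch of the guard; its exact placement in the write-end path
is illustrative:

    struct buffer_head *head = folio_buffers(folio);

    if (!head)    /* forced shutdown left ->private unset */
        return -EIO;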
Sample crash report:
Unable to handle kernel paging request at virtual address dfff800000000004
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[dfff800000000004] address between user and kernel address ranges
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 20274 Comm: syz-executor185 Not tainted 6.9.0-rc7-syzkaller-gfda5695d692c #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __block_commit_write+0x64/0x2b0 fs/buffer.c:2167
lr : __block_commit_write+0x3c/0x2b0 fs/buffer.c:2160
sp : ffff8000a1957600
x29: ffff8000a1957610 x28: dfff800000000000 x27: ffff0000e30e34b0
x26: 0000000000000000 x25: dfff800000000000 x24: dfff800000000000
x23: fffffdffc397c9e0 x22: 0000000000000020 x21: 0000000000000020
x20: 0000000000000040 x19: fffffdffc397c9c0 x18: 1fffe000367bd196
x17: ffff80008eead000 x16: ffff80008ae89e3c x15: 00000000200000c0
x14: 1fffe0001cbe4e04 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000ff0100 x9 : 0000000000000000
x8 : 0000000000000004 x7 : 0000000000000000 x6 : 0000000000000000
x5 : fffffdffc397c9c0 x4 : 0000000000000020 x3 : 0000000000000020
x2 : 0000000000000040 x1 : 0000000000000020 x0 : fffffdffc397c9c0
Call trace:
__block_commit_write+0x64/0x2b0 fs/buffer.c:2167
block_write_end+0xb4/0x104 fs/buffer.c:2253
ext4_da_do_write_end fs/ext4/inode.c:2955 [inline]
ext4_da_write_end+0x2c4/0xa40 fs/ext4/inode.c:3028
generic_perform_write+0x394/0x588 mm/filemap.c:3985
ext4_buffered_write_iter+0x2c0/0x4ec fs/ext4/file.c:299
ext4_file_write_iter+0x188/0x1780
call_write_iter include/linux/fs.h:2110 [inline]
new_sync_write fs/read_write.c:497 [inline]
vfs_write+0x968/0xc3c fs/read_write.c:590
ksys_write+0x15c/0x26c fs/read_write.c:643
__do_sys_write fs/read_write.c:655 [inline]
__se_sys_write fs/read_write.c:652 [inline]
__arm64_sys_write+0x7c/0x90 fs/read_write.c:652
__invoke_syscall arch/arm64/kernel/syscall.c:34 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:48
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:133
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:152
el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
Code: 97f85911 f94002da 91008356 d343fec8 (38796908)
---[ end trace 0000000000000000 ]---
----------------
Code disassembly (best guess):
0: 97f85911 bl 0xffffffffffe16444
4: f94002da ldr x26, [x22]
8: 91008356 add x22, x26, #0x20
c: d343fec8 lsr x8, x22, #3
* 10: 38796908 ldrb w8, [x8, x25] <-- trapping instruction
Reported-by: syzbot+18df508cf00a0598d9a6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=18df508cf00a0598d9a6
Link: https://lore.kernel.org/all/000000000000f19a1406109eb5c5@google.com/T/
Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com>
Link: https://patch.msgid.link/20240703070112.10235-1-wojciech.gladysz@infogain.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Originally, we were quite conservative in limiting maximum transaction
size to a quarter of the journal because we were not accounting for
transaction descriptor and revoke blocks. These days we do properly
account them and reserve space for them from the total transaction
credits. Thus there's no need to be so conservative and we can increase
the maximum transaction size to one third of the journal (even half
should work fine in principle but the performance will likely suffer in
that case). This also fixes failures to grow filesystems with tiny
journals.
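A hedged sketch of the new limit (previously a quarter of the journal),
assuming the helper keeps its current shape:

    static inline int jbd2_journal_get_max_txn_bufs(journal_t *journal)
    {
        return (journal->j_total_len - journal->j_fast_commit_blocks) / 3;
    }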
Link: https://lore.kernel.org/all/CA+hUFcuGs04JHZ_WzA1zGN57+ehL2qmHOt5a7RMpo+rv6Vyxtw@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240701132800.7158-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In jbd2_journal_init_common() we set batch size of a shrinker shrinking
checkpointed buffers to journal->j_max_transaction_buffers. But that is
guaranteed to be 0 at that point so we effectively stay with the default
shrinker batch size of 128. It has been like this since introduction of
jbd2 shrinkers so just drop the pointless initialization.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Commit 9f356e5a4f12 ("jbd2: Account descriptor blocks into
t_outstanding_credits") started to account descriptor blocks into
transactions outstanding credits. However it didn't appropriately
decrease the maximum amount of credits available to userspace. Thus if
the filesystem requests a transaction smaller than
j_max_transaction_buffers but large enough that when descriptor blocks
are added the size exceeds j_max_transaction_buffers, we confuse
add_transaction_credits() into thinking previous handles have grown the
transaction too much and enter an infinite journal commit loop in
start_this_handle() -> add_transaction_credits() trying to create
transaction with enough credits available.
Fix the problem by properly accounting for transaction space reserved
for descriptor blocks when verifying requested transaction handle size.
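A hedged sketch of the corrected check in start_this_handle(); the
overhead field name is illustrative and assumes the per-transaction
descriptor-block count is precomputed at journal init (see the
precompute patch below):

    if (blocks > journal->j_max_transaction_buffers -
                 journal->j_transaction_overhead_buffers)
        /* reject the oversized handle instead of looping on commits */
        return -ENOSPC;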
CC: stable@vger.kernel.org
Fixes: 9f356e5a4f12 ("jbd2: Account descriptor blocks into t_outstanding_credits")
Reported-by: Alexander Coffin <alex.coffin@maticrobots.com>
Link: https://lore.kernel.org/all/CA+hUFcuGs04JHZ_WzA1zGN57+ehL2qmHOt5a7RMpo+rv6Vyxtw@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Instead of computing the number of descriptor blocks a transaction can
have each time we need it (which is currently when starting each
transaction but will become more frequent later) precompute the number
once during journal initialization together with maximum transaction
size. We perform the precomputation whenever the journal feature set is
updated, similarly to the computation of
journal->j_revoke_records_per_block.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
There's no reason to have jbd2_journal_get_max_txn_bufs() public
function. Currently all users are internal and can use
journal->j_max_transaction_buffers instead. This saves some unnecessary
recomputations of the limit as a bonus which becomes important as this
function gets more complex in the following patch.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We encountered a problem where the file system could not be mounted
after a power failure. Analysis of the file system image shows that only
part of the data was written to the last commit block: the valid data of
the commit block is concentrated in the first sector, yet the entire
block is involved in the checksum calculation. The minimum atomic write
unit may differ between hardware.
If the checksum of a commit block is incorrect, clear the data except
the 'commit_header' and recalculate the checksum. If that checksum is
correct, consider the block partially committed and continue replaying
the journal.
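A hedged sketch of the recovery-side handling (checksum helper as in
fs/jbd2/recovery.c; the in-place retry shape is illustrative):

    if (!jbd2_commit_block_csum_verify(journal, bh->b_data)) {
        /* Only the first sector of a torn commit block is guaranteed
         * to be on media: zero everything past the commit_header and
         * re-check. */
        memset(bh->b_data + sizeof(struct commit_header), 0,
               journal->j_blocksize - sizeof(struct commit_header));
        if (jbd2_commit_block_csum_verify(journal, bh->b_data))
            chksum_ok = 1;    /* partially committed: keep replaying */
    }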
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240620072405.3533701-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If the extended attribute size is not a multiple of block size, the last
block in the EA inode will have uninitialized tail which will get
written to disk. We will never expose the data to userspace but still
this is not a good practice so just zero out the tail of the block as it
isn't going to cause a noticeable performance overhead.
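A minimal sketch of the fix in ext4_xattr_inode_write(), where csize is
the number of value bytes copied into the current block:

    memcpy(bh->b_data, buf, csize);
    memset(bh->b_data + csize, 0, blocksize - csize);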
Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
Reported-by: syzbot+9c1fe13fcb51574b249b@syzkaller.appspotmail.com
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240613150234.25176-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When fast-commit needs to track ranges, it has to handle inodes that have
inlined data in a different way because ext4_fc_write_inode_data(), in the
actual commit path, will attempt to map the required blocks for the range.
However, inodes that have inlined data store their data in
inode->i_block and, eventually, in the extended attribute space.
Unfortunately, because fast commit doesn't currently support extended
attributes, the solution is to mark this commit as ineligible.
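A hedged sketch; EXT4_FC_REASON_INLINE_DATA is assumed to be the reason
code added by this fix, and the placement in the range-tracking path is
illustrative:

    if (ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) {
        ext4_fc_mark_ineligible(inode->i_sb,
                                EXT4_FC_REASON_INLINE_DATA, handle);
        return;
    }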
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1039883
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Tested-by: Ben Hutchings <benh@debian.org>
Fixes: 9725958bb75c ("ext4: fast commit may miss tracking unwritten range during ftruncate")
Link: https://patch.msgid.link/20240618144312.17786-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In the fast commit code there are a few places where tid_t variables are
being compared without taking into account the fact that these sequence
numbers may wrap. Fix this issue by using the helper functions tid_gt()
and tid_geq().
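For reference, the wrap-safe helpers in include/linux/jbd2.h compare
tids via a signed cast of their modular difference:

    static inline int tid_gt(tid_t x, tid_t y)
    {
        int difference = (x - y);
        return (difference > 0);
    }

    static inline int tid_geq(tid_t x, tid_t y)
    {
        int difference = (x - y);
        return (difference >= 0);
    }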
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20240529092030.9557-3-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If we're trying to allocate real space for a delalloc reservation at
offset 0, we should use the rotor to spread files across the rt volume.
Switch the rtalloc to use the XFS_ALLOC_INITIAL_USER_DATA flag that
is set for any write at startoff to make it match the behavior for
the main data device.
Based on a patch from Darrick J. Wong.
Fixes: 6a94b1acda7e ("xfs: reinstate delalloc for RT inodes (if sb_rextsize == 1)")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Now that nfsd_svc can handle an array of thread counts, fix up the
netlink threads interface to construct one from the netlink call
and pass it through so we can start a pooled server the same way we
would start a normal one.
Note that any unspecified values in the array are considered zeroes,
so it's possible to shut down a pooled server by passing in a short
array that has only zeros, or even an empty array.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Now that the refcounting is fixed, rework nfsd_svc to use the same
thread setup as the pool_threads interface. Have it take an array of
thread counts instead of just a single value, and pass that from the
netlink threads set interface. Since the new netlink interface doesn't
have the same restriction as pool_threads, move the guard against
shutting down all threads to write_pool_threads.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
I noticed LAYOUTGET(LAYOUTIOMODE4_RW) returning NFS4ERR_ACCESS
unexpectedly. The NFS client had created a file with mode 0444, and
the server had returned a write delegation on the OPEN(CREATE). The
client was requesting a RW layout using the write delegation stateid
so that it could flush file modifications.
Creating a read-only file does not seem to be problematic for
NFSv4.1 without pNFS, so I began looking at NFSD's implementation of
LAYOUTGET.
The failure was because fh_verify() was doing a permission check as
part of verifying the FH presented during the LAYOUTGET. It uses the
loga_iomode value to specify the @accmode argument to fh_verify().
fh_verify(MAY_WRITE) on a file whose mode is 0444 fails with -EACCES.
To permit LAYOUT* operations in this case, add OWNER_OVERRIDE when
checking the access permission of the incoming file handle for
LAYOUTGET and LAYOUTCOMMIT.
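A hedged sketch of the access check in nfsd4_layoutget() (the NFSD_MAY_*
constants are real; the exact flag selection is illustrative):

    u32 accmode = NFSD_MAY_OWNER_OVERRIDE;

    if (lgp->lg_seg.iomode == IOMODE_READ)
        accmode |= NFSD_MAY_READ;
    else
        accmode |= NFSD_MAY_READ | NFSD_MAY_WRITE;

    nfserr = fh_verify(rqstp, &cstate->current_fh, 0, accmode);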
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v6.6+
Message-Id: 4E9C0D74-A06D-4DC3-A48A-73034DC40395@oracle.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The *-objs suffix is reserved for (user-space) host programs, while the
*-y suffix is usually used for kernel drivers (although *-objs still
works for that purpose for now).
Let's correct the old usages of *-objs in Makefiles.
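An illustrative before/after for such a Makefile (object list
abbreviated):

    obj-$(CONFIG_NFSD) += nfsd.o

    # old, host-program style:
    #   nfsd-objs := nfssvc.o nfsctl.o
    # preferred kernel style:
    nfsd-y := nfssvc.o nfsctl.o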
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Since CONFIG_NFSD_LEGACY_CLIENT_TRACKING is a new config option, its
initial default setting should have been Y (if we are to follow the
common practice of "default Y, wait, default N, wait, remove code").
Paul also suggested adding a clearer remedy action to the warning
message.
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Message-Id: <d2ab4ee7-ba0f-44ac-b921-90c8fa5a04d2@molgen.mpg.de>
Fixes: 74fd48739d04 ("nfsd: new Kconfig option for legacy client tracking")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
'nfsd3_voidargs' in nfs[23]acl.c is unused since
commit 788f7183fba8 ("NFSD: Add common helpers to decode void args and
encode void results").
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
These lengths come from xdr_stream_decode_u32() and so we should be a
bit careful with them. Use size_add() and struct_size() to avoid
integer overflows. Saving size_add()/struct_size() results to a u32 is
unsafe because it truncates away the high bits.
Also generally storing sizes in longs is safer. Most systems these days
use 64 bit CPUs. It's harder for an addition to overflow 64 bits than
it is to overflow 32 bits. Also functions like vmalloc() can
successfully allocate UINT_MAX bytes, but nothing can allocate ULONG_MAX
bytes.
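A minimal sketch of the pattern (size_add() and struct_size() saturate
at SIZE_MAX, so a wrapped value can never reach the allocator; the
struct and variable names are illustrative):

    #include <linux/overflow.h>

    size_t len = size_add(sizeof(struct smb_hdr_part),
                          size_add(name_len, value_len));
    void *p = kvmalloc(len, GFP_KERNEL);  /* fails cleanly on saturation */

    /* for a struct with a trailing flexible array: */
    len = struct_size(entry, elements, count);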
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When NFS requests are split into sub-requests, nfs_inode_remove_request
calls nfs_page_group_sync_on_bit to set PG_REMOVE on this sub-request and
only completes the head requests once PG_REMOVE is set on all requests.
This means that when nfs_lock_and_join_requests sees a PG_REMOVE bit, I/O
on the request is in progress and has partially completed. If such a
request is returned to nfs_try_to_update_request, it could be extended
with the newly dirtied region and I/O for the combined range will be
re-scheduled, leading to extra I/O.
Change the logic to instead restart the search for a request when any
PG_REMOVE bit is set, as the completion handler will remove the request
as soon as it can take the page group lock. This not only avoids
extending the I/O but also does the right thing for the callers that
want to cancel or flush the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
nfs_wait_on_request is now only used in write.c. Move it there
and mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests to
prepare for future changes to this code, and move the helpers to write.c
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Fold nfs_folio_find_and_lock_request into the only caller to prepare
for changes to this code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
nfs_folio_find_and_lock_request and the nfs_page_group_lock_head helper
called by it spend quite some effort to deal with head vs subrequests.
But given that only the head request can be stashed in the folio private
data, none of that is required.
Fold the locking logic from nfs_page_group_lock_head into
nfs_folio_find_and_lock_request and simplify the result based on the
invariant that we always find the head request in the folio private data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
nfs_folio_private_request is a trivial wrapper around folio_get_private(),
which itself has fallen out of favor and has been replaced with plain
->private dereferences in recent folio conversions. Do the same for nfs.
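A minimal before/after of the conversion:

    /* before */
    struct nfs_page *req = nfs_folio_private_request(folio);

    /* after */
    struct nfs_page *req = folio->private;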
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Remove the code testing folio_test_swapcache either explicitly or
implicitly in pagemap.h headers, as swap I/O is now handled using the
direct I/O path and not the buffered I/O path that these helpers are
located in.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Previously in order to mark the communication with the DS server,
we tried to use NFS_CS_DS in cl_flags. However, this flag would
only be saved for the DS server and in case where DS equals MDS,
the client would not find a matching nfs_client in nfs_match_client
that represents the MDS (but is also a DS).
Instead of relying on NFS_CS_DS, use NFS_CS_PNFS.
Fixes: 379e4adfddd6 ("NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Commit 6df25e58532b ("nfs: remove reliance on bdi congestion")
introduced an NFS-private solution for limiting the number of writes
outstanding against a particular server. Unlike the previous bdi
congestion mechanism, this algorithm actually works and limits the
number of outstanding writeback pages to nfs_congestion_kb, which scales
with the amount of client memory and is capped at 256 MB. As a result,
some workloads such as random buffered writes over NFS got slower (from
~170 MB/s to ~126 MB/s). The
fio command to reproduce is:
fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1
--runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite
--size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1
This happens because the client sends ~256 MB worth of dirty pages to
the server and any further background writeback request is ignored until
the number of writeback pages gets below the threshold of 192 MB. By the
time this happens and the client decides to trigger another round of
writeback, the server often has no pages to write and the disk is idle.
Fix this problem and make the client react faster to eased congestion of
the server by blocking and waiting for congestion to resolve instead of
aborting writeback. This improves the random 4k buffered write
throughput to 184 MB/s.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Atomic types are better initialized with atomic_long_set() than by
relying on the zeroing done by kzalloc(). Clean this up.
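A minimal sketch of the cleanup (context illustrative; 'writeback' is
the congestion counter in struct nfs_server):

    server = kzalloc(sizeof(*server), GFP_KERNEL);
    if (!server)
        return NULL;
    /* explicit initialization instead of relying on kzalloc() zeroing */
    atomic_long_set(&server->writeback, 0);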
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
nfss->writeback is updated only when we are ending page writeback and at
that moment we also clear nfss->write_congested. So there's no point in
rechecking congestion state in nfs_commit_release_pages(). Drop the
pointless check.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|