Age | Commit message (Collapse) | Author |
|
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Match COPY operations up with CB_OFFLOAD operations.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Make it easier to grep for s2s COPY stateids in trace logs: Use the
same display format in nfsd_copy_class as is used to display other
stateids.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Nothing appears to limit the number of concurrent async COPY
operations that clients can start. In addition, AFAICT each async
COPY can copy an unlimited number of 4MB chunks, so can run for a
long time. Thus IMO async COPY can become a DoS vector.
Add a restriction mechanism that bounds the number of concurrent
background COPY operations. Start simple and try to be fair -- this
patch implements a per-namespace limit.
An async COPY request that occurs while this limit is exceeded gets
NFS4ERR_DELAY. The requesting client can choose to send the request
again after a delay or fall back to a traditional read/write style
copy.
If there is need to make the mechanism more sophisticated, we can
visit that in future patches.
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Currently, when NFSD handles an asynchronous COPY, it returns a
zero write verifier, relying on the subsequent CB_OFFLOAD callback
to pass the write verifier and a stable_how4 value to the client.
However, if the CB_OFFLOAD never arrives at the client (for example,
if a network partition occurs just as the server sends the
CB_OFFLOAD operation), the client will never receive this verifier.
Thus, if the client sends a follow-up COMMIT, there is no way for
the client to assess the COMMIT result.
The usual recovery for a missing CB_OFFLOAD is for the client to
send an OFFLOAD_STATUS operation, but that operation does not carry
a write verifier in its result. Neither does it carry a stable_how4
value, so the client /must/ send a COMMIT in this case -- which will
always fail because currently there's still no write verifier in the
COPY result.
Thus the server needs to return a normal write verifier in its COPY
result even if the COPY operation is to be performed asynchronously.
If the server recognizes the callback stateid in subsequent
OFFLOAD_STATUS operations, then obviously it has not restarted, and
the write verifier the client received in the COPY result is still
valid and can be used to assess a COMMIT of the copied data, if one
is needed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
wake_up_var() needs a barrier after the important change is made in the
var and before wake_up_var() is called, else it is possible that a wake
up won't be sent when it should.
In each case here the var is changed in an "atomic" manner, so
smb_mb__after_atomic() is sufficient.
In one case the important change (removing the lease) is performed
*after* the wake_up, which is backwards. The code survives in part
because the wait_var_event is given a timeout.
This patch adds the required barriers and calls destroy_delegation()
*before* waking any threads waiting for the delegation to be destroyed.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd has two places that open-code clear_and_wake_up_bit(). One has
the required memory barriers. The other does not.
Change both to use clear_and_wake_up_bit() so we have the barriers
without the noise.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Using ERR_CAST() is more reasonable and safer, When it is necessary
to convert the type of an error pointer and return it.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add the __counted_by compiler attribute to the flexible array member
volumes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Use struct_size() instead of manually calculating the number of bytes to
allocate for a pnfs_block_deviceaddr with a single volume.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If not enough buffer space available, but idmap_lookup has triggered
lookup_fn which calls cache_get and returns successfully. Then we
missed to call cache_put here which pairs with cache_get.
Fixes: ddd1ea563672 ("nfsd4: use xdr_reserve_space in attribute encoding")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviwed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add some tracepoints in the callback client RPC operations. Also
add a tracepoint to nfsd4_cb_getattr_done.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Keep track of the "main" opcode for the callback, and display it in the
tracepoint. This makes it simpler to discern what's happening when there
is more than one callback in flight.
The one special case is the CB_NULL RPC. That's not a CB_COMPOUND
opcode, so designate the value 0 for that.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Currently, you get the warning and stack trace, but nothing is printed
about the relevant error codes. Add that in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Fix spelling errors in comments of nfsd4_release_lockowner and
nfs4_set_delegation.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Commit 427f5f83a319 ("NFSD: Ensure nf_inode is never dereferenced") passes
inode directly to nfsd_file_mark_find_or_create instead of getting it from
nf, so there is no need to pass nf.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
list_head can be initialized automatically with LIST_HEAD()
instead of calling INIT_LIST_HEAD().
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Ext4 will throw -EBADMSG through ext4_readdir when a checksum error
occurs, resulting in the following WARNING.
Fix it by mapping EBADMSG to nfserr_io.
nfsd_buffered_readdir
iterate_dir // -EBADMSG -74
ext4_readdir // .iterate_shared
ext4_dx_readdir
ext4_htree_fill_tree
htree_dirblock_to_tree
ext4_read_dirblock
__ext4_read_dirblock
ext4_dirblock_csum_verify
warn_no_space_for_csum
__warn_no_space_for_csum
return ERR_PTR(-EFSBADCRC) // -EBADMSG -74
nfserrno // WARNING
[ 161.115610] ------------[ cut here ]------------
[ 161.116465] nfsd: non-standard errno: -74
[ 161.117315] WARNING: CPU: 1 PID: 780 at fs/nfsd/nfsproc.c:878 nfserrno+0x9d/0xd0
[ 161.118596] Modules linked in:
[ 161.119243] CPU: 1 PID: 780 Comm: nfsd Not tainted 5.10.0-00014-g79679361fd5d #138
[ 161.120684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qe
mu.org 04/01/2014
[ 161.123601] RIP: 0010:nfserrno+0x9d/0xd0
[ 161.124676] Code: 0f 87 da 30 dd 00 83 e3 01 b8 00 00 00 05 75 d7 44 89 ee 48 c7 c7 c0 57 24 98 89 44 24 04 c6
05 ce 2b 61 03 01 e8 99 20 d8 00 <0f> 0b 8b 44 24 04 eb b5 4c 89 e6 48 c7 c7 a0 6d a4 99 e8 cc 15 33
[ 161.127797] RSP: 0018:ffffc90000e2f9c0 EFLAGS: 00010286
[ 161.128794] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 161.130089] RDX: 1ffff1103ee16f6d RSI: 0000000000000008 RDI: fffff520001c5f2a
[ 161.131379] RBP: 0000000000000022 R08: 0000000000000001 R09: ffff8881f70c1827
[ 161.132664] R10: ffffed103ee18304 R11: 0000000000000001 R12: 0000000000000021
[ 161.133949] R13: 00000000ffffffb6 R14: ffff8881317c0000 R15: ffffc90000e2fbd8
[ 161.135244] FS: 0000000000000000(0000) GS:ffff8881f7080000(0000) knlGS:0000000000000000
[ 161.136695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 161.137761] CR2: 00007fcaad70b348 CR3: 0000000144256006 CR4: 0000000000770ee0
[ 161.139041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 161.140291] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 161.141519] PKRU: 55555554
[ 161.142076] Call Trace:
[ 161.142575] ? __warn+0x9b/0x140
[ 161.143229] ? nfserrno+0x9d/0xd0
[ 161.143872] ? report_bug+0x125/0x150
[ 161.144595] ? handle_bug+0x41/0x90
[ 161.145284] ? exc_invalid_op+0x14/0x70
[ 161.146009] ? asm_exc_invalid_op+0x12/0x20
[ 161.146816] ? nfserrno+0x9d/0xd0
[ 161.147487] nfsd_buffered_readdir+0x28b/0x2b0
[ 161.148333] ? nfsd4_encode_dirent_fattr+0x380/0x380
[ 161.149258] ? nfsd_buffered_filldir+0xf0/0xf0
[ 161.150093] ? wait_for_concurrent_writes+0x170/0x170
[ 161.151004] ? generic_file_llseek_size+0x48/0x160
[ 161.151895] nfsd_readdir+0x132/0x190
[ 161.152606] ? nfsd4_encode_dirent_fattr+0x380/0x380
[ 161.153516] ? nfsd_unlink+0x380/0x380
[ 161.154256] ? override_creds+0x45/0x60
[ 161.155006] nfsd4_encode_readdir+0x21a/0x3d0
[ 161.155850] ? nfsd4_encode_readlink+0x210/0x210
[ 161.156731] ? write_bytes_to_xdr_buf+0x97/0xe0
[ 161.157598] ? __write_bytes_to_xdr_buf+0xd0/0xd0
[ 161.158494] ? lock_downgrade+0x90/0x90
[ 161.159232] ? nfs4svc_decode_voidarg+0x10/0x10
[ 161.160092] nfsd4_encode_operation+0x15a/0x440
[ 161.160959] nfsd4_proc_compound+0x718/0xe90
[ 161.161818] nfsd_dispatch+0x18e/0x2c0
[ 161.162586] svc_process_common+0x786/0xc50
[ 161.163403] ? nfsd_svc+0x380/0x380
[ 161.164137] ? svc_printk+0x160/0x160
[ 161.164846] ? svc_xprt_do_enqueue.part.0+0x365/0x380
[ 161.165808] ? nfsd_svc+0x380/0x380
[ 161.166523] ? rcu_is_watching+0x23/0x40
[ 161.167309] svc_process+0x1a5/0x200
[ 161.168019] nfsd+0x1f5/0x380
[ 161.168663] ? nfsd_shutdown_threads+0x260/0x260
[ 161.169554] kthread+0x1c4/0x210
[ 161.170224] ? kthread_insert_work_sanity_check+0x80/0x80
[ 161.171246] ret_from_fork+0x1f/0x30
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Commit 5826e09bf3dd ("NFSD: OP_CB_RECALL_ANY should recall both read and
write delegations") added a new assignment statement to add
RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY. So the
old one should be removed.
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Collect a few very old previous employers as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
According to RFC 8881, all minor versions of NFSv4 support PUTPUBFH.
Replace the XDR decoder for PUTPUBFH with a "noop" since we no
longer want the minorversion check, and PUTPUBFH has no arguments to
decode. (Ideally nfsd4_decode_noop should really be called
nfsd4_decode_void).
PUTPUBFH should now behave just like PUTROOTFH.
Reported-by: Cedric Blancher <cedric.blancher@gmail.com>
Fixes: e1a90ebd8b23 ("NFSD: Combine decode operations for v4 and v4.1")
Cc: Dan Shelton <dan.f.shelton@gmail.com>
Cc: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The 'callback address' in client_info_show is output without quotes
causing yaml parsers to fail on processing IPv6 addresses.
Adding quotes to 'callback address' also matches that used by
the 'address' field.
Signed-off-by: Mark Grimes <mark.grimes@ixsystems.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Synchronously wait for all disconnects to complete to ensure the
transports have divested all hardware resources before the
underlying RDMA device can safely be removed.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If an NFS operation expects a particular sort of object (file, dir, link,
etc) but gets a file handle for a different sort of object, it must
return an error. The actual error varies among NFS versions in non-trivial
ways.
For v2 and v3 there are ISDIR and NOTDIR errors and, for NFSv4 only,
INVAL is suitable.
For v4.0 there is also NFS4ERR_SYMLINK which should be used if a SYMLINK
was found when not expected. This take precedence over NOTDIR.
For v4.1+ there is also NFS4ERR_WRONG_TYPE which should be used in
preference to EINVAL when none of the specific error codes apply.
When nfsd_mode_check() finds a symlink where it expected a directory it
needs to return an error code that can be converted to NOTDIR for v2 or
v3 but will be SYMLINK for v4. It must be different from the error
code returns when it finds a symlink but expects a regular file - that
must be converted to EINVAL or SYMLINK.
So we introduce an internal error code nfserr_symlink_not_dir which each
version converts as appropriate.
nfsd_check_obj_isreg() is similar to nfsd_mode_check() except that it is
only used by NFSv4 and only for OPEN. NFSERR_INVAL is never a suitable
error if the object is the wrong time. For v4.0 we use nfserr_symlink
for non-dirs even if not a symlink. For v4.1 we have nfserr_wrong_type.
We handle this difference in-place in nfsd_check_obj_isreg() as there is
nothing to be gained by delaying the choice to nfsd4_map_status().
As a result of these changes, nfsd_mode_check() doesn't need an rqstp
arg any more.
Note that NFSv4 operations are actually performed in the xdr code(!!!)
so to the only place that we can map the status code successfully is in
nfsd4_encode_operation().
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Rather than using ad hoc values for internal errors (30000, 11000, ...)
use 'enum' to sequentially allocate numbers starting from the first
known available number - now visible as NFS4ERR_FIRST_FREE.
The goal is values that are distinct from all be32 error codes. To get
those we must first select integers that are not already used, then
convert them with cpu_to_be32().
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
There is code scattered around nfsd which chooses an error status based
on the particular version of nfs being used. It is cleaner to have the
version specific choices in version specific code.
With this patch common code returns the most specific error code
possible and the version specific code maps that if necessary.
Both v2 (nfsproc.c) and v3 (nfs3proc.c) now have a "map_status()"
function which is called to map the resp->status before each non-trivial
nfsd_proc_* or nfsd3_proc_* function returns.
NFS4ERR_SYMLINK and NFS4ERR_WRONG_TYPE introduce extra complications and
are left for a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This further centralizes version number checks.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
With this patch the only places that test ->rq_vers against a specific
version are nfsd_v4client() and nfsd_set_fh_dentry().
The latter sets some flags in the svc_fh, which now includes:
fh_64bit_cookies
fh_use_wgather
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd_breaker_owns_lease() currently open-codes the same test that
nfsd_v4client() performs.
With this patch we use nfsd_v4client() instead.
Also as i_am_nfsd() is only used in combination with kthread_data(),
replace it with nfsd_current_rqst() which combines the two and returns a
valid svc_rqst, or NULL.
The test for NULL is moved into nfsd_v4client() for code clarity.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd_permission(), exp_rdonly(), nfsd_setuser(), and nfsexp_flags()
only ever need the cred out of rqstp, so pass it explicitly instead of
the whole rqstp.
This makes the interfaces cleaner.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Rather than passing the whole rqst, pass the pieces that are actually
needed. This makes the inputs to rqst_exp_find() more obvious.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Move the stateid handling to nfsd4_copy_notify.
If nfs4_preprocess_stateid_op did not produce an output stateid, error out.
Copy notify specifically does not permit the use of special stateids,
so enforce that outside generic stateid pre-processing.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If an svc thread needs to perform some initialisation that might fail,
it has no good way to handle the failure.
Before the thread can exit it must call svc_exit_thread(), but that
requires the service mutex to be held. The thread cannot simply take
the mutex as that could deadlock if there is a concurrent attempt to
shut down all threads (which is unlikely, but not impossible).
nfsd currently call svc_exit_thread() unprotected in the unlikely event
that unshare_fs_struct() fails.
We can clean this up by introducing svc_thread_init_status() by which an
svc thread can report whether initialisation has succeeded. If it has,
it continues normally into the action loop. If it has not,
svc_thread_init_status() immediately aborts the thread.
svc_start_kthread() waits for either of these to happen, and calls
svc_exit_thread() (under the mutex) if the thread aborted.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The only caller of svc_rqst_alloc() is svc_prepare_thread(). So merge
the one into the other and simplify.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
As documented in svc_xprt.c, sv_nrthreads is protected by the service
mutex, and it does not need ->sv_lock.
(->sv_lock is needed only for sv_permsocks, sv_tempsocks, and
sv_tmpcnt).
So remove the unnecessary locking.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
sp_nrthreads is only ever accessed under the service mutex
nlmsvc_mutex nfs_callback_mutex nfsd_mutex
so these is no need for it to be an atomic_t.
The fact that all code using it is single-threaded means that we can
simplify svc_pool_victim and remove the temporary elevation of
sp_nrthreads.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The locking required for svc_exit_thread() is not obvious, so document
it in a kdoc comment.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Instead of using kmalloc to allocate an array for storing active version
info, just declare an array to the max size - it is only 5 or so.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When invoke memfd_pin_folios, we need offer an array to save each folio
which we pinned.
The current way is dynamic alloc an array(use kvmalloc), get folios,
save into udmabuf and then free.
Depend on the size, kvmalloc can do something different:
Below PAGE_SIZE, slab allocator will be used, which have good alloc
performance, due to it cached page.
PAGE_SIZE - PCP Order, PCP(per-cpu-pageset) also given buddy page a
cache in each CPU, so different CPU no need to hold some lock(zone or
some) to get the locally page. If PCP cached page, the access also fast.
PAGE_SIZE - BUDDY_MAX, try to get page from buddy, due to kvmalloc adjusted
the gfp flags, if zone freelist can't alloc page(fast path), we will not
enter slowpath to reclaim memory. Due to need hold lock and check, may
slow, but still fast than vmalloc.
Anything wrong will fallback into vmalloc to alloc memory, it obtains
contiguous virtual addresses by loop alloc order 0 page(PAGE_SIZE), and
then map it into vmalloc area. If necessary, page alloc may enter
slowpath to reclaim memory. Hence, if fallback into vmalloc, it's slow.
When create, we need to iter each udmabuf item, then pin it's range
folios, if each item's range folio's count is large, we may fallback each
into vmalloc.
This patch find the largest range folio in items, then alloc this size's
folio array. When pin range folios, reuse this array.
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-8-link@vivo.com
|
|
Currently, udmabuf handles folio by create an unpin list to record
each folio obtained from the list and unpinning them when released. To
maintain this, many struct have been established.
However, maintain this requires a significant amount of memory and
iter the list is a substantial overhead, which is not friendly to the
CPU cache.
When create, we arranged the folio array in the order of pin and set
the offset according to pgcnt. So, if record each pinned folio when
create, then can easy unpin it. Compare to use list to record it,
an array also can do this.
Hence, this patch setup a pinned_folios array(size is the pgcnt) to
instead of udmabuf_folio struct, it record each folio which pinned when
invoke memfd_pin_folios, then unpin folio by iter pinned_folios.
Note that, since a folio may be pinned multiple times, each folio can be
added to pinned_folios multiple times, depend on how many times the
folio has been pinned when create.
Compare to udmabuf_folio(24 byte size), a folio pointer is 8 byte, if no
large folio - each folio is PAGE_SIZE - and need to unpin when release.
So need to record each folio, by this patch, each folio can save 16 byte.
But if large folio used, depend on the large folio's number, the
pinned_folios array may take more memory, but it still can makes unpin
access more cache-friendly.
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-7-link@vivo.com
|
|
After udmabuf is allocated, its resources need to be initialized,
including various array structures. The current array structure has
already been greatly expanded.
Also, before udmabuf needs to be kfree, the occupied resources need to
be released.
This part is repetitive and maybe overlooked.
This patch give a helper function when init and deinit, by this,
reduce duplicate code.
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-6-link@vivo.com
|
|
This patch aim to simplify the memfd folio pin during the udmabuf
create. No functional changes.
This patch create a udmabuf_pin_folios function, in this, do the memfd
pin folio and then record each pinned folio, offset.
This patch simplify the pinned folio record, iter by each pinned folio,
and then record each offset in it.
Compare to iter by pgcnt, more readable.
Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-5-link@vivo.com
|
|
Currently vmap_udmabuf set page's array by each folio.
But, ubuf->folios is only contain's the folio's head page.
That mean we repeatedly mapped the folio head page to the vmalloc area.
Due to udmabuf can use hugetlb, if HVO enabled, tail page may not exist,
so, we can't use page array to map, instead, use pfn array.
By this, we removed page usage in udmabuf totally.
Fixes: 5e72b2b41a21 ("udmabuf: convert udmabuf driver to use folios")
Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-4-link@vivo.com
|
|
When PAGE_SIZE 4096, MAX_PAGE_ORDER 10, 64bit machine,
page_alloc only support 4MB.
If above this, trigger this warn and return NULL.
udmabuf can change size limit, if change it to 3072(3GB), and then alloc
3GB udmabuf, will fail create.
[ 4080.876581] ------------[ cut here ]------------
[ 4080.876843] WARNING: CPU: 3 PID: 2015 at mm/page_alloc.c:4556 __alloc_pages+0x2c8/0x350
[ 4080.878839] RIP: 0010:__alloc_pages+0x2c8/0x350
[ 4080.879470] Call Trace:
[ 4080.879473] <TASK>
[ 4080.879473] ? __alloc_pages+0x2c8/0x350
[ 4080.879475] ? __warn.cold+0x8e/0xe8
[ 4080.880647] ? __alloc_pages+0x2c8/0x350
[ 4080.880909] ? report_bug+0xff/0x140
[ 4080.881175] ? handle_bug+0x3c/0x80
[ 4080.881556] ? exc_invalid_op+0x17/0x70
[ 4080.881559] ? asm_exc_invalid_op+0x1a/0x20
[ 4080.882077] ? udmabuf_create+0x131/0x400
Because MAX_PAGE_ORDER, kmalloc can max alloc 4096 * (1 << 10), 4MB
memory, each array entry is pointer(8byte), so can save 524288 pages(2GB).
Further more, costly order(order 3) may not be guaranteed that it can be
applied for, due to fragmentation.
This patch change udmabuf array use kvmalloc_array, this can fallback
alloc into vmalloc, which can guarantee allocation for any size and does
not affect the performance of kmalloc allocations.
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-3-link@vivo.com
|
|
The current udmabuf mmap only fills the physical memory to the
corresponding virtual address when the user actually accesses the
virtual address.
However, the current udmabuf has already obtained and pinned the folio
upon completion of the creation.This means that the physical memory has
already been acquired, rather than being accessed dynamically.
As a result, the page fault has lost its purpose as a demanding
page. Due to the fact that page fault requires trapping into kernel mode
and filling in when accessing the corresponding virtual address in mmap,
when creating a large size udmabuf, this represents a considerable
overhead.
This patch fill the pfn into page table, and then pre-fault each pfn
into vma, when first access.
Notice, if anything wrong , we do not return an error during this
pre-fault step. However, an error will be returned if the failure occurs
when the addr is truly accessed
Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Huan Yang <link@vivo.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240918025238.2957823-2-link@vivo.com
|
|
I would like to help maintain the udmabuf driver, in light of the
recent changes that converted the driver to use folios instead
of pages. Furthermore, I also contribute to Qemu's virtio-gpu
module (and UI modules), that are primary users of udmabuf driver.
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240822045806.3563883-1-vivek.kasireddy@intel.com
|
|
Add new PCI id for ARL platform.
v2: Fix typo in PCI id (SaiTeja)
Signed-off-by: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com>
Reviewed-by: Sai Teja Pottumuttu <sai.teja.pottumuttu@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240912115906.2730577-1-dnyaneshwar.bhadane@intel.com
|
|
ENGINE API has been deprecated since OpenSSL version 3.0 [1].
Distros have started dropping support from headers and in future
it will likely disappear also from library.
It has been superseded by the PROVIDER API, so use it instead
for OPENSSL MAJOR >= 3.
[1] https://github.com/openssl/openssl/blob/master/README-ENGINES.md
[jarkko: fixed up alignment issues reported by checkpatch.pl --strict]
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: R Nageswara Sastry <rnsastry@linux.ibm.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
ERR_get_error_line() is deprecated since OpenSSL 3.0.
Use ERR_peek_error_line() instead, and combine display_openssl_errors()
and drain_openssl_errors() to a single function where parameter decides
if it should consume errors silently.
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: R Nageswara Sastry <rnsastry@linux.ibm.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
Couple error handling helpers are repeated in both tools, so
move them to a common header.
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: R Nageswara Sastry <rnsastry@linux.ibm.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
In find_asymmetric_key(), if all NULLs are passed in the id_{0,1,2}
arguments, the kernel will first emit WARN but then have an oops
because id_2 gets dereferenced anyway.
Add the missing id_2 check and move WARN_ON() to the final else branch
to avoid duplicate NULL checks.
Found by Linux Verification Center (linuxtesting.org) with Svace static
analysis tool.
Cc: stable@vger.kernel.org # v5.17+
Fixes: 7d30198ee24f ("keys: X.509 public key issuer lookup without AKID")
Suggested-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Roman Smirnov <r.smirnov@omp.ru>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|