git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message	Author
2024-03-01	SUNRPC: Remove EXPORT_SYMBOL_GPL for svc_process_bc()	Chuck Lever
	svc_process_bc(), previously known as bc_svc_process(), was added in commit 4d6bbb6233c9 ("nfs41: Backchannel bc_svc_process()") but there has never been a call site outside of the sunrpc.ko module. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Add callback operation lifetime trace points	Chuck Lever
	Help observe the flow of callback operations. bc_shutdown() records exactly when the backchannel RPC client is destroyed and cl_cb_client is replaced with NULL. Examples include: nfsd-955 [004] 650.013997: nfsd_cb_queue: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try) kworker/u21:4-497 [004] 650.014050: nfsd_cb_seq_status: task:00000001@00000001 sessionid=65b3c5b8:f541f749:00000001:00000000 tk_status=-107 seq_status=1 kworker/u21:4-497 [004] 650.014051: nfsd_cb_restart: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff88810e39f400 (first try) kworker/u21:4-497 [004] 650.014066: nfsd_cb_queue: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff88810e39f400 (need restart) kworker/u16:0-10 [006] 650.065750: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN kworker/u16:0-10 [006] 650.065752: nfsd_cb_bc_update: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try) kworker/u16:0-10 [006] 650.065754: nfsd_cb_bc_shutdown: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try) kworker/u16:0-10 [006] 650.065810: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Rename nfsd_cb_state trace point	Chuck Lever
	Make it clear where backchannel state is updated. Example trace point output: kworker/u16:0-10 [006] 2800.080404: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UP nfsd-940 [003] 2800.478368: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN kworker/u16:0-10 [003] 2800.478828: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN kworker/u16:0-10 [005] 2802.039724: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UP kworker/u16:0-10 [005] 2810.611452: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=FAULT kworker/u16:0-10 [005] 2810.616832: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN kworker/u16:0-10 [005] 2810.616931: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Replace dprintks in nfsd4_cb_sequence_done()	Chuck Lever
	Improve observability of backchannel session operation. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Add nfsd_seq4_status trace event	Chuck Lever
	Add a trace point that records SEQ4_STATUS flags returned in an NFSv4.1 SEQUENCE response. SEQ4_STATUS flags report backchannel issues and changes to lease state to clients. Knowing what the server is reporting to clients is useful for debugging both configuration and operational issues in real time. For example, upcoming patches will enable server administrators to revoke parts of a client's lease; that revocation is indicated to the client when a subsequent SEQUENCE operation has one or more SEQ4_STATUS flags that are set. Sample trace records: nfsd-927 [006] 615.581821: nfsd_seq4_status: xid=0x095ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT nfsd-927 [006] 615.588043: nfsd_seq4_status: xid=0x0a5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT nfsd-928 [003] 615.588448: nfsd_seq4_status: xid=0x0b5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Retransmit callbacks after client reconnects	Chuck Lever
	NFSv4.1 clients assume that if they disconnect, that will force the server to resend pending callback operations once a fresh connection has been established. Turns out NFSD has not been resending after reconnect. Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down	Chuck Lever
	As part of managing a client disconnect, NFSD closes down and replaces the backchannel rpc_clnt. If a callback operation is pending when the backchannel rpc_clnt is shut down, currently nfsd4_run_cb_work() just discards that callback. But there are multiple cases to deal with here: o The client's lease is getting destroyed. Throw the CB away. o The client disconnected. It might be forcing a retransmit of CB operations, or it could have disconnected for other reasons. Reschedule the CB so it is retransmitted when the client reconnects. Since callback operations can now be rescheduled, ensure that cb_ops->prepare can be called only once by moving the cb_ops->prepare paragraph down to just before the rpc_call_async() call. Fixes: 2bbfed98a4d8 ("nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Convert the callback workqueue to use delayed_work	Chuck Lever
	Normally, NFSv4 callback operations are supposed to be sent to the client as soon as they are queued up. In a moment, I will introduce a recovery path where the server has to wait for the client to reconnect. We don't want a hard busy wait here -- the callback should be requeued to try again in several milliseconds. For now, convert nfsd4_callback from struct work_struct to struct delayed_work, and queue with a zero delay argument. This should avoid behavior changes for current operation. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: Reset cb_seq_status after NFS4ERR_DELAY	Chuck Lever
	I noticed that once an NFSv4.1 callback operation gets a NFS4ERR_DELAY status on CB_SEQUENCE and then the connection is lost, the callback client loops, resending it indefinitely. The switch arm in nfsd4_cb_sequence_done() that handles NFS4ERR_DELAY uses rpc_restart_call() to rearm the RPC state machine for the retransmit, but that path does not call the rpc_prepare_call callback again. Thus cb_seq_status is set to -10008 by the first NFS4ERR_DELAY result, but is never set back to 1 for the retransmits. nfsd4_cb_sequence_done() thinks it's getting nothing but a long series of CB_SEQUENCE NFS4ERR_DELAY replies. Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: make svc_stat per-network namespace instead of global	Josef Bacik
	The final bit of stats that is global is the rpc svc_stat. Move this into the nfsd_net struct and use that everywhere instead of the global struct. Remove the unused global struct. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: remove nfsd_stats, make th_cnt a global counter	Josef Bacik
	This is the last global stat, take it out of the nfsd_stats struct and make it a global part of nfsd, report it the same as always. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: make all of the nfsd stats per-network namespace	Josef Bacik
	We have a global set of counters that we modify for all of the nfsd operations, but now that we're exposing these stats across all network namespaces we need to make the stats also be per-network namespace. We already have some caching stats that are per-network namespace, so move these definitions into the same counter and then adjust all the helpers and users of these stats to provide the appropriate nfsd_net struct so that the stats are maintained for the per-network namespace objects. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: expose /proc/net/sunrpc/nfsd in net namespaces	Josef Bacik
	We are running nfsd servers inside of containers with their own network namespace, and we want to monitor these services using the stats found in /proc. However these are not exposed in the proc inside of the container, so we have to bind mount the host /proc into our containers to get at this information. Separate out the stat counters init and the proc registration, and move the proc registration into the pernet operations entry and exit points so that these stats can be exposed inside of network namespaces. This is an intermediate step, this just exposes the global counters in the network namespace. Subsequent patches will move these counters into the per-network namespace container. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: rename NFSD_NET_* to NFSD_STATS_*	Josef Bacik
	We're going to merge the stats all into per network namespace in subsequent patches, rename these nn counters to be consistent with the rest of the stats. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	sunrpc: use the struct net as the svc proc private	Josef Bacik
	nfsd is the only thing using this helper, and it doesn't use the private currently. When we switch to per-network namespace stats we will need the struct net * in order to get to the nfsd_net. Use the net as the proc private so we can utilize this when we make the switch over. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	sunrpc: remove ->pg_stats from svc_program	Josef Bacik
	Now that this isn't used anywhere, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	sunrpc: pass in the sv_stats struct through svc_create_pooled	Josef Bacik
	Since only one service actually reports the rpc stats there's not much of a reason to have a pointer to it in the svc_program struct. Adjust the svc_create_pooled function to take the sv_stats as an argument and pass the struct through there as desired instead of getting it from the svc_program->pg_stats. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: stop setting ->pg_stats for unused stats	Josef Bacik
	A lot of places are setting a blank svc_stats in ->pg_stats and never utilizing these stats. Remove all of these extra structs as we're not reporting these stats anywhere. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	sunrpc: don't change ->sv_stats if it doesn't exist	Josef Bacik
	We check for the existence of ->sv_stats elsewhere except in the core processing code. It appears that only nfsd actually exports these values anywhere, everybody else just has a write only copy of sv_stats in their svc_program. Add a check for ->sv_stats before every adjustment to allow us to eliminate the stats struct from all the users who don't report the stats. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	NFSD: fix LISTXATTRS returning more bytes than maxcount	Jorge Mora
	The maxcount is the maximum number of bytes for the LISTXATTRS4resok result. This includes the cookie and the count for the name array, thus subtract 12 bytes from the maxcount: 8 (cookie) + 4 (array count) when filling up the name array. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
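The byte accounting described in the commit above can be sketched in user-space C. This is only an illustration of the arithmetic (8 bytes of cookie plus 4 bytes of array count), not the kernel code; the helper name `listxattrs_name_budget` and the enum constants are hypothetical.

```c
#include <stdint.h>

/* Fixed overhead inside a LISTXATTRS4resok body, per the commit text. */
enum {
	COOKIE_BYTES = 8,     /* 64-bit cookie */
	NAME_COUNT_BYTES = 4, /* element count of the name array */
};

/*
 * Hypothetical helper: how many bytes of maxcount remain for the
 * encoded names once the fixed 12-byte overhead is set aside.
 * Returns 0 when maxcount cannot even hold the overhead.
 */
static uint32_t listxattrs_name_budget(uint32_t maxcount)
{
	const uint32_t overhead = COOKIE_BYTES + NAME_COUNT_BYTES;

	return maxcount > overhead ? maxcount - overhead : 0;
}
```

With a maxcount of 100, for example, this leaves 88 bytes for names.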
2024-03-01	NFSD: fix LISTXATTRS returning a short list with eof=TRUE	Jorge Mora
	If the XDR buffer is not large enough to fit all attributes and the remaining bytes left in the XDR buffer (xdrleft) is equal to the number of bytes for the current attribute, then the loop will prematurely exit without setting eof to FALSE. Also in this case, adding the eof flag to the buffer will make the reply 4 bytes larger than lsxa_maxcount. Need to check if there are enough bytes to fit not only the next attribute name but also the eof as well. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
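The check described above (room for the next name and for the trailing eof word) can be modelled in a few lines of user-space C. This is a sketch under assumptions, not the nfsd encoder: `fit_names`, `xdr_string_size`, and `EOF_BYTES` are hypothetical names, and names are assumed to be encoded as ordinary XDR strings (4-byte length plus data padded to a multiple of 4).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EOF_BYTES 4u /* the eof flag is one 4-byte XDR word */

/* Size of an XDR string: 4-byte length plus data padded to 4 bytes. */
static uint32_t xdr_string_size(uint32_t len)
{
	return 4 + ((len + 3) & ~(uint32_t)3);
}

/*
 * Hypothetical model of the encode loop: count how many names fit in
 * xdrleft bytes while always keeping EOF_BYTES in reserve. *eof is
 * set to true only when every name was emitted.
 */
static size_t fit_names(const uint32_t *name_lens, size_t n,
			uint32_t xdrleft, bool *eof)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint32_t need = xdr_string_size(name_lens[i]);

		/* Stop if the name fits only by consuming the eof word. */
		if (xdrleft < EOF_BYTES || need > xdrleft - EOF_BYTES)
			break;
		xdrleft -= need;
	}
	*eof = (i == n);
	return i;
}
```

The case from the commit message corresponds to `xdrleft` being exactly the size of the next name: the loop stops there and reports eof as false, so the client knows to ask for more.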
2024-03-01	NFSD: change LISTXATTRS cookie encoding to big-endian	Jorge Mora
	Function nfsd4_listxattr_validate_cookie() expects the cookie as an offset to the list thus it needs to be encoded in big-endian. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
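As an illustration of what big-endian encoding of a 64-bit cookie means on the wire, here is a small user-space sketch. The helpers `put_be64` and `get_be64` are hypothetical and written with plain shifts so they behave the same on any host byte order; the kernel has its own helpers for this, which are not shown here.

```c
#include <stdint.h>

/* Store a 64-bit value most-significant byte first (network order). */
static void put_be64(uint8_t out[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Read a 64-bit value that was stored most-significant byte first. */
static uint64_t get_be64(const uint8_t in[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}
```

A cookie that is an offset of 3 is therefore sent as seven zero bytes followed by 0x03, and decodes back to 3 regardless of the receiver's endianness.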
2024-03-01	NFSD: fix nfsd4_listxattr_validate_cookie	Jorge Mora
	If LISTXATTRS is sent with a correct cookie but a small maxcount, this could lead function nfsd4_listxattr_validate_cookie to return NFS4ERR_BAD_COOKIE. If maxcount = 20, then second check on function gives RHS = 3 thus any cookie larger than 3 returns NFS4ERR_BAD_COOKIE. There is no need to validate the cookie on the return XDR buffer since attribute referenced by cookie will be the first in the return buffer. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: use __fput_sync() to avoid delayed closing of files.	NeilBrown
	Calling fput() directly or through filp_close() from a kernel thread like nfsd causes the final __fput() (if necessary) to be called from a workqueue. This means that nfsd is not forced to wait for any work to complete. If the ->release or ->destroy_inode function is slow for any reason, this can result in nfsd closing files more quickly than the workqueue can complete the close and the queue of pending closes can grow without bounds (30 million has been seen at one customer site, though this was in part due to a slowness in xfs which has since been fixed). nfsd does not need this. It is quite appropriate and safe for nfsd to do its own close work. There is no reason that close should ever wait for nfsd, so no deadlock can occur. It should be safe and sensible to change all fput() calls to __fput_sync(). However in the interests of caution this patch only changes two - the two that can be most directly affected by client behaviour and could occur at high frequency. - the fput() implicit in filp_close() is changed to __fput_sync() by calling get_file() first to ensure filp_close() doesn't do the final fput() itself. This is where files opened for IO are closed. - the fput() in nfsd_read() is also changed. This is where directories opened for readdir are closed. This ensures that minimal fput work is queued to the workqueue. This removes the need for the flush_delayed_fput() call in nfsd_file_close_inode_sync() Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	nfsd: Don't leave work of closing files to a work queue	NeilBrown
	The work of closing a file can have non-trivial cost. Doing it in a separate work queue thread means that cost isn't imposed on the nfsd threads and an imbalance can be created. This can result in files being queued for the work queue more quickly than the work queue can process them, resulting in unbounded growth of the queue and memory exhaustion. To avoid this work imbalance that exhausts memory, this patch moves all closing of files into the nfsd threads. This means that when the work imposes a cost, that cost appears where it would be expected - in the work of the nfsd thread. A subsequent patch will ensure the final __fput() is called in the same (nfsd) thread which calls filp_close(). Files opened for NFSv3 are never explicitly closed by the client and are kept open by the server in the "filecache", which responds to memory pressure, is garbage collected even when there is no pressure, and sometimes closes files when there is particular need such as for rename. These files currently have filp_close() called in a dedicated work queue, so their __fput() can have no effect on nfsd threads. This patch discards the work queue and instead has each nfsd thread call filp_close() on as many as 8 files from the filecache each time it acts on a client request (or finds there are no pending client requests). If there are more to be closed, more threads are woken. This spreads the work of __fput() over multiple threads and imposes any cost on those threads. The number 8 is somewhat arbitrary. It needs to be greater than 1 to ensure that files are closed more quickly than they can be added to the cache. It needs to be small enough to limit the per-request delays that will be imposed on clients when all threads are busy closing files. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	SUNRPC: Use a static buffer for the checksum initialization vector	Chuck Lever
	Allocating and zeroing a buffer during every call to krb5_etm_checksum() is inefficient. Instead, set aside a static buffer that is the maximum crypto block size, and use a portion (or all) of that. Reported-by: Markus Elfring <Markus.Elfring@web.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	SUNRPC: fix some memleaks in gssx_dec_option_array	Zhipeng Lu
	The creds and oa->data need to be freed in the error-handling paths after their allocation. So this patch adds these deallocations in the corresponding paths. Fixes: 1d658336b05f ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth") Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	SUNRPC: fix a memleak in gss_import_v2_context	Zhipeng Lu
	The ctx->mech_used.data allocated by kmemdup is freed neither in gss_import_v2_context nor in its only caller gss_krb5_import_sec_context, which frees ctx on error. Thus, this patch reworks the last call in gss_import_v2_context, the call to gss_krb5_import_ctx_v2, preventing the memleak while preserving the return value. Fixes: 47d848077629 ("gss_krb5: handle new context format from gssd") Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01	io_uring/sqpoll: statistics of the true utilization of sq threads	Xiaobing Li
	Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo.
	Variable description: "work_time" in the code represents the sum of the jiffies of the sq thread actually processing IO, that is, how many milliseconds it actually takes to process IO. "total_time" represents the total time that the sq thread has elapsed from the beginning of the loop to the current time point, that is, how many milliseconds it has spent in total.
	The test tool is fio, and its parameters are as follows:
	[global]
	ioengine=io_uring
	direct=1
	group_reporting
	bs=128k
	norandommap=1
	randrepeat=0
	refill_buffers
	ramp_time=30s
	time_based
	runtime=1m
	clocksource=clock_gettime
	overwrite=1
	log_avg_msec=1000
	numjobs=1
	[disk0]
	filename=/dev/nvme0n1
	rw=read
	iodepth=16
	hipri
	sqthread_poll=1
	The test results are as follows:
	Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
	SqMask: 0x3
	SqHead: 3197153
	SqTail: 3197153
	CachedSqHead: 3197153
	SqThread: 9231
	SqThreadCpu: 11
	SqTotalTime: 18099614
	SqWorkTime: 16748316
	The test results corresponding to different iodepths are as follows:
	|-----------|-------|-------|-------|------|-------|
	|  iodepth  |   1   |   4   |   8   |  16  |  64   |
	|-----------|-------|-------|-------|------|-------|
	|utilization| 2.9%  | 8.8%  | 10.9% | 92.9%| 84.4% |
	|-----------|-------|-------|-------|------|-------|
	|   idle    | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% |
	|-----------|-------|-------|-------|------|-------|
	Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
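The utilization figures reported for this commit are presumably the ratio of the two fdinfo counters, SqWorkTime over SqTotalTime. A minimal user-space sketch of that calculation follows; the function name `sq_utilization_pct` is hypothetical, and reading the two values out of fdinfo is left to the caller.

```c
#include <stdint.h>

/*
 * Hypothetical helper: percentage of the sq thread's elapsed time
 * that was spent processing IO, from the SqWorkTime and SqTotalTime
 * values shown in fdinfo. Returns 0 if no time has elapsed.
 */
static double sq_utilization_pct(uint64_t work_time, uint64_t total_time)
{
	if (total_time == 0)
		return 0.0;
	return 100.0 * (double)work_time / (double)total_time;
}
```

Applied to the sample above (SqWorkTime 16748316, SqTotalTime 18099614) this gives roughly 92.5%; the idle share is 100 minus that value.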
2024-03-01	io_uring/net: move recv/recvmsg flags out of retry loop	Jens Axboe
	The flags don't change, just initialize them once rather than every loop for multishot. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01	qnx4: convert qnx4 to use the new mount api	Bill O'Donnell
	Convert the qnx4 filesystem to use the new mount API. Tested mount, umount, and remount using a qnx4 boot image. Signed-off-by: Bill O'Donnell <bodonnel@redhat.com> Link: https://lore.kernel.org/r/20240229161649.800957-1-bodonnel@redhat.com Acked-by: Anders Larsen <al@alarsen.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01	fs: use inode_set_ctime_to_ts to set inode ctime to current time	Nguyen Dinh Phi
	The function inode_set_ctime_current simply retrieves the current time and assigns it to the field __i_ctime without any alterations. Therefore, it is possible to set ctime to now directly using inode_set_ctime_to_ts Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com> Link: https://lore.kernel.org/r/20240228173031.3208743-1-phind.uet@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01	iommu/sva: Fix SVA handle sharing in multi device case	Zhangfei Gao
	In the multi-device case, iommu_sva_bind_device jumps directly to out when it finds an existing domain, skipping the list_add of the handle, which causes the handle to fail to be shared. Fixes: 65d4418c5002 ("iommu/sva: Restore SVA handle sharing") Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20240227064821.128-1-zhangfei.gao@linaro.org Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-03-01	Merge tag 'at91-dt-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux into soc/dt	Arnd Bergmann
	Microchip AT91 device tree updates for v6.9 It contains: - use DMA for DBGU of at91sam9x5ek.dtsi and USART3 of at91sam9g25-gardena-smart-gateway.dts - the new SAMA7G54 Curiosity board - cleanups * tag 'at91-dt-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux: ARM: dts: microchip: sama7g5: add sama7g5 compatible ARM: dts: microchip: sam9x60: align dmas to the opening '<' ARM: dts: microchip: sama7g5: align dmas to the opening '<' ARM: dts: microchip: sama7g54_curiosity: Add initial device tree of the board ARM: dts: microchip: sama7g5: Add flexcom 10 node dt-bindings: ARM: at91: Document Microchip SAMA7G54 Curiosity ARM: dts: microchip: gardena-smart-gateway: Use DMA for USART3 ARM: dts: microchip: at91sam9x5ek: Use DMA for DBGU serial port Link: https://lore.kernel.org/r/20240226183635.1964704-1-claudiu.beznea@tuxon.dev Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	Merge tag 'zynqmp-dt-for-6.9' of https://github.com/Xilinx/linux-xlnx into soc/dt	Arnd Bergmann
	arm64: ZynqMP DT changes for 6.9 dt-bindings: - Describe firmware for Versal NET - Describe all firmware child nodes - Align versal-fpga node name with dt schema - Describe k26 rev2 and kv260 DTs: - Align firmware node with dt schema - Add an optee node - Describe reset for CANs - Update ECAM size to discover up to 256 buses - Describe assigned-clocks for uarts - Add u-boot node - Comment SMMU entries - Align dwc3 nodes with dt schema - Rename i2c groups to match dt schema - Small DT updates (comments) - Fix default clock frequency for si570 (zcu102, zcu106) - Add output-enable pins and cover MIO38 (SOM) * tag 'zynqmp-dt-for-6.9' of https://github.com/Xilinx/linux-xlnx: (21 commits) dt-bindings: firmware: xilinx: Describe soc-nvmem subnode dt-bindings: soc: xilinx: Add support for KV260 CC dt-bindings: soc: xilinx: Add support for K26 rev2 SOMs arm64: zynqmp: Align usb clock nodes with binding arm64: zynqmp: Comment all smmu entries arm64: zynqmp: Rename i2c?-gpio to i2c?-gpio-grp arm64: zynqmp: Disable Tri-state for MIO38 Pin arm64: zynqmp: Remove incorrect comment from kv260s arm64: zynqmp: Introduce u-boot options node with bootscr-address arm64: zynqmp: Fix comment to be aligned with board name. arm64: zynqmp: Update ECAM size to discover up to 256 buses arm64: zynqmp: Describe assigned-clocks for uarts arm64: zynqmp: Setup default si570 frequency to 156.25MHz arm64: zynqmp: Add resets property for CAN nodes arm64: zynqmp: Add an OP-TEE node to the device tree arm64: zynqmp: Add output-enable pins to SOMs arm64: zynqmp: Rename zynqmp-power node to power-management dt-bindings: firmware: xilinx: Sort node names (clock-controller) dt-bindings: firmware: xilinx: Describe missing child nodes dt-bindings: firmware: xilinx: Fix versal-fpga node name ... Link: https://lore.kernel.org/r/CAHTX3dLEoFMTGg1Q4+OuOwWYd8N73YBTXki8Vvj3cGHUpLJ0=A@mail.gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	Merge tag 'sgx-for-v6.9-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into soc/dt	Arnd Bergmann
	Add PowerVR Series5 SGX GPUs for the TI SoCs With the Imagination Rogue GPU binding added, let's also add the devicetree binding for earlier SGX GPUs. Let's also patch the TI SoCs for the related SGX GPU nodes. Based on the mailing list discussions, the conclusion was that we need two separate device tree bindings, one for Rogue and upcoming GPUS, and one for the older SGX GPUs. For merging the changes, I applied the binding changes together with the TI SoC related changes into a branch leaving out the sun6i and mips changes as suggested by Rob. These changes are mostly 32-bit SoCs, but also contains one arm64 change. It does not cause any merge conflicts. * tag 'sgx-for-v6.9-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: arm64: dts: ti: k3-am654-main: Add device tree entry for SGX GPU ARM: dts: DRA7xx: Add device tree entry for SGX GPU ARM: dts: AM437x: Add device tree entry for SGX GPU ARM: dts: AM33xx: Add device tree entry for SGX GPU ARM: dts: omap5: Add device tree entry for SGX GPU ARM: dts: omap4: Add device tree entry for SGX GPU ARM: dts: omap3: Add device tree entry for SGX GPU dt-bindings: gpu: Add PowerVR Series5 SGX GPUs dt-bindings: gpu: Rename img,powervr to img,powervr-rogue Link: https://lore.kernel.org/r/pull-1708943489-872615@atomide.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	Merge tag 'imx-dt64-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into soc/dt	Arnd Bergmann
	i.MX arm64 device tree for 6.9: - New board support: Apalis eval v1.2 carrier board, Variscite VAR-SOM-MX93, phyBOARD-Segin-i.MX93. - A series from Adam Ford to enable bluetooth, configure multiple queues on eqos, remove unnecessary clock configuration for i.MX8 Beacon boards. - Several changesets from Alexander Stein to add i.MX8DXP support, enable audio and GPU for i.MX8QXP, re-parent MEDIA_MIPI_PHY1_REF clock for i.MX8MP, and improve MBA8xx board description. - A few dt-schema fixes from Fabio Estevam for i.MX8MM and i.MX93 devices. - A bunch of changes from Frank Li to improve i.MX8QM and i.MX8DXL support, correcting edma3 power-domains and interrupt numbers, adding I2C, FlexCAN and SMMU devices, etc. - A series from Frieder Schrempf to improve imx8mm-kontron board descriptions, disabling pulls, fixing up RTC device, adding EEPROM, and refactoring OSM-S module, etc. - A set of Data Modul i.MX8M Plus eDM SBC improvements from Marek Vasut. - A series from Shengjiu Wang to add PDM microphone and SPDIF sound card support for imx8mm-evk board. - A series of imx8mm-venice boards improvement from Tim Harvey to add TPM device, fix USB OTG VBUS etc. - Other small and random improvements on various boards. * tag 'imx-dt64-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: (77 commits) arm64: dts: imx8mm-kontron-bl-osm-s: Fix Ethernet PHY compatible arm64: dts: imx8-apalis-v1.1: Remove reset-names from ethernet-phy arm64: dts: imx8mp-evk: Fix hdmi@3d node arm64: dts: imx93-var-som: Remove phy-supply from eqos arm64: dts: imx8mp-phyboard-pollux: Disable pull-up for CD GPIO arm64: dts: imx8mp-phyboard-pollux: Reduce drive strength for eqos tx lines arm64: dts: imx8mp-phyboard-pollux: Set debug uart muxing to 0x140 arm64: dts: imx8mp-phyboard-pollux: Add and update rtc devicetree node arm64: dts: imx8mm-evk: Add spdif sound card support arm64: dts: mba8xx: Add missing #interrupt-cells arm64: dts: imx8mp: Set SPI NOR to max 40 MHz on Data Modul i.MX8M Plus eDM SBC arm64: dts: imx8mn: tqma8mqnl-mba8mx: Add USB DR overlay arm64: dts: imx8mq: tqma8mq-mba8mx: Add missing USB vbus supply arm64: dts: freescale: imx8mm/imx8mq: mba8mx: Use PCIe clock generator arm64: dts: imx8mn-beacon: Remove unnecessary clock configuration arm64: dts: imx8mn: Slow default video_pll clock rate arm64: dts: imx8mp-beacon: Configure multiple queues on eqos arm64: dts: imx8mp-beacon: Enable Bluetooth arm64: dts: freescale: minor whitespace cleanup arm64: dts: lx2160a: Fix DTS for full PL011 UART ... Link: https://lore.kernel.org/r/20240226034147.233993-4-shawnguo2@yeah.net Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block	Jens Axboe
	Pull MD updates from Song: "The major changes are: 1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse. 2. Clean up and fix for md_ioctl, by Li Nan. 3. Other small fixes, by Gui-Dong Han and Heming Zhao." * tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits) md/raid1: factor out helpers to choose the best rdev from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out read_first_rdev() from read_balance() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1-10: add a helper raid1_check_read_range() md/raid1: fix choose next idle in read_balance() md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: factor out helpers to add rdev to conf md: add a new helper rdev_has_badblock() md/raid5: fix atomicity violation in raid5_cache_count md/md-bitmap: fix incorrect usage for sb_index md: check mddev->pers before calling md_set_readonly() md: clean up openers check in do_md_stop() and md_set_readonly() md: sync blockdev before stopping raid or setting readonly md: factor out a helper to sync mddev md: Don't clear MD_CLOSING when the raid is about to stop md: return directly before setting did_set_md_closing md: clean up invalid BUG_ON in md_ioctl ...
2024-03-01	locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()	Uros Bizjak
	Use try_cmpxchg() instead of cmpxchg(ptr, old, new) == old. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after CMPXCHG (and related move instruction in front of CMPXCHG). Also, try_cmpxchg() implicitly assigns old ptr value to "old" when CMPXCHG fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE() to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240124104953.612063-1-ubizjak@gmail.com
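The difference between the two loop shapes can be sketched in userspace with C11 atomics. This is only an analog, not the kernel code: `HAS_WAITERS`, `cmpxchg_ulong()` and both `mark_waiters_*()` functions are hypothetical names, and `atomic_compare_exchange_weak_explicit()` stands in for try_cmpxchg_relaxed() because it likewise returns a success flag and writes the observed value back into the expected-value variable on failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define HAS_WAITERS 1UL /* hypothetical flag bit, modeled on a "has waiters" bit */

/* cmpxchg()-style helper (hypothetical): returns the value found in *p. */
static unsigned long cmpxchg_ulong(atomic_ulong *p, unsigned long old,
                                   unsigned long newval)
{
    /* On failure C11 stores the current value into 'old'; on success 'old'
     * already equals the previous value, so returning it is right both ways. */
    atomic_compare_exchange_strong_explicit(p, &old, newval,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return old;
}

/* Old shape: compare the returned value, re-read *p on every retry. */
static void mark_waiters_cmpxchg(atomic_ulong *p)
{
    unsigned long owner;

    assert(p != NULL);
    do {
        owner = atomic_load_explicit(p, memory_order_relaxed);
    } while (cmpxchg_ulong(p, owner, owner | HAS_WAITERS) != owner);
}

/* New shape: read once; a failed exchange refreshes 'owner' for the retry. */
static void mark_waiters_try(atomic_ulong *p)
{
    unsigned long owner;

    assert(p != NULL);
    owner = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(p, &owner,
                                                  owner | HAS_WAITERS,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ; /* nothing to do: 'owner' now holds the value seen in *p */
}
```

Both functions leave the same value behind; the second avoids the explicit comparison and the reload inside the loop.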
2024-03-01	Merge tag 'imx-dt-6.9' of ↵	Arnd Bergmann
	git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into soc/dt i.MX ARM device tree for 6.9: - New board support: Sielaff i.MX6 Solo, Apalis Evaluation Board v1.2. - A bunch of i.MX7 TQMA7/MBA7 updates from Alexander Stein that add various devices, improve hardware descriptions and fix dt-schema warnings, etc. - Correct touchscreen rotation for imx6sl-tolino-shine2hd board. - An imx53-qsb update from Dmitry Baryshkov to add HDMI expander support. - A couple of i.MX1 and i.MX28 device node name fixes from Fabio Estevam. - Enable usb3-lpm-capable for LS1021A usb3 node. - A couple of imx6dl-yapp4 board improvements from Michal Vokáč. - A series from Sebastian Reichel to improve imx6ull descriptions. * tag 'imx-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: (43 commits) ARM: dts: nxp: imx: fix weim node name ARM: dts: nxp: imx6ul: fix touchscreen node name ARM: dts: nxp: imx6ul: xnur-gpio -> xnur-gpios ARM: dts: imx6ul: Remove fsl,anatop from usbotg1 ARM: dts: imx6ull: fix pinctrl node name ARM: dts: imx1-apf9328: Fix Ethernet node name ARM: dts: imx28-evk: Use 'eeprom' as the node name ARM: dts: ls1021a: Enable usb3-lpm-capable for usb3 node ARM: dts: imx6dl-yapp4: Move the internal switch PHYs under the switch node ARM: dts: imx6dl-yapp4: Fix typo in the QCA switch register address ARM: dts: imx6ul: Set macaddress location in ocotp ARM: dts: imx53-qsb: add support for the HDMI expander ARM: dts: imx6ull-dhcom: Remove /omit-if-no-ref/ from node usdhc1-pwrseq ARM: dts: imx: Add support for Apalis Evaluation Board v1.2 ARM: dts: imx6: skov: add aliases for all ethernet nodes ARM: dts: imx6qdl-hummingboard: Add rtc0 and rtc1 aliases to fix hctosys ARM: dts: imx6dl: Add support for Sielaff i.MX6 Solo board ARM: dts: imx6ul: Add missing #thermal-sensor-cells to tempmon ARM: dts: imx6sl-tolino-shine2hd: fix touchscreen rotation ARM: dts: imx6ull-dhcor: Remove 900MHz operating point ... 
Link: https://lore.kernel.org/r/20240226034147.233993-3-shawnguo2@yeah.net Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	Merge tag 'imx-bindings-6.9' of ↵	Arnd Bergmann
	git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into soc/dt i.MX dt-bindings for 6.9: - New compatibles for boards: TQMa8Xx, Sielaff i.MX6 Solo, Toradex Apalis imx6q-eval-v1.2, VAR-SOM-MX93, phyBOARD-Segin-i.MX93, UNI-T UTi260B. - Add vendor prefix for UNI-T. * tag 'imx-bindings-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: dt-bindings: arm: add UNI-T UTi260B dt-bindings: vendor-prefixes: add UNI-T dt-bindings: arm: fsl: remove redundant company name dt-bindings: arm: fsl: add imx8qm apalis eval v1.2 carrier board dt-bindings: arm: fsl: Add toradex,apalis_imx6q-eval-v1.2 board dt-bindings: arm: fsl: Add phyBOARD-Segin-i.MX93 dt-bindings: arm: fsl: Add Sielaff i.MX6 Solo board dt-bindings: arm: fsl: Add VAR-SOM-MX93 with Symphony dt-bindings: arm: add TQMa8Xx boards Link: https://lore.kernel.org/r/20240226034147.233993-2-shawnguo2@yeah.net Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	Merge tag 'socfpga_dts_updates_for_v6.9' of ↵	Arnd Bergmann
	git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux into soc/dt SoCFPGA DTS updates for v6.9 - Drop the "master" suffix in I3C controller node name * tag 'socfpga_dts_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux: arm64: dts: intel: agilex5: drop "master" I3C node name suffix Link: https://lore.kernel.org/r/20240226012528.20380-1-dinguyen@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-03-01	locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix	Uros Bizjak
	Implement local_xchg() using the CMPXCHG instruction without the LOCK prefix. XCHG is expensive due to the implied LOCK prefix. The processor cannot prefetch cachelines if XCHG is used. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240124105816.612670-1-ubizjak@gmail.com
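The loop shape this describes, an exchange built from a compare-and-exchange retry loop, can be sketched in userspace C. This is an illustration only: `xchg_via_cmpxchg()` is a hypothetical name, and C11 atomics cannot express a CMPXCHG without the LOCK prefix, so a compiler will still emit a locked instruction here. Only the control flow is modeled, not the CPU-local cost saving.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Store 'newval' into *p and return the value it replaced, using only
 * compare-and-exchange (hypothetical userspace model). */
static long xchg_via_cmpxchg(atomic_long *p, long newval)
{
    long old;

    assert(p != NULL);
    old = atomic_load_explicit(p, memory_order_relaxed);
    /* A failed exchange refreshes 'old' with the current value of *p,
     * so the loop body is empty. */
    while (!atomic_compare_exchange_weak_explicit(p, &old, newval,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;
    return old;
}
```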
2024-03-01	x86/boot: Use 32-bit XOR to clear registers	Uros Bizjak
	x86_64 zero extends 32-bit operations, so for 64-bit operands, XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids a REX prefix byte when legacy registers are used. Slightly smaller code generated, no change in functionality. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240124103859.611372-1-ubizjak@gmail.com
2024-03-01	libfs: add stashed_dentry_prune()	Christian Brauner
	Both pidfs and nsfs use a memory location to stash a dentry for reuse by concurrent openers. Right now two custom dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do the same thing. The only thing that differs is how they get to the memory location used to store or retrieve the dentry. Fix that by remembering the stashing location for the dentry in dentry->d_fsdata, which allows us to retrieve it in dentry->d_prune. That in turn makes it possible to add a common helper that pidfs and nsfs can both use. Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
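A minimal userspace model of the idea, assuming a simplified object in place of a dentry: each object remembers which slot it was stashed in (the role d_fsdata plays here), so a single prune routine works for any owner. `struct node`, `node_slot`, `stash_node()` and `stashed_prune()` are hypothetical names for illustration, not the kernel helper itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node;
typedef _Atomic(struct node *) node_slot;

struct node {
    node_slot *stash; /* where this node is stashed; models dentry->d_fsdata */
};

/* Stash 'n' in 'slot' and remember the slot inside the node. */
static void stash_node(node_slot *slot, struct node *n)
{
    assert(slot != NULL && n != NULL);
    n->stash = slot;
    atomic_store(slot, n);
}

/* Common prune: clear the slot only if it still holds this very node,
 * since another task may already have stashed its own node there. */
static void stashed_prune(struct node *n)
{
    struct node *expected = n;

    assert(n != NULL);
    if (n->stash == NULL)
        return;
    atomic_compare_exchange_strong(n->stash, &expected, (struct node *)NULL);
}
```

Because the slot address travels with the object, the prune routine needs no knowledge of whether the slot belongs to a namespace or to a struct pid.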
2024-03-01	libfs: improve path_from_stashed() helper	Christian Brauner
	In earlier patches we moved both nsfs and pidfs to path_from_stashed(). The helper currently tries to add and stash a new dentry if a reusable dentry couldn't be found and returns EAGAIN if it lost the race to stash the dentry. The caller can use EAGAIN to retry. The helper and the two filesystems can be written in a way that makes returning EAGAIN unnecessary. To do this we need to change the dentry->d_prune() implementation of nsfs and pidfs to not simply replace the stashed dentry with NULL but to use a cmpxchg() and only replace their own dentry. path_from_stashed() can then be changed to stash a new dentry not just when no dentry is currently stashed but also when an already dead dentry is stashed. If another task managed to install a dentry in the meantime, it can simply be reused. Pack that into a loop and call it a day. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wgtLF5Z5=15-LKAczWm=-tUjHO+Bpf7WjBG+UU3s=fEQw@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
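The retry loop described above can be modeled in userspace as a sketch. It assumes a plain reference count where zero means "dead"; `struct entry`, `entry_tryget()` and `get_or_stash()` are hypothetical names, and the real helper operates on dentries rather than on this simplified type.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
    atomic_int refs; /* 0 means the entry is dead and must not be reused */
    int id;
};

/* Take a reference unless the count has already dropped to zero. */
static bool entry_tryget(struct entry *e)
{
    int old = atomic_load(&e->refs);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&e->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Return a live stashed entry, or install 'fresh' (which the caller
 * already holds one reference on). Never reports "try again". */
static struct entry *get_or_stash(_Atomic(struct entry *) *slot,
                                  struct entry *fresh)
{
    struct entry *cur;

    assert(slot != NULL && fresh != NULL);
    cur = atomic_load(slot);
    for (;;) {
        if (cur != NULL && entry_tryget(cur))
            return cur; /* reuse what is already stashed */
        /* Slot is empty or holds a dead entry: replace exactly what we saw. */
        if (atomic_compare_exchange_strong(slot, &cur, fresh))
            return fresh;
        /* Lost the race: 'cur' now holds the other task's entry; retry. */
    }
}
```

When the returned pointer is not `fresh`, the caller in this model would drop its unused `fresh` object.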
2024-03-01	pidfs: convert to path_from_stashed() helper	Christian Brauner
	Moving pidfds from the anonymous inode infrastructure to a separate tiny in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes selinux denials, and thus causes various userspace components that make heavy use of pidfds to fail, because pidfds used anon_inode_getfile(), which isn't subject to any LSM hooks. But dentry_open() is, and that would cause regressions. The failures that are seen are selinux denials. But the core failure is dbus-broker. That cascades into other services failing that depend on dbus-broker. For example, when dbus-broker fails to start, polkit and all the others won't be able to work because they depend on dbus-broker. dbus-broker fails because it doesn't handle failures for SO_PEERPIDFD correctly. Last kernel release we introduced SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit and others to receive a pidfd for the peer of an AF_UNIX socket. This is the first time in the history of Linux that we can safely authenticate clients in a race-free manner. dbus-broker immediately made use of this but messed up the error checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD. That's obviously problematic not just because of LSM denials but because of seccomp denials that would prevent SO_PEERPIDFD from working; or any other new error code from there. So this is catching a flawed implementation in dbus-broker as well. It has to fall back to the old pid-based authentication when SO_PEERPIDFD doesn't work, no matter the reason; otherwise it'll always risk such failures. So overall that LSM denial should not have caused dbus-broker to fail. It can never assume that a feature released one kernel ago, like SO_PEERPIDFD, is available. So, the next fix, separate from the selinux policy update, is to try and fix dbus-broker at [3]. That should make it into Fedora as well. In addition the selinux reference policy should also be updated. See [4] for that. 
If SELinux is in enforcing mode in userspace and it encounters anything that it doesn't know about, it will deny it by default. And the policy is entirely in userspace, including declaring new types for stuff like nsfs or pidfs to allow it. For now we continue to raise S_PRIVATE on the inode if it's a pidfs inode, which means things behave exactly like before. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630 Link: https://github.com/fedora-selinux/selinux-policy/pull/2050 Link: https://github.com/bus1/dbus-broker/pull/343 [3] Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4] Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01	nsfs: convert to path_from_stashed() helper	Christian Brauner
	Use the newly added path_from_stashed() helper for nsfs. Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01	libfs: add path_from_stashed()	Christian Brauner
	Add a helper for both nsfs and pidfs to reuse an already stashed dentry or to add and stash a new dentry. Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01	pidfd: add pidfs	Christian Brauner
	This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows: * statx() on pidfds becomes useful for the first time. * pidfds can be compared simply via statx() and then comparing inode numbers. * pidfds have unique inode numbers for the system lifetime. * struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. * file->private_data is freed up for per-file options for pidfds. * Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes. The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed. We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. 
I know that systemd does use separate pidfds for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted. The code is entirely optional and fairly small. If it's not selected we fall back to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces. Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
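The comparison by inode number mentioned in the list above might look like the following from userspace. This is a sketch under assumptions: `same_inode()` is a hypothetical helper, and it uses fstat() rather than statx() to stay portable (both report the inode number). It gives a meaningful answer for pidfds only on a kernel where pidfds live on pidfs; with the anonymous-inode fallback all pidfds share one inode.

```c
#include <sys/stat.h>

/* Returns 1 if both descriptors refer to the same inode on the same
 * filesystem, 0 if they do not, and -1 if either fstat() call fails. */
static int same_inode(int fd_a, int fd_b)
{
    struct stat a, b;

    if (fstat(fd_a, &a) < 0 || fstat(fd_b, &b) < 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

On such a kernel, two pidfds obtained for the same process would compare equal, and pidfds for different processes would not.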