summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-03-01nfsd: allow layout state to be admin-revoked.NeilBrown
When there is layout state on a filesystem that is being "unlocked" that is now revoked, which involves closing the nfsd_file and releasing the vfs lease. To avoid races, ->ls_file can now be accessed either: - under ->fi_lock for the state's sc_file or - under rcu_read_lock() if nfsd_file_get() is used. To support this, ->fence_client and nfsd4_cb_layout_fail() now take a second argument being the nfsd_file. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow delegation state ids to be revoked and then freedNeilBrown
Revoking state through 'unlock_filesystem' now revokes any delegation states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. As there is already support for revoking delegations, we build on that for admin-revoking. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow open state ids to be revoked and then freedNeilBrown
Revoking state through 'unlock_filesystem' now revokes any open states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Possibly the related lock states should be revoked too, but a subsequent patch will do that for all lock state on the superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow lock state ids to be revoked and then freedNeilBrown
Revoking state through 'unlock_filesystem' now revokes any lock states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow admin-revoked NFSv4.0 state to be freed.NeilBrown
For NFSv4.1 and later the client easily discovers if there is any admin-revoked state and will then find and explicitly free it. For NFSv4.0 there is no such mechanism. The client can only find that state is admin-revoked if it tries to use that state, and there is no way for it to explicitly free the state. So the server must hold on to the stateid (at least) for an indefinite amount of time. A RELEASE_LOCKOWNER request might justify forgetting some of these stateids, as would the whole clients lease lapsing, but these are not reliable. This patch takes two approaches. Whenever a client uses an revoked stateid, that stateid is then discarded and will not be recognised again. This might confuse a client which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at all, but should mostly work. Hopefully one error will lead to other resources being closed (e.g. process exits), which will result in more stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED. Also, any admin-revoked stateids that have been that way for more than one lease time are periodically revoke. No actual freeing of state happens in this patch. That will come in future patches which handle the different sorts of revoked state. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: report in /proc/fs/nfsd/clients/*/states when state is admin-revokeNeilBrown
Add "admin-revoked" to the status information for any states that have been admin-revoked. This can be useful for confirming correct behaviour. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: allow state with no file to appear in /proc/fs/nfsd/clients/*/statesNeilBrown
Change the "show" functions to show some content even if a file cannot be found. This is the case for admin-revoked state. This is primarily useful for debugging - to ensure states are being removed eventually. So change several seq_printf() to seq_puts(). Some of these are needed to keep checkpatch happy. Others were done for consistency. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: prepare for supporting admin-revocation of stateNeilBrown
The NFSv4 protocol allows state to be revoked by the admin and has error codes which allow this to be communicated to the client. This patch - introduces a new state-id status SC_STATUS_ADMIN_REVOKED which can be set on open, lock, or delegation state. - reports NFS4ERR_ADMIN_REVOKED when these are accessed - introduces a per-client counter of these states and returns SEQ4_STATUS_ADMIN_STATE_REVOKED when the counter is not zero. Decrements this when freeing any admin-revoked state. - introduces stub code to find all interesting states for a given superblock so they can be revoked via the 'unlock_filesystem' file in /proc/fs/nfsd/ No actual states are handled yet. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: split sc_status out of sc_typeNeilBrown
sc_type identifies the type of a state - open, lock, deleg, layout - and also the status of a state - closed or revoked. This is a bit untidy and could get worse when "admin-revoked" states are added. So clean it up. With this patch, the type is now all that is stored in sc_type. This is zero when the state is first added to ->cl_stateids (causing it to be ignored), and is then set appropriately once it is fully initialised. It is set under ->cl_lock to ensure atomicity w.r.t lookup. It is now never cleared. sc_type is still a bit-set even though at most one bit is set. This allows lookup functions to be given a bitmap of acceptable types. sc_type is now an unsigned short rather than char. There is no value in restricting to just 8 bits. All the constants now start SC_TYPE_ matching the field in which they are stored. Keeping the existing names and ensuring clear separation from non-type flags would have required something like NFS4_STID_TYPE_CLOSED which is cumbersome. The "NFS4" prefix is redundant was they only appear in NFS4 code, so remove that and change STID to SC to match the field. The status is stored in a separate unsigned short named "sc_status". It has two flags: SC_STATUS_CLOSED and SC_STATUS_REVOKED. CLOSED combines NFS4_CLOSED_STID, NFS4_CLOSED_DELEG_STID, and is used for SC_TYPE_LOCK and SC_TYPE_LAYOUT instead of setting the sc_type to zero. These flags are only ever set, never cleared. For deleg stateids they are set under the global state_lock. For open and lock stateids they are set under ->cl_lock. For layout stateids they are set under ->ls_lock nfs4_unhash_stid() has been removed, and we never set sc_type = 0. This was only used for LOCK and LAYOUT stids and they now use SC_STATUS_CLOSED. Also TRACE_DEFINE_NUM() calls for the various STID #define have been removed because these things are not enums, and so that call is incorrect. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: avoid race after unhash_delegation_locked()NeilBrown
NFS4_CLOSED_DELEG_STID and NFS4_REVOKED_DELEG_STID are similar in purpose. REVOKED is used for NFSv4.1 states which have been revoked because the lease has expired. CLOSED is used in other cases. The difference has two practical effects. 1/ REVOKED states are on the ->cl_revoked list 2/ REVOKED states result in nfserr_deleg_revoked from nfsd4_verify_open_stid() and nfsd4_validate_stateid while CLOSED states result in nfserr_bad_stid. Currently a state that is being revoked is first set to "CLOSED" in unhash_delegation_locked(), then possibly to "REVOKED" in revoke_delegation(), at which point it is added to the cl_revoked list. It is possible that a stateid test could see the CLOSED state which really should be REVOKED, and so return the wrong error code. So it is safest to remove this window of inconsistency. With this patch, unhash_delegation_locked() always sets the state correctly, and revoke_delegation() no longer changes the state. Also remove a redundant test on minorversion when NFS4_REVOKED_DELEG_STID is seen - it can only be seen when minorversion is non-zero. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: don't call functions with side-effecting inside WARN_ON()NeilBrown
Code like: WARN_ON(foo()) looks like an assertion and might not be expected to have any side effects. When testing if a function with side-effects fails a construct like if (foo()) WARN_ON(1); makes the intent more obvious. nfsd has several WARN_ON calls where the test has side effects, so it would be good to change them. These cases don't really need the WARN_ON. They have never failed in 8 years of usage so let's just remove the WARN_ON wrapper. Suggested-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: hold ->cl_lock for hash_delegation_locked()NeilBrown
The protocol for creating a new state in nfsd is to allocate the state leaving it largely uninitialised, add that state to the ->cl_stateids idr so as to reserve a state-id, then complete initialisation of the state and only set ->sc_type to non-zero once the state is fully initialised. If a state is found in the idr with ->sc_type == 0, it is ignored. The ->cl_lock lock is used to avoid races - it is held while checking sc_type during lookup, and held when a non-zero value is stored in ->sc_type. ... except... hash_delegation_locked() finalises the initialisation of a delegation state, but does NOT hold ->cl_lock. So this patch takes ->cl_lock at the appropriate time w.r.t other locks, and so ensures there are no races (which are extremely unlikely in any case). As ->fi_lock is often taken when ->cl_lock is held, we need to take ->cl_lock first of those two. Currently ->cl_lock and state_lock are never both taken at the same time. We need both for this patch so an arbitrary choice is needed concerning which to take first. As state_lock is more global, it might be more contended, so take it first. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: remove stale comment in nfs4_show_deleg()NeilBrown
As we do now support write delegations, this comment is unhelpful and misleading. Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Remove redundant cb_seq_status initializationChuck Lever
As far as I can see, setting cb_seq_status in nfsd4_init_cb() is superfluous because it is set again in nfsd4_cb_prepare(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Remove BUG_ON in nfsd4_process_cb_update()Chuck Lever
Don't kill the kworker thread, and don't panic while cl_lock is held. There's no need for scorching the earth here. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Replace comment with lockdep assertionChuck Lever
Convert a code comment into a real assertion. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Remove unused @reason argumentChuck Lever
Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Add callback operation lifetime trace pointsChuck Lever
Help observe the flow of callback operations. bc_shutdown() records exactly when the backchannel RPC client is destroyed and cl_cb_client is replaced with NULL. Examples include: nfsd-955 [004] 650.013997: nfsd_cb_queue: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try) kworker/u21:4-497 [004] 650.014050: nfsd_cb_seq_status: task:00000001@00000001 sessionid=65b3c5b8:f541f749:00000001:00000000 tk_status=-107 seq_status=1 kworker/u21:4-497 [004] 650.014051: nfsd_cb_restart: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff88810e39f400 (first try) kworker/u21:4-497 [004] 650.014066: nfsd_cb_queue: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff88810e39f400 (need restart) kworker/u16:0-10 [006] 650.065750: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN kworker/u16:0-10 [006] 650.065752: nfsd_cb_bc_update: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try) kworker/u16:0-10 [006] 650.065754: nfsd_cb_bc_shutdown: addr=192.168.122.6:0 client 65b3c5b8:f541f749 cb=0xffff8881134b02f8 (first try) kworker/u16:0-10 [006] 650.065810: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Rename nfsd_cb_state trace pointChuck Lever
Make it clear where backchannel state is updated. Example trace point output: kworker/u16:0-10 [006] 2800.080404: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UP nfsd-940 [003] 2800.478368: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN kworker/u16:0-10 [003] 2800.478828: nfsd_cb_new_state: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN kworker/u16:0-10 [005] 2802.039724: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UP kworker/u16:0-10 [005] 2810.611452: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=FAULT kworker/u16:0-10 [005] 2810.616832: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=UNKNOWN kworker/u16:0-10 [005] 2810.616931: nfsd_cb_start: addr=192.168.122.6:0 client 65b3c5b8:f541f749 state=DOWN Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Replace dprintks in nfsd4_cb_sequence_done()Chuck Lever
Improve observability of backchannel session operation. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Add nfsd_seq4_status trace eventChuck Lever
Add a trace point that records SEQ4_STATUS flags returned in an NFSv4.1 SEQUENCE response. SEQ4_STATUS flags report backchannel issues and changes to lease state to clients. Knowing what the server is reporting to clients is useful for debugging both configuration and operational issues in real time. For example, upcoming patches will enable server administrators to revoke parts of a client's lease; that revocation is indicated to the client when a subsequent SEQUENCE operation has one or more SEQ4_STATUS flags that are set. Sample trace records: nfsd-927 [006] 615.581821: nfsd_seq4_status: xid=0x095ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT nfsd-927 [006] 615.588043: nfsd_seq4_status: xid=0x0a5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT nfsd-928 [003] 615.588448: nfsd_seq4_status: xid=0x0b5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Retransmit callbacks after client reconnectsChuck Lever
NFSv4.1 clients assume that if they disconnect, that will force the server to resend pending callback operations once a fresh connection has been established. Turns out NFSD has not been resending after reconnect. Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Reschedule CB operations when backchannel rpc_clnt is shut downChuck Lever
As part of managing a client disconnect, NFSD closes down and replaces the backchannel rpc_clnt. If a callback operation is pending when the backchannel rpc_clnt is shut down, currently nfsd4_run_cb_work() just discards that callback. But there are multiple cases to deal with here: o The client's lease is getting destroyed. Throw the CB away. o The client disconnected. It might be forcing a retransmit of CB operations, or it could have disconnected for other reasons. Reschedule the CB so it is retransmitted when the client reconnects. Since callback operations can now be rescheduled, ensure that cb_ops->prepare can be called only once by moving the cb_ops->prepare paragraph down to just before the rpc_call_async() call. Fixes: 2bbfed98a4d8 ("nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Convert the callback workqueue to use delayed_workChuck Lever
Normally, NFSv4 callback operations are supposed to be sent to the client as soon as they are queued up. In a moment, I will introduce a recovery path where the server has to wait for the client to reconnect. We don't want a hard busy wait here -- the callback should be requeued to try again in several milliseconds. For now, convert nfsd4_callback from struct work_struct to struct delayed_work, and queue with a zero delay argument. This should avoid behavior changes for current operation. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: Reset cb_seq_status after NFS4ERR_DELAYChuck Lever
I noticed that once an NFSv4.1 callback operation gets a NFS4ERR_DELAY status on CB_SEQUENCE and then the connection is lost, the callback client loops, resending it indefinitely. The switch arm in nfsd4_cb_sequence_done() that handles NFS4ERR_DELAY uses rpc_restart_call() to rearm the RPC state machine for the retransmit, but that path does not call the rpc_prepare_call callback again. Thus cb_seq_status is set to -10008 by the first NFS4ERR_DELAY result, but is never set back to 1 for the retransmits. nfsd4_cb_sequence_done() thinks it's getting nothing but a long series of CB_SEQUENCE NFS4ERR_DELAY replies. Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: make svc_stat per-network namespace instead of globalJosef Bacik
The final bit of stats that is global is the rpc svc_stat. Move this into the nfsd_net struct and use that everywhere instead of the global struct. Remove the unused global struct. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: remove nfsd_stats, make th_cnt a global counterJosef Bacik
This is the last global stat, take it out of the nfsd_stats struct and make it a global part of nfsd, report it the same as always. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: make all of the nfsd stats per-network namespaceJosef Bacik
We have a global set of counters that we modify for all of the nfsd operations, but now that we're exposing these stats across all network namespaces we need to make the stats also be per-network namespace. We already have some caching stats that are per-network namespace, so move these definitions into the same counter and then adjust all the helpers and users of these stats to provide the appropriate nfsd_net struct so that the stats are maintained for the per-network namespace objects. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: expose /proc/net/sunrpc/nfsd in net namespacesJosef Bacik
We are running nfsd servers inside of containers with their own network namespace, and we want to monitor these services using the stats found in /proc. However these are not exposed in the proc inside of the container, so we have to bind mount the host /proc into our containers to get at this information. Separate out the stat counters init and the proc registration, and move the proc registration into the pernet operations entry and exit points so that these stats can be exposed inside of network namespaces. This is an intermediate step, this just exposes the global counters in the network namespace. Subsequent patches will move these counters into the per-network namespace container. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: rename NFSD_NET_* to NFSD_STATS_*Josef Bacik
We're going to merge the stats all into per network namespace in subsequent patches, rename these nn counters to be consistent with the rest of the stats. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01sunrpc: remove ->pg_stats from svc_programJosef Bacik
Now that this isn't used anywhere, remove it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01sunrpc: pass in the sv_stats struct through svc_create_pooledJosef Bacik
Since only one service actually reports the rpc stats there's not much of a reason to have a pointer to it in the svc_program struct. Adjust the svc_create_pooled function to take the sv_stats as an argument and pass the struct through there as desired instead of getting it from the svc_program->pg_stats. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: stop setting ->pg_stats for unused statsJosef Bacik
A lot of places are setting a blank svc_stats in ->pg_stats and never utilizing these stats. Remove all of these extra structs as we're not reporting these stats anywhere. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: fix LISTXATTRS returning more bytes than maxcountJorge Mora
The maxcount is the maximum number of bytes for the LISTXATTRS4resok result. This includes the cookie and the count for the name array, thus subtract 12 bytes from the maxcount: 8 (cookie) + 4 (array count) when filling up the name array. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: fix LISTXATTRS returning a short list with eof=TRUEJorge Mora
If the XDR buffer is not large enough to fit all attributes and the remaining bytes left in the XDR buffer (xdrleft) is equal to the number of bytes for the current attribute, then the loop will prematurely exit without setting eof to FALSE. Also in this case, adding the eof flag to the buffer will make the reply 4 bytes larger than lsxa_maxcount. Need to check if there are enough bytes to fit not only the next attribute name but also the eof as well. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: change LISTXATTRS cookie encoding to big-endianJorge Mora
Function nfsd4_listxattr_validate_cookie() expects the cookie as an offset to the list thus it needs to be encoded in big-endian. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01NFSD: fix nfsd4_listxattr_validate_cookieJorge Mora
If LISTXATTRS is sent with a correct cookie but a small maxcount, this could lead function nfsd4_listxattr_validate_cookie to return NFS4ERR_BAD_COOKIE. If maxcount = 20, then second check on function gives RHS = 3 thus any cookie larger than 3 returns NFS4ERR_BAD_COOKIE. There is no need to validate the cookie on the return XDR buffer since attribute referenced by cookie will be the first in the return buffer. Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic") Signed-off-by: Jorge Mora <mora@netapp.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: use __fput_sync() to avoid delayed closing of files.NeilBrown
Calling fput() directly or though filp_close() from a kernel thread like nfsd causes the final __fput() (if necessary) to be called from a workqueue. This means that nfsd is not forced to wait for any work to complete. If the ->release or ->destroy_inode function is slow for any reason, this can result in nfsd closing files more quickly than the workqueue can complete the close and the queue of pending closes can grow without bounces (30 million has been seen at one customer site, though this was in part due to a slowness in xfs which has since been fixed). nfsd does not need this. It is quite appropriate and safe for nfsd to do its own close work. There is no reason that close should ever wait for nfsd, so no deadlock can occur. It should be safe and sensible to change all fput() calls to __fput_sync(). However in the interests of caution this patch only changes two - the two that can be most directly affected by client behaviour and could occur at high frequency. - the fput() implicitly in flip_close() is changed to __fput_sync() by calling get_file() first to ensure filp_close() doesn't do the final fput() itself. If is where files opened for IO are closed. - the fput() in nfsd_read() is also changed. This is where directories opened for readdir are closed. This ensure that minimal fput work is queued to the workqueue. This removes the need for the flush_delayed_fput() call in nfsd_file_close_inode_sync() Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Don't leave work of closing files to a work queueNeilBrown
The work of closing a file can have non-trivial cost. Doing it in a separate work queue thread means that cost isn't imposed on the nfsd threads and an imbalance can be created. This can result in files being queued for the work queue more quickly that the work queue can process them, resulting in unbounded growth of the queue and memory exhaustion. To avoid this work imbalance that exhausts memory, this patch moves all closing of files into the nfsd threads. This means that when the work imposes a cost, that cost appears where it would be expected - in the work of the nfsd thread. A subsequent patch will ensure the final __fput() is called in the same (nfsd) thread which calls filp_close(). Files opened for NFSv3 are never explicitly closed by the client and are kept open by the server in the "filecache", which responds to memory pressure, is garbage collected even when there is no pressure, and sometimes closes files when there is particular need such as for rename. These files currently have filp_close() called in a dedicated work queue, so their __fput() can have no effect on nfsd threads. This patch discards the work queue and instead has each nfsd thread call flip_close() on as many as 8 files from the filecache each time it acts on a client request (or finds there are no pending client requests). If there are more to be closed, more threads are woken. This spreads the work of __fput() over multiple threads and imposes any cost on those threads. The number 8 is somewhat arbitrary. It needs to be greater than 1 to ensure that files are closed more quickly than they can be added to the cache. It needs to be small enough to limit the per-request delays that will be imposed on clients when all threads are busy closing files. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01qnx4: convert qnx4 to use the new mount apiBill O'Donnell
Convert the qnx4 filesystem to use the new mount API. Tested mount, umount, and remount using a qnx4 boot image. Signed-off-by: Bill O'Donnell <bodonnel@redhat.com> Link: https://lore.kernel.org/r/20240229161649.800957-1-bodonnel@redhat.com Acked-by: Anders Larsen <al@alarsen.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01fs: use inode_set_ctime_to_ts to set inode ctime to current timeNguyen Dinh Phi
The function inode_set_ctime_current simply retrieves the current time and assigns it to the field __i_ctime without any alterations. Therefore, it is possible to set ctime to now directly using inode_set_ctime_to_ts Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com> Link: https://lore.kernel.org/r/20240228173031.3208743-1-phind.uet@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01libfs: add stashed_dentry_prune()Christian Brauner
Both pidfs and nsfs use a memory location to stash a dentry for reuse by concurrent openers. Right now two custom dentry->d_prune::{ns,pidfs}_prune_dentry() methods are needed that do the same thing. The only thing that differs is that they need to get to the memory location to store or retrieve the dentry from differently. Fix that by remember the stashing location for the dentry in dentry->d_fsdata which allows us to retrieve it in dentry->d_prune. That in turn makes it possible to add a common helper that pidfs and nsfs can both use. Link: https://lore.kernel.org/r/CAHk-=wg8cHY=i3m6RnXQ2Y2W8psicKWQEZq1=94ivUiviM-0OA@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01libfs: improve path_from_stashed() helperChristian Brauner
In earlier patches we moved both nsfs and pidfs to path_from_stashed(). The helper currently tries to add and stash a new dentry if a reusable dentry couldn't be found and returns EAGAIN if it lost the race to stash the dentry. The caller can use EAGAIN to retry. The helper and the two filesystems be written in a way that makes returning EAGAIN unnecessary. To do this we need to change the dentry->d_prune() implementation of nsfs and pidfs to not simply replace the stashed dentry with NULL but to use a cmpxchg() and only replace their own dentry. Then path_from_stashed() can then be changed to not just stash a new dentry when no dentry is currently stashed but also when an already dead dentry is stashed. If another task managed to install a dentry in the meantime it can simply be reused. Pack that into a loop and call it a day. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wgtLF5Z5=15-LKAczWm=-tUjHO+Bpf7WjBG+UU3s=fEQw@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01pidfs: convert to path_from_stashed() helperChristian Brauner
Moving pidfds from the anonymous inode infrastructure to a separate tiny in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes selinux denials and thus various userspace components that make heavy use of pidfds to fail as pidfds used anon_inode_getfile() which aren't subject to any LSM hooks. But dentry_open() is and that would cause regressions. The failures that are seen are selinux denials. But the core failure is dbus-broker. That cascades into other services failing that depend on dbus-broker. For example, when dbus-broker fails to start polkit and all the others won't be able to work because they depend on dbus-broker. The reason for dbus-broker failing is because it doesn't handle failures for SO_PEERPIDFD correctly. Last kernel release we introduced SO_PEERPIDFD (and SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit and others to receive a pidfd for the peer of an AF_UNIX socket. This is the first time in the history of Linux that we can safely authenticate clients in a race-free manner. dbus-broker immediately made use of this but messed up the error checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD. That's obviously problematic not just because of LSM denials but because of seccomp denials that would prevent SO_PEERPIDFD from working; or any other new error code from there. So this is catching a flawed implementation in dbus-broker as well. It has to fallback to the old pid-based authentication when SO_PEERPIDFD doesn't work no matter the reasons otherwise it'll always risk such failures. So overall that LSM denial should not have caused dbus-broker to fail. It can never assume that a feature released one kernel ago like SO_PEERPIDFD can be assumed to be available. So, the next fix separate from the selinux policy update is to try and fix dbus-broker at [3]. That should make it into Fedora as well. In addition the selinux reference policy should also be updated. See [4] for that. If Selinux is in enforcing mode in userspace and it encounters anything that it doesn't know about it will deny it by default. And the policy is entirely in userspace including declaring new types for stuff like nsfs or pidfs to allow it. For now we continue to raise S_PRIVATE on the inode if it's a pidfs inode which means things behave exactly like before. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630 Link: https://github.com/fedora-selinux/selinux-policy/pull/2050 Link: https://github.com/bus1/dbus-broker/pull/343 [3] Link: https://github.com/SELinuxProject/refpolicy/pull/762 [4] Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01nsfs: convert to path_from_stashed() helperChristian Brauner
Use the newly added path_from_stashed() helper for nsfs. Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01libfs: add path_from_stashed()Christian Brauner
Add a helper for both nsfs and pidfs to reuse an already stashed dentry or to add and stash a new dentry. Link: https://lore.kernel.org/r/20240218-neufahrzeuge-brauhaus-fb0eb6459771@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-01pidfd: add pidfsChristian Brauner
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows: * statx() on pidfds becomes useful for the first time. * pidfds can be compared simply via statx() and then comparing inode numbers. * pidfds have unique inode numbers for the system lifetime. * struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close. * file->private_data is freed up for per-file options for pidfds. * Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes. The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed. We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. I know that systemd does use separate pidfd for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted. The code is entirely optional and fairly small. If it's not selected we fallback to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces. Link: https://lore.kernel.org/r/20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/mptcp/protocol.c adf1bb78dab5 ("mptcp: fix snd_wnd initialization for passive socket") 9426ce476a70 ("mptcp: annotate lockless access for RX path fields") https://lore.kernel.org/all/20240228103048.19255709@canb.auug.org.au/ Adjacent changes: drivers/dpll/dpll_core.c 0d60d8df6f49 ("dpll: rely on rcu for netdev_dpll_pin()") e7f8df0e81bf ("dpll: move xa_erase() call in to match dpll_pin_alloc() error path order") drivers/net/veth.c 1ce7d306ea63 ("veth: try harder when allocating queue memory") 0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers") drivers/net/wireless/intel/iwlwifi/mvm/d3.c 8c9bef26e98b ("wifi: iwlwifi: mvm: d3: implement suspend with MLO") 78f65fbf421a ("wifi: iwlwifi: mvm: ensure offloading TID queue exists") net/wireless/nl80211.c f78c1375339a ("wifi: nl80211: reject iftype change with mesh ID change") 414532d8aa89 ("wifi: cfg80211: use IEEE80211_MAX_MESH_ID_LEN appropriately") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-29btrfs: fix double free of anonymous device after snapshot creation failureFilipe Manana
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem. The steps that lead to this: 1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev; 2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot(); 3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev; 4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking; 5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1. Recently syzbot ran into this and reported the following trace: ------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK> Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot. To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free. CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-29btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is givenFilipe Manana
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that are no concurrent writes and we get a stable view of the inode's extent layout. When the flag is given we flush all IO (and wait for ordered extents to complete) and then lock the inode in shared mode, however that leaves open the possibility that a write might happen right after the flushing and before locking the inode. So fix this by flushing again after locking the inode - we leave the initial flushing before locking the inode to avoid holding the lock and blocking other RO operations while waiting for IO and ordered extents to complete. The second flushing while holding the inode's lock will most of the time do nothing or very little since the time window for new writes to have happened is small. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>