Age | Commit message (Collapse) | Author |
|
Minor clean up. Instead of dprintk there are appropriate error codes.
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Compilers are allowed to insert padding and reorder the
fields in a struct, so using a union of an array and a
struct in struct knfsd_fh is not reliable.
The position of elements in an array is more reliable.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
I'm about to remove the union in struct knfsd_fh. First step is to
add an accessor function for the file handle's fsid portion.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
In the past several kernel releases, we've made NFSv4.2 async copy
reliable:
- The Linux NFS client and server now both implement and use the
NFSv4.2 OFFLOAD_STATUS operation
- The Linux NFS server keeps copy stateids around longer
- The Linux NFS client and server now both implement referring call
lists
And resilient against DoS:
- The Linux NFS server limits the number of concurrent async copy
operations
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Replace flexible-array member with a fixed-size array.
With this changes, fix many instances of the following type of
warnings:
fs/nfsd/nfsfh.h:79:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/nfsd/state.h:763:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/nfsd/state.h:669:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/nfsd/state.h:549:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/nfsd/xdr4.h:705:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/nfsd/xdr4.h:678:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Refactor: Enable the use of IOCB flags to control NFSD's individual
write operations. This allows the eventual use of atomic, uncached,
direct, or asynchronous writes.
Suggested-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Refactor: Enable the use of IOCB flags to control NFSD's individual
read operations (when not using splice). This allows the eventual
use of atomic, uncached, direct, or asynchronous reads.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Sparse reports that the synopsis of nfsd_open_local_fh() does not
match its kdoc comment. Introduced by commit e6f7e1487ab5
("nfs_localio: simplify interface to nfsd for getting nfsd_file").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Sparse reports that the synopsis of nfsd_file_put_local() does not
match its kdoc comment. Introduced by commit c25a89770d1f ("nfs_localio:
change nfsd_file_put_local() to take a pointer to __rcu pointer") .
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
trace_nfsd_ctl_maxconn() was removed by commit a4b853f183a1
("sunrpc: remove all connection limit configuration") but did not
remove the event.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/linux-nfs/5ccae2f9-1560-4ac5-b506-b235ed4e4f4f@oracle.com/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Event nfsd_file_gc_recent was added by commit 64912122a4f8 ("nfsd:
filecache: introduce NFSD_FILE_RECENT") but never used.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/linux-nfs/5ccae2f9-1560-4ac5-b506-b235ed4e4f4f@oracle.com/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Events nfsd_file_lru_add_disposed and nfsd_file_lru_del_disposed
were added by commit 4a0e73e635e3 ("NFSD: Leave open files out of
the filecache LRU") but they were never used.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/linux-nfs/5ccae2f9-1560-4ac5-b506-b235ed4e4f4f@oracle.com/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
trace_nfsd_file_unhash_and_queue() was removed by commit ac3a2585f018
("nfsd: rework refcounting in filecache").
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Closes: https://lore.kernel.org/linux-nfs/5ccae2f9-1560-4ac5-b506-b235ed4e4f4f@oracle.com/T/#t
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Update error codes in decoding functions of block and scsi layout
drivers to match the core nfsd code. NFS4ERR_EINVAL means that the
server was able to decode the request, but the decoded values are
invalid. Use NFS4ERR_BADXDR instead to indicate a decoding error.
And ENOMEM is changed to nfs code NFS4ERR_DELAY.
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This limit has always been a sanity check; in nearly all cases a
large COMPOUND is a sign of a malfunctioning client. The only real
limit on COMPOUND size and complexity is the size of NFSD's send
and receive buffers.
However, there are a few cases where a large COMPOUND is sane. For
example, when a client implementation wants to walk down a long file
pathname in a single round trip.
A small risk is that now a client can construct a COMPOUND request
that can keep a single nfsd thread busy for quite some time.
Suggested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
To enable NFSD to handle NFSv4 COMPOUNDs of unrestricted size,
resize the array in struct nfsd_genl_rqstp so it saves only up to
16 operations per COMPOUND.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Clean up: A function parameter called "rqstp" typically refers to an
object of type "struct svc_rqst", so it's confusing when such an
parameter refers to a different struct type with field names that are
very similar to svc_rqst.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When the client sends an OPEN with claim type CLAIM_DELEG_CUR_FH or
CLAIM_DELEGATION_CUR, the delegation stateid and the file handle
must belong to the same file, otherwise return NFS4ERR_INVAL.
Note that RFC8881, section 8.2.4, mandates the server to return
NFS4ERR_BAD_STATEID if the selected table entry does not match the
current filehandle. However returning NFS4ERR_BAD_STATEID in the
OPEN causes the client to retry the operation and therefor get the
client into a loop. To avoid this situation we return NFS4ERR_INVAL
instead.
Reported-by: Petro Pavlov <petro.pavlov@vastdata.com>
Fixes: c44c5eeb2c02 ("[PATCH] nfsd4: add open state code for CLAIM_DELEGATE_CUR")
Cc: stable@vger.kernel.org
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Lei Lu recently reported that nfsd4_setclientid_confirm() did not check
the return value from get_client_locked(). a SETCLIENTID_CONFIRM could
race with a confirmed client expiring and fail to get a reference. That
could later lead to a UAF.
Fix this by getting a reference early in the case where there is an
extant confirmed client. If that fails then treat it as if there were no
confirmed client found at all.
In the case where the unconfirmed client is expiring, just fail and
return the result from get_client_locked().
Reported-by: lei lu <llfamsec@gmail.com>
Closes: https://lore.kernel.org/linux-nfs/CAEBF3_b=UvqzNKdnfD_52L05Mqrqui9vZ2eFamgAbV0WG+FNWQ@mail.gmail.com/
Fixes: d20c11d86d8f ("nfsd: Protect session creation and client confirm using client_lock")
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The valid values for ek_fsidtype are actually 0-7 so it's better to
change the type to u8. Also using kstrtou8() to relpace simple_strtoul(),
kstrtou8() is safer and more suitable for u8.
Suggested-by: NeilBrown <neil@brown.name>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The rqst argument to xdr_init_encode_pages is set to NULL by all callers,
and pages is always set to buf->pages. Remove the two arguments and
hardcode the assignments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When a write delegation is returned, check if read access was added
to nfs4_file when client opens file with WRONLY, and release it.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
RFC8881, section 9.1.2 says:
"In the case of READ, the server may perform the corresponding
check on the access mode, or it may choose to allow READ for
OPEN4_SHARE_ACCESS_WRITE, to accommodate clients whose WRITE
implementation may unavoidably do reads (e.g., due to buffer cache
constraints)."
and in section 10.4.1:
"Similarly, when closing a file opened for OPEN4_SHARE_ACCESS_WRITE/
OPEN4_SHARE_ACCESS_BOTH and if an OPEN_DELEGATE_WRITE delegation
is in effect"
This patch allows READ using write delegation stateid granted on OPENs
with OPEN4_SHARE_ACCESS_WRITE only, to accommodate clients whose WRITE
implementation may unavoidably do (e.g., due to buffer cache
constraints).
For write delegation granted for OPEN with OPEN4_SHARE_ACCESS_WRITE
a new nfsd_file and a struct file are allocated to use for reads.
The nfsd_file is freed when the file is closed by release_all_access.
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When netfslib is issuing subrequests, the subrequests start processing
immediately and may complete before we reach the end of the issuing
function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED
to indicate to the collector that we aren't going to issue any more subreqs
and that it can do the final notifications and cleanup.
Now, this isn't a problem if the request is synchronous
(NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be
done in-thread and we're guaranteed an opportunity to run the collector.
However, if the request is asynchronous, collection is primarily triggered
by the termination of subrequests queuing it on a workqueue. Now, a race
can occur here if the app thread sets ALL_QUEUED after the last subrequest
terminates.
This can happen most easily with the copy2cache code (as used by Ceph)
where, in the collection routine of a read request, an asynchronous write
request is spawned to copy data to the cache. Folios are added to the
write request as they're unlocked, but there may be a delay before
ALL_QUEUED is set as the write subrequests may complete before we get
there.
If all the write subreqs have finished by the ALL_QUEUED point, no further
events happen and the collection never happens, leaving the request
hanging.
Fix this by queuing the collector after setting ALL_QUEUED. This is a bit
heavy-handed and it may be sufficient to do it only if there are no extant
subreqs.
Also add a tracepoint to cross-reference both requests in a copy-to-request
operation and add a trace to the netfs_rreq tracepoint to indicate the
setting of ALL_QUEUED.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250711151005.2956810-3-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The netfs copy-to-cache that is used by Ceph with local caching sets up a
new request to write data just read to the cache. The request is started
and then left to look after itself whilst the app continues. The request
gets notified by the backing fs upon completion of the async DIO write, but
then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't
set - but the app isn't waiting there, and so the request just hangs.
Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the
notification from the backing filesystem to put the collection onto a work
queue instead.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250711151005.2956810-2-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow fuse to use the iomap writeback code even when CONFIG_BLOCK is
not enabled. Do this with an ifdef instead of a separate file to keep
the iomap_folio_state local to buffered-io.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-15-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a read_folio_range() handler for buffered writes that filesystems
may pass in if they wish to provide a custom handler for synchronously
reading in the contents of a folio.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
[hch: renamed to read_folio_range, pass less arguments]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-14-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pass the iomap_iter and derive the map inside iomap_read_folio_sync
instead of in the caller, and use the more descriptive srcmap name for
the source iomap. Stop passing the offset into folio argument as it
can be derived from the folio and the file offset. Rename the
variables for the offset into the file and the length to be more
descriptive and match the rest of the code.
Rename the function itself to iomap_read_folio_range to make the use
more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-13-hch@lst.de
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The iomap_folio_ops are only used for buffered writes, including the zero
and unshare variants. Rename them to iomap_write_ops to better describe
the usage, and pass them through the call chain like the other operation
specific methods instead of through the iomap.
xfs_iomap_valid grows a IOMAP_HOLE check to keep the existing behavior
that never attached the folio_ops to a iomap representing a hole.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow fuse to use iomap_writeback_folio for folio laundering. Note
that the caller needs to manually submit the pending writeback context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-11-hch@lst.de
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move unlocking the folio out of iomap_writeback_folio into the caller.
This means the end writeback machinery is now run with the folio locked
when no writeback happened, or writeback completed extremely fast.
Note that having the folio locked over the call to folio_end_writeback in
iomap_writeback_folio means that the dropbehind handling there will never
run because the trylock fails. The only way this can happen is if the
writepage either never wrote back any dirty data at all, in which case
the dropbehind handling isn't needed, or if all writeback finished
instantly, which is rather unlikely. Even in the latter case the
dropbehind handling is an optional optimization so skipping it will not
cause correctness issues.
This prepares for exporting iomap_writeback_folio for use in folio
laundering.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-10-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
->writepage is gone, and our naming wasn't always that great to start
with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-9-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that the writeback code has the proper abstractions, all the ioend
code can be self-contained in ioend.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-8-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a new iomap_start_folio_write helper to abstract away the
write_bytes_pending handling, and export it and the existing
iomap_finish_folio_write for non-iomap writeback in fuse.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-7-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx
one to facilitate non-block, non-ioend writeback for use. Rename
the submit_ioend method to writeback_submit and make it mandatory so
that the generic writeback code stops seeing ioends and bios.
Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace ->map_blocks with a new ->writeback_range, which differs in the
following ways:
- it must also queue up the I/O for writeback, that is called into the
slightly refactored and extended in scope iomap_add_to_ioend for
each region
- can handle only a part of the requested region, that is the retry
loop for partial mappings moves to the caller
- handles cleanup on failures as well, and thus also replaces the
discard_folio method only implemented by XFS.
This will allow to use the iomap writeback code also for file systems
that are not block based like fuse.
Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org> # zonefs
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We don't care about the count of outstanding ioends, just if there is one.
Replace the count variable passed to iomap_writepage_map_blocks with a
boolean to make that more clear.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
[hch: rename the variable, update the commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-4-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add inode and wpc fields to pass the inode and writeback context that
are needed in the entire writeback call chain, and let the callers
initialize all fields in the writeback context before calling
iomap_writepages to simplify the argument passing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Drop various unused #include statements.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-2-hch@lst.de
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
[into #fixes, unless somebody objects]
Lifetime of new_dn_mark is controlled by that of its ->fsn_mark,
pointed to by new_fsn_mark. Unfortunately, a failure exit had
been inserted between the allocation of new_dn_mark and the
call of fsnotify_init_mark(), ending up with a leak.
Fixes: 1934b212615d "file: reclaim 24 bytes from f_owner"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250712171843.GB1880847@ZenIV
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The calculation of journal credits in ext4_meta_trans_blocks() should
include pextents, as each extent separately may be allocated from a
different group and thus need to update different bitmap and group
descriptor block.
Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks")
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After ext4 supports large folios, the semantics of reserving credits in
pages is no longer applicable. In most scenarios, reserving credits in
extents is sufficient. Therefore, introduce ext4_chunk_trans_extent()
to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the
only remaining location where we are still processing extents in pages.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After ext4 supports large folios, reserving journal credits for one
maximum-ordered folio based on the worst case cenario during the
writeback process can easily exceed the maximum transaction credits.
Additionally, reserving journal credits for one page is also no
longer appropriate.
Currently, the folio writeback process can either extend the journal
credits or initiate a new transaction if the currently reserved journal
credits are insufficient. Therefore, it can be modified to reserve
credits for only one extent at the outset. In most cases involving
continuous mapping, these credits are generally adequate, and we may
only need to perform some basic credit expansion. However, in extreme
cases where the block size and folio size differ significantly, or when
the folios are sufficiently discontinuous, it may be necessary to
restart a new transaction and resubmit the folios.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now, we reserve journal credits for converting extents in only one page
to written state when the I/O operation is complete. This is
insufficient when large folio is enabled.
Fix this by reserving credits for converting up to one extent per block in
the largest 2MB folio, this calculation should only involve extents index
and leaf blocks, so it should not estimate too many credits.
Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After mpage_map_and_submit_extent() supports restarting handle if
credits are insufficient during allocating blocks, it is more likely to
exit the current mapping iteration and continue to process the current
processing partially mapped folio again. The existing tracepoints are
not sufficient to track this situation, so enhance the tracepoints to
track the writeback position and the return value before and after
submitting the folios.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.
This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
when delalloc is not enabled.
Fix this by ensuring that there are sufficient journal credits before
allocating an extent in mpage_map_one_extent() and
ext4_block_write_begin(). If there are not enough credits, return
-EAGAIN, exit the current mapping loop, restart a new handle and a new
transaction, and allocating blocks on this folio again in the next
iteration.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The block allocation process and error handling in ext4_page_mkwrite()
is complex now. Refactor it by introducing a new helper function,
ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to
allocate blocks instead of directly calling block_page_mkwrite().
Preparing to implement retry logic in a subsequent patch to address
situations where the reserved journal credits are insufficient.
Additionally, this modification will help prevent potential deadlocks
that may occur when waiting for folio writeback while holding the
transaction handle.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
During the process of writing back folios, if
mpage_map_and_submit_extent() exits the extent mapping loop due to an
ENOSPC or ENOMEM error, it may result in stale data or filesystem
inconsistency in environments where the block size is smaller than the
folio size.
When mapping a discontinuous folio in mpage_map_and_submit_extent(),
some buffers may have already be mapped. If we exit the mapping loop
prematurely, the folio data within the mapped range will not be written
back, and the file's disk size will not be updated. Once the transaction
that includes this range of extents is committed, this can lead to stale
data or filesystem inconsistency.
Fix this by submitting the current processing partially mapped folio.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
mpage_folio_done() should be a more appropriate place than
mpage_submit_folio() for updating the wbc->nr_to_write after we have
submitted a fully mapped folio. Preparing to make mpage_submit_folio()
allows to submit partially mapped folio that is still under processing.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since ext4 supports large folios, processing writebacks in pages is no
longer appropriate, it can be modified to process writebacks in bytes.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|