We periodically check the available rt blocks when filling up zones
and start GC if needed, but we may run completely out in between
filling zones, so start GC (unless already running) if we can't reserve
writable space.
This should only happen as a corner case in setups with very few
backing zones.
Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
This reverts commit 94c821fb286b545d37549ff30a0c341e066f0d6c.
There are reports of potential corruption in the node footer; the most
suspicious feature is nat_bits, so let's revert the recovery-related
code.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
syzbot reports an f2fs bug as below:
F2FS-fs (loop3): Stopped filesystem due to reason: 7
kworker/u8:7: attempt to access beyond end of device
BUG: unable to handle page fault for address: ffffed1604ea3dfa
RIP: 0010:get_ckpt_valid_blocks fs/f2fs/segment.h:361 [inline]
RIP: 0010:has_curseg_enough_space fs/f2fs/segment.h:570 [inline]
RIP: 0010:__get_secs_required fs/f2fs/segment.h:620 [inline]
RIP: 0010:has_not_enough_free_secs fs/f2fs/segment.h:633 [inline]
RIP: 0010:has_enough_free_secs+0x575/0x1660 fs/f2fs/segment.h:649
<TASK>
f2fs_is_checkpoint_ready fs/f2fs/segment.h:671 [inline]
f2fs_write_inode+0x425/0x540 fs/f2fs/inode.c:791
write_inode fs/fs-writeback.c:1525 [inline]
__writeback_single_inode+0x708/0x10d0 fs/fs-writeback.c:1745
writeback_sb_inodes+0x820/0x1360 fs/fs-writeback.c:1976
wb_writeback+0x413/0xb80 fs/fs-writeback.c:2156
wb_do_writeback fs/fs-writeback.c:2303 [inline]
wb_workfn+0x410/0x1080 fs/fs-writeback.c:2343
process_one_work kernel/workqueue.c:3236 [inline]
process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
worker_thread+0x870/0xd30 kernel/workqueue.c:3398
kthread+0x7a9/0x920 kernel/kthread.c:464
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:148
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Commit 8b10d3653735 ("f2fs: introduce FAULT_NO_SEGMENT") allows a
no-free-segment fault to be triggered in the allocator, which updates
curseg->segno to NULL_SEGNO. Although CP_ERROR_FLAG has been set,
f2fs_write_inode() misses checking the flag and accesses the invalid
curseg->segno directly in the call path below, resulting in a panic:
- f2fs_write_inode
- f2fs_is_checkpoint_ready
- has_enough_free_secs
- has_not_enough_free_secs
- __get_secs_required
- has_curseg_enough_space
- get_ckpt_valid_blocks
: access invalid curseg->segno
To avoid this issue, let's:
- check CP_ERROR_FLAG prior to calling f2fs_is_checkpoint_ready() in
f2fs_write_inode().
- in has_curseg_enough_space(), save curseg->segno into a temporary
variable and verify its validity before use (see the sketch below).
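A minimal userspace sketch of the second point (illustrative only, not
the f2fs code; INVALID_SEGNO and seg_blocks[] are made-up stand-ins for
NULL_SEGNO and the per-segment accounting):

#include <stdio.h>

#define INVALID_SEGNO (~0U)

static unsigned int seg_blocks[8];       /* stand-in for per-segment block counts */
static volatile unsigned int cur_segno;  /* may become INVALID_SEGNO concurrently */

static unsigned int ckpt_valid_blocks(void)
{
    unsigned int segno = cur_segno;  /* take a single snapshot ... */

    if (segno == INVALID_SEGNO)      /* ... and validate it before indexing */
        return 0;
    return seg_blocks[segno];
}

int main(void)
{
    cur_segno = INVALID_SEGNO;
    printf("%u\n", ckpt_valid_blocks());  /* prints 0 instead of faulting */
    return 0;
}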
Fixes: 8b10d3653735 ("f2fs: introduce FAULT_NO_SEGMENT")
Reported-by: syzbot+b6b347b7a4ea1b2e29b6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67973c2b.050a0220.11b1bb.0089.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
To simulate an inconsistent node footer error.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces a new wrapper, f2fs_get_xnode_page(), which
callers can use to load an xattr block into the page cache; it also
performs a sanity check on the xattr node footer.
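A rough sketch of the wrapper's shape (illustrative only;
is_xattr_node_footer_ok() is a made-up name for the sanity check):

struct page *f2fs_get_xnode_page(struct f2fs_sb_info *sbi, nid_t xnid)
{
    struct page *page = f2fs_get_node_page(sbi, xnid);

    if (IS_ERR(page))
        return page;
    /* made-up helper standing in for the xattr node footer check */
    if (!is_xattr_node_footer_ok(page)) {
        f2fs_put_page(page, 1);
        return ERR_PTR(-EFSCORRUPTED);
    }
    return page;
}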
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces a new wrapper, f2fs_get_inode_page(), which
callers can use to load an inode block into the page cache; it also
performs a sanity check on the inode footer.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
ksmbd_work could be freed after the connection has been released.
Increment r_count of ksmbd_conn to indicate that requests are not
finished yet and to prevent the connection from being released.
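A generic, self-contained sketch of the idea (not the ksmbd code): bump
an in-flight counter before handling a request and only allow the
connection to be released once it drains back to zero.

#include <stdatomic.h>
#include <stdbool.h>

struct conn {
    atomic_int r_count;   /* requests still being processed */
    bool terminated;
};

static void request_start(struct conn *c)
{
    atomic_fetch_add(&c->r_count, 1);   /* keeps the connection alive */
}

static bool request_finish(struct conn *c)
{
    /* true once the last in-flight request has drained; only then
     * may the connection actually be released */
    return atomic_fetch_sub(&c->r_count, 1) == 1;
}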
Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
->interim_entry of ksmbd_work could be deleted after the oplock is
freed. We don't need to manage it with a linked list. The interim
request can be sent immediately whenever an oplock break wait is needed.
Cc: stable@vger.kernel.org
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Switch from bio_for_each_segment_all() to bio_for_each_folio_all()
which removes a call to page_buffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
gfs2_end_log_write() has to handle bios which consist of both pages
which belong to folios and pages which were allocated from a mempool and
do not belong to a folio. It would be cleaner to have separate endio
handlers which handle each type, but it's not clear to me whether that's
even possible.
This patch is slightly forward-looking in that page_folio() cannot
currently return NULL, but it will return NULL in the future for pages
which do not belong to a folio.
This was the last user of page_has_buffers(), so remove it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Remove a call to grab_cache_page() by using a folio throughout
this function.
[agruenba@redhat.com: Adjust to return value difference between
bio_add_page() and bio_add_folio().]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Pass in the folio instead of the page. Add an assert that this is
not a large folio as we'd need a more complex solution if we wanted to
kmap() each page out of a large folio. Removes a use of folio->page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[agruenba@redhat.com: Rename gfs2_jhead_folio_srch() to gfs2_jhead_folio_search().]
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
We are preparing to remove bh->b_page. Use kmap_local_folio() instead
of kmap_local_page().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Remove a reference to bh->b_page which is going to be removed soon.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The lock bit is maintained on the folio, not on the page. This saves
two calls to compound_head() and removes two references to bh->b_page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
We are preparing to remove bh->b_page. gfs2_log_write() should continue
to operate on pages as some of the memory being logged does not come
from folios, so convert from folio to page in this function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_evict_inode(), in the unlikely case that we cannot defer
deleting the inode, it is not safe to fall back to deleting the inode;
the only valid choice we have is to skip the delete.
In addition, in evict_should_delete(), if we cannot lock the inode glock
exclusively, we are in a bad enough state that skipping the delete is
likely a better choice than trying to recover from the failure later.
Fixes: c5b7a2400edc ("gfs2: Only defer deletes when we have an iopen glock")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In glock_set_object() and glock_clear_object(), there is no need to
print the glock type and number when we dump the entire glock anyway.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In evict_should_delete(), when gfs2_upgrade_iopen_glock() fails, we
detach the iopen glock from the inode without calling
glock_clear_object(). This leads to a warning in glock_set_object()
when the same inode is recreated and the glock is reused.
Fix that by only detaching the iopen glock in gfs2_evict_inode().
In addition, remove the dequeue code from evict_should_delete(); we
already perform a conditional dequeue in gfs2_evict_inode().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_try_evict(), we try grabbing the inode to evict, we try to evict
it, and then we try grabbing it again to see if it still exists. There
is no guarantee that we will end up with the same inode both times; the
inode validity check that commit ffd1cf0443a2 ("gfs2: Prevent inode
creation race") added to the first grab is actually needed both times.
(To avoid code duplication, add a grab_existing_inode() helper.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In gfs2_glock_dq(), we must drop the glock spin lock before calling
->lm_cancel, but this means that in the meantime, the operation we are
trying to cancel could complete. If the operation completes
unsuccessfully, another holder can end up at the head of the queue and
another ->lm_lock operation can get started. In this case, we would end
up canceling that second operation by accident.
To prevent that, introduce a new GLF_CANCELING flag. Set that flag in
gfs2_glock_dq() when trying to cancel an operation. When seeing that
flag, finish_xmote() will then keep the GLF_LOCK flag set to prevent
other glock operations from taking place. gfs2_glock_dq() then
completes the cancelation attempt by clearing GLF_LOCK and
GLF_CANCELING.
In addition, add a missing GLF_DEMOTE_IN_PROGRESS check in
gfs2_glock_dq() to make sure that we won't accidentally cancel a demote
request.
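A self-contained sketch of the two-flag protocol (stand-ins only, not
the gfs2 code): the dequeuer marks the glock as canceling before
dropping its spinlock, and the completion path refuses to start a new
request while that mark is set, so the cancel can never hit an
unrelated, newer request.

#include <pthread.h>
#include <stdbool.h>

struct glock_like {
    pthread_mutex_t lock;
    bool busy;        /* stand-in for GLF_LOCK      */
    bool canceling;   /* stand-in for GLF_CANCELING */
};

static bool try_start_next_request(struct glock_like *gl)
{
    bool started = false;

    pthread_mutex_lock(&gl->lock);
    if (!gl->canceling && !gl->busy) {  /* refuse while a cancel is pending */
        gl->busy = true;
        started = true;
    }
    pthread_mutex_unlock(&gl->lock);
    return started;
}

static void cancel_current_request(struct glock_like *gl)
{
    pthread_mutex_lock(&gl->lock);
    gl->canceling = true;
    pthread_mutex_unlock(&gl->lock);

    /* ->lm_cancel() would be issued here, without the lock held */

    pthread_mutex_lock(&gl->lock);
    gl->busy = false;       /* finish the cancelation attempt */
    gl->canceling = false;
    pthread_mutex_unlock(&gl->lock);
}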
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In finish_xmote(), when a locking request is canceled, the corresponding
holder is moved to the tail of the holders list instead of being
dequeued immediately. When there is only a single holder, the canceled
locking request is then immediately repeated. This makes no sense; it
looks like another remnant of LM_FLAG_PRIORITY support.
Instead, dequeue canceled holders and proceed with the next holder in
finish_xmote(). We can then easily detect in gfs2_glock_dq() when a
holder has been canceled.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In run_queue(), check if the queue of pending requests is empty instead
of blindly assuming that it won't be.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Remove some more dead code in add_to_queue() that commit 0b93bac2271e
("gfs2: Remove LM_FLAG_PRIORITY flag") has rendered obsolete. This is a
continuation of commit 3302764610057 ("gfs2: remove dead code in
add_to_queue"); no functional change.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Having this flag attached to the iopen glock instead of the inode is
much simpler; it eliminates a potential weird race in gfs2_try_evict().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Glocks are always actively acquired by processes, but as indicated by
the GL_NOPID holder flag, some of them are then associated with objects
like cached inodes rather than the process that acquired them. As such,
for those glock holders, it makes little sense to dump which processes
originally acquired them.
Therefore, gfs2 is trying to hide the identity of the processes that
acquired those glocks. The code for doing that is incorrect though, so
fix it.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Introduce a new GLF_PENDING_REPLY flag to indicate that a reply from DLM
is expected. Include that flag in glock dumps to show more clearly
what's going on. (When the GLF_PENDING_REPLY flag is set, the GLF_LOCK
flag will also be set but the GLF_LOCK flag alone isn't sufficient to
tell that we are waiting for a DLM reply.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Add a number of glock flags that are currently not shown in the text
form of glock tracepoints.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
We need the driver core fix in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add three trace points for the different backing memory allocators for
buffers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Directly assign b_addr based on the tmpfs folios without a detour
through pages, reuse the folio_put path used for non-tmpfs buffers
and replace all references to pages in comments with folios.
Partially based on a patch from Dave Chinner <dchinner@redhat.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The fallback buffer allocation path currently open codes a suboptimal
version of vmalloc to allocate pages that are then mapped into
vmalloc space. Switch to using vmalloc instead, which uses all the
optimizations in the common vmalloc code, and removes the need to
track the backing pages in the xfs_buf structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Unmapped buffers don't exist anymore, so the page straddling
detection and slow path code can go away now.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Now that we have the buffer cache using the folio API, we can extend
the use of folios to allocate high order folios for multi-page
buffers rather than an array of single pages that are then vmapped
into a contiguous range.
This creates a new type of single folio buffers that can have arbitrary
order in addition to the existing multi-folio buffers made up of many
single page folios that get vmapped. The single folio is for now
stashed into the existing b_pages array, but that will go away entirely
later in the series and remove the temporary mixing of page and folio
types that only works because the two structures can currently be used
largely interchangeably.
The code that allocates buffers will optimistically attempt a high
order folio allocation as a fast path if the buffer size is a power
of two and thus fits into a folio. If this high order allocation
fails, then we fall back to the existing multi-folio allocation
code. This now forms the slow allocation path, which hopefully will be
largely unused in normal conditions, except for buffers whose size is
not a power of two, such as larger remote xattrs.
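Roughly, the fast path looks like this (an illustrative sketch, not the
actual xfs_buf code; the caller falls back to the existing multi-folio
plus vmap path on failure):

static int xfs_buf_alloc_folio_fast(struct xfs_buf *bp, size_t size, gfp_t gfp)
{
    struct folio *folio;

    if (!is_power_of_2(size))
        return -ENOMEM;              /* not folio-sized, use the slow path */

    folio = folio_alloc(gfp, get_order(size));
    if (!folio)
        return -ENOMEM;              /* high order allocation failed */

    bp->b_addr = folio_address(folio);   /* contiguous, no vmap needed */
    return 0;
}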
This should improve performance of large buffer operations (e.g.
large directory block sizes) as we should now mostly avoid the
expense of vmapping large buffers (and the vmap lock contention that
can occur) as well as avoid the runtime pressure that frequently
accessing kernel vmapped pages puts on the TLBs.
Based on a patch from Dave Chinner <dchinner@redhat.com>, but mutilated
beyond recognition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment
for kmalloc(power-of-two)"), kmalloc and friends guarantee that power of
two sized allocations are naturally aligned. Limit our use of kmalloc
for buffers to these power of two sizes and remove the fallback to
the page allocator for this case, but keep a check in addition to
trusting the slab allocator to get the alignment right.
Also refactor the kmalloc path to reuse various calculations for the
size and gfp flags.
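As a sketch of the resulting kmalloc path (not the actual patch; the
helper name is made up):

static void *buf_alloc_kmem(size_t size, gfp_t gfp)
{
    void *addr;

    if (!is_power_of_2(size))
        return NULL;        /* let the caller use the folio path instead */

    addr = kmalloc(size, gfp);
    /* power-of-two kmalloc is naturally aligned since 59bb47985c1d, but
     * keep a cheap check rather than trusting the slab allocator blindly */
    if (addr && WARN_ON_ONCE(!IS_ALIGNED((unsigned long)addr, size))) {
        kfree(addr);
        return NULL;
    }
    return addr;
}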
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Lift handling of shmem and slab backed buffers into xfs_buf_alloc_pages
and rename the result to xfs_buf_alloc_backing_mem. This shares more
code and ensures uncached buffers can also use slab, which slightly
reduces the memory usage of growfs on 512 byte sector size file systems,
but more importantly means the allocation invariants are the same for
cached and uncached buffers. Document these new invariants with a big
fat comment mostly stolen from a patch by Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
No need to look at the page count if we can simply call is_vmalloc_addr
on bp->b_addr. This prepares for eventually removing the b_page_count
field.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
b_offset is only set for slab backed buffers and always set to
offset_in_page(bp->b_addr), which can be done just as easily in the only
user of b_offset.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
No need to walk the page list if bp->b_addr is valid. That also means
b_offset doesn't need to be taken into account in the unmapped loop as
b_offset is only set for kmem backed buffers which are always mapped.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
We never log large contiguous regions of unmapped buffers, so this
bug is never triggered by the current code. However, the slowpath
for formatting buffer straddling regions is broken.
That is, the size and shape of the log vector calculated across a
straddle does not match how the formatting code formats a straddle.
This results in a log vector with an uninitialised iovec and this
causes a crash when xlog_write_full() goes to copy the iovec into
the journal.
Whilst touching this code, don't bother checking mapped or single
folio buffers for discontiguous regions because they don't have
them. This significantly reduces the overhead of this check when
logging large buffers as calling xfs_buf_offset() is not free and
it occurs a *lot* in those cases.
Fixes: 929f8b0deb83 ("xfs: optimise xfs_buf_item_size/format for contiguous regions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Bound nsm_local_state sysctl writes between SYSCTL_ZERO
and SYSCTL_INT_MAX.
The proc_handler has thus been updated to proc_dointvec_minmax.
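The bounded entry would look roughly like this (a sketch of the pattern
described above; the bracketed note below records that the merged
version was adjusted to a zero - UINT_MAX range):

{
    .procname     = "nsm_local_state",
    .data         = &nsm_local_state,
    .maxlen       = sizeof(int),
    .mode         = 0644,
    .proc_handler = proc_dointvec_minmax,
    .extra1       = SYSCTL_ZERO,     /* lower bound */
    .extra2       = SYSCTL_INT_MAX,  /* upper bound */
},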
Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr>
[ cel: updated to handle zero - UINT_MAX instead ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If there are no courtesy clients then the return value from the
atomic_long_read() could overflow an int. Use a long to store the value
instead.
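A tiny self-contained illustration of the truncation being avoided
(generic C, assumes LP64 where long is 64 bits, as on 64-bit kernels):

#include <stdio.h>

int main(void)
{
    long counter = 3000000000L;   /* e.g. a value read via atomic_long_read() */

    int  as_int  = counter;       /* does not fit in a 32-bit int */
    long as_long = counter;       /* preserves the value */

    printf("%d vs %ld\n", as_int, as_long);
    return 0;
}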
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
idr_alloc_cyclic() is what guarantees that now, not this long-gone trick.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This isn't needed.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Move dl_type field above dl_time, which shaves 8 bytes off this struct.
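A generic illustration of why such a move can shave 8 bytes (made-up
struct, not the actual nfsd one): placing a 4-byte field next to another
4-byte member removes two padding holes.

#include <stdio.h>
#include <time.h>

struct before {               /* 32 bytes on x86-64 */
    void     *stid;           /* 8 bytes */
    unsigned  type;           /* 4 bytes + 4 bytes padding */
    time_t    time;           /* 8 bytes */
    unsigned  flags;          /* 4 bytes + 4 bytes tail padding */
};

struct after {                /* 24 bytes: type and flags share one slot */
    void     *stid;
    unsigned  type;
    unsigned  flags;
    time_t    time;
};

int main(void)
{
    printf("%zu -> %zu\n", sizeof(struct before), sizeof(struct after));
    return 0;
}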
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
It's possible for rpc_call_async() to fail (mainly due to memory
allocation failure). If it does, there isn't much recourse other than to
requeue the callback and try again later.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Since there is now a cb_flags word, use a new NFSD4_CALLBACK_REQUEUE
flag in that instead of a separate boolean.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
These flags serve essentially the same purpose and get set and cleared
at the same time. Drop CB_GETATTR_BUSY and just use
NFSD4_CALLBACK_RUNNING instead.
For this to work, we must use clear_and_wake_up_bit(), but doing that
for other types of callbacks is wasteful. Declare a new NFSD4_CALLBACK_WAKE
flag in cb_flags to indicate that wake_up is needed, and only set that
for CB_GETATTRs.
Also, make the wait use a TASK_UNINTERRUPTIBLE sleep. This is done in
the context of an nfsd thread, and it should never need to deal with
signals.
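A sketch of the resulting completion-side logic (illustrative; the flag
names come from the description above, the surrounding code is made up):

/* waiter (CB_GETATTR only), running in nfsd thread context: */
wait_on_bit(&cb->cb_flags, NFSD4_CALLBACK_RUNNING, TASK_UNINTERRUPTIBLE);

/* completion path: only callbacks that set NFSD4_CALLBACK_WAKE pay for
 * the wake-up */
if (test_bit(NFSD4_CALLBACK_WAKE, &cb->cb_flags))
    clear_and_wake_up_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags);
else
    clear_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags);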
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
deleg_reaper() will walk the client_lru list and put any suitable
entries onto "cblist" using the cl_ra_cblist pointer. It then walks the
objects outside the spinlock and queues callbacks for them.
None of the operations that deleg_reaper() does outside the
nn->client_lock are blocking operations. Just queue their workqueue jobs
under the nn->client_lock instead.
Also, the NFSD4_CLIENT_CB_RECALL_ANY and NFSD4_CALLBACK_RUNNING flags
serve an identical purpose now. Drop the NFSD4_CLIENT_CB_RECALL_ANY flag
and just use the one in the callback.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|