Age | Commit message (Collapse) | Author |
|
This fixes a bug where rebalance would loop repeatedly on the same
extents.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A reservation goes through a 3 step lifetime:
- generated during delalloc
- released/counted by ordered_extent allocation
- freed by running delayed ref
That third step depends on must_insert_reserved on the head ref, so the
head ref with that field set owns the reservation. Once you prepare to
run the head ref, must_insert_reserved is unset, which means that
running the ref must free the reservation, whether or not it succeeds,
or else the reservation is leaked. That results in either a risk of
spurious ENOSPC if the fs stays writeable or a warning on unmount if it
is readonly.
The existing squota code was aware of these invariants, but missed a few
cases. Improve it by adding a helper function to use in the cleanup
paths and call it from the existing early returns in running delayed
refs. This also simplifies btrfs_record_squota_delta and struct
btrfs_quota_delta.
This fixes (or at least improves the reliability of) generic/475 with
"mkfs -O squota". On my machine, that test failed ~4/10 times without
this patch and passed 100/100 times with it.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
duplicate reservations. That is two writes to that range in one
transaction shouldn't create two reservations, as the reservation will
only be freed once when the write finally goes down. Therefore, it is
never OK to clear that bit without freeing the associated qgroup
reserve. At this point, we don't want to be freeing the reserve, so mask
off the bit.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we abort a transaction, we never run the code that frees the pertrans
qgroup reservation. This results in warnings on unmount as that
reservation has been leaked. The leak isn't a huge issue since the fs is
read-only, but it's better to clean it up when we know we can/should. Do
it during the cleanup_transaction step of aborting.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.
Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.
Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.
Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
An ordered extent completing is a critical moment in qgroup reserve
handling, as the ownership of the reservation is handed off from the
ordered extent to the delayed ref. In the happy path we release (unlock)
but do not free (decrement counter) the reservation, and the delayed ref
drives the free. However, on an error, we don't create a delayed ref,
since there is no ref to add. Therefore, free on the error path.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Reconnect worker currently assumes that the server struct
is alive and only takes reference on the server if it needs
to call smb2_reconnect.
With the new ability to disable channels based on whether the
server has multichannel disabled, this becomes a problem when
we need to disable established channels. While disabling the
channels and deallocating the server, there could be reconnect
work that could not be cancelled (because it started).
This change forces the reconnect worker to unconditionally
take a reference on the server when it runs.
Also, this change now allows smb2_reconnect to know if it was
called by the reconnect worker. Based on this, the cifs_put_tcp_session
can decide whether it can cancel the reconnect work synchronously or not.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This reverts commit 19a4b9d6c372cab6a3b2c9a061a236136fe95274.
This earlier commit was making an assumption that each mod_delayed_work
called for the reconnect work would result in smb2_reconnect_server
being called twice. This assumption turns out to be untrue. So reverting
this change for now.
I will submit a follow-up patch to fix the actual problem in a different
way.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The internal freeze mechanism in bcachefs mostly reuses the generic
rw<->ro transition code. If the fs happens to shutdown during or
after freeze, a transition back to rw can fail. This is expected,
but returning an error from the unfreeze callout prevents the
filesystem from being unfrozen.
Skip the read write transition if the fs is shutdown. This allows
the fs to unfreeze at the vfs level so writes will no longer block,
but will still fail due to the emergency read-only state of the fs.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When creating a snapshot without specifying the source subvolume, we use
the subvolume containing the new snapshot.
Previously, this worked if the directory containing the new snapshot was
the subvolume root - but we were using the incorrect helper, and got a
subvolume ID of 0 when the parent directory wasn't the root of the
subvolume, causing an emergency read-only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Deduplication isn't supported on cifs, but cifs doesn't reject it, instead
treating it as extent duplication/cloning. This can cause generic/304 to go
silly and run for hours on end.
Fix cifs to indicate EOPNOTSUPP if REMAP_FILE_DEDUP is set in
->remap_file_range().
Note that it's unclear whether or not commit b073a08016a1 is meant to cause
cifs to return an error if REMAP_FILE_DEDUP.
Fixes: b073a08016a1 ("cifs: fix that return -EINVAL when do dedupe operation")
Cc: stable@vger.kernel.org
Suggested-by: Dave Chinner <david@fromorbit.com>
cc: Xiaoli Feng <fengxiaoli0714@gmail.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Darrick Wong <darrick.wong@oracle.com>
cc: fstests@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/3876191.1701555260@warthog.procyon.org.uk/
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If server returned no data for FSCTL_DFS_GET_REFERRALS, @dfs_rsp will
remain NULL and then parse_dfs_referrals() will dereference it.
Fix this by returning -EIO when no output data is returned.
Besides, we can't fix it in SMB2_ioctl() as some FSCTLs are allowed to
return no data as per MS-SMB2 2.2.32.
Fixes: 9d49640a21bf ("CIFS: implement get_dfs_refer for SMB2+")
Cc: stable@vger.kernel.org
Reported-by: Robert Morris <rtm@csail.mit.edu>
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
RWH(Read + Write + Handle) caching state is not supported for directory.
ksmbd downgrade it to RH for directory if client send RWH caching lease
state.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Set SMB2_GLOBAL_CAP_DIRECTORY_LEASING to ->capabilities to inform server
support directory lease to client.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
To support v2 lease(directory lease), ksmbd set epoch in create context
v2 lease response.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In smb2_lock(), if setup_async_work() executes successfully,
work->cancel_argv will bind the argv that generated by kmalloc(). And
release_async_work() is called in ksmbd_conn_try_dequeue_request() or
smb2_lock() to release argv.
However, when setup_async_work function fails, work->cancel_argv has not
been bound to the argv, resulting in the previously allocated argv not
being released. Call kfree() to fix it.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In order to limit # of open zones, let's finish or reset zones given # of
valid blocks per section and its zone condition.
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
nfsd/ksmbd call vfs_copy_file_range() with flag COPY_FILE_SPLICE to
perform kernel copy between two files on any two filesystems.
Splicing input file, while holding file_start_write() on the output file
which is on a different sb, posses a risk for fanotify related deadlocks.
We only need to call splice_file_range() from within the context of
->copy_file_range() filesystem methods with file_start_write() held.
To avoid the possible deadlocks, always use do_splice_direct() instead of
splice_file_range() for the kernel copy fallback in vfs_copy_file_range()
without holding file_start_write().
Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-4-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With the removal of the 'iov' argument to import_single_range(), the two
functions are now fully identical. Convert the import_single_range()
callers to import_ubuf(), and remove the former fully.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It is entirely unused, just get rid of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Even if f2fs was rebooted as staying checkpoint=disable, let's match the write
pointers all the time.
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
1. do roll forward recovery
2. update current segments pointers
3. fix the entire zones' write pointers
4. do checkpoint
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If fsck can allocate a new zone, it'd be better to use that instead of
allocating a new one.
And, it modifies kernel messages.
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This fixes a transaction path overflow reported in the snapshot deletion
path, when moving extents to the correct snapshot.
The root of the issue is that creating/deleting a reflink pointer can
generate an unbounded number of updates, if it is allowed to reference
an unbounded number of indirect extents; to prevent this, merging of
reflink pointers has been disabled.
But there's a hole, which is that copygc/rebalance may fragment existing
extents in the course of moving them around, and if an indirect extent
becomes too fragmented we'll then become unable to delete the reflink
pointer.
The eventual solution is going to be to tweak trigger handling so that
we can process large reflink pointers incrementally when necessary, and
notice that trigger updates don't need to be run for the part of the
reflink pointer not changing. That is going to be a bigger project
though, for another patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
for_each_btree_key2() runs each loop iteration in a btree transaction,
and thus does not cause SRCU lock hold time problems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Also, make bch2_extent_drop_ptrs() safer, so it works with extents and
non-extents iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Recently, journal pre-reservations were removed. They were for reserving
space ahead of time in the journal for operations that are required for
journal reclaim, e.g. btree key cache flushing and interior node btree
updates.
Instead we have watermarks - only operations for journal reclaim are
allowed when the journal is low on space, and in general we're quite
good about doing operations in the order that will free up space in the
journal quickest when we're low on space. If we're doing a journal
reclaim operation out of order, we usually do it in nonblocking mode if
it's not freeing up space at the end of the journal.
There's an exceptino though - interior btree node update operations have
to be BCH_WATERMARK_reclaim - once they've been started, and they can't
be nonblocking. Generally this is fine because they'll only be a very
small fraction of transaction commits - but there's an exception, which
is during journal replay.
Journal replay does many btree operations, but doesn't need to commit
them to the journal since they're already in the journal. So killing off
of pre-reservation, plus another change to make journal replay more
efficient by initially doing the replay in sorted btree order, made it
possible for the interior update operations replay generates to fill and
deadlock the journal.
Fix this by introducing a new check on journal space at the _start_ of
an interior update operation. This causes us to block if necessary in
exactly the same way as we used to when interior updates took a journal
pre-reservaiton, but without all the expensive accounting
pre-reservations required.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The keys being replayed by journal replay have to be synchronized with
updates by other threads that overwrite them. We rely on btree node
locks for synchronizing - but since btree write buffer updates take no
btree locks, that won't work.
Instead, simply disable using the btree write buffer until journal
replay is finished.
This fixes a rare backpointers error in the merge_torture_flakey test.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a number of issues in the cifs filesystem implementation of the FICLONE
ioctl in cifs_remap_file_range(). This is analogous to the previously
fixed bug in cifs_file_copychunk_range() and can share the helper
functions.
Firstly, the invalidation of the destination range is handled incorrectly:
We shouldn't just invalidate the whole file as dirty data in the file may
get lost and we can't just call truncate_inode_pages_range() to invalidate
the destination range as that will erase parts of a partial folio at each
end whilst invalidating and discarding all the folios in the middle. We
need to force all the folios covering the range to be reloaded, but we
mustn't lose dirty data in them that's not in the destination range.
Further, we shouldn't simply round out the range to PAGE_SIZE at each end
as cifs should move to support multipage folios.
Secondly, there's an issue whereby a write may have extended the file
locally, but not have been written back yet. This can leaves the local
idea of the EOF at a later point than the server's EOF. If a clone request
is issued, this will fail on the server with STATUS_INVALID_VIEW_SIZE
(which gets translated to -EIO locally) if the clone source extends past
the server's EOF.
Fix this by:
(0) Flush the source region (already done). The flush does nothing and
the EOF isn't moved if the source region has no dirty data.
(1) Move the EOF to the end of the source region if it isn't already at
least at this point. If we can't do this, for instance if the server
doesn't support it, just flush the entire source file.
(2) Find the folio (if present) at each end of the range, flushing it and
increasing the region-to-be-invalidated to cover those in their
entirety.
(3) Fully discard all the folios covering the range as we want them to be
reloaded.
(4) Then perform the extent duplication.
Thirdly, set i_size after doing the duplicate_extents operation as this
value may be used by various things internally. stat() hides the issue
because setting ->time to 0 causes cifs_getatr() to revalidate the
attributes.
These were causing the cifs/001 xfstest to fail.
Fixes: 04b38d601239 ("vfs: pull btrfs clone API to vfs layer")
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
cc: Christoph Hellwig <hch@lst.de>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix a number of issues in the cifs filesystem implementation of the
copy_file_range() syscall in cifs_file_copychunk_range().
Firstly, the invalidation of the destination range is handled incorrectly:
We shouldn't just invalidate the whole file as dirty data in the file may
get lost and we can't just call truncate_inode_pages_range() to invalidate
the destination range as that will erase parts of a partial folio at each
end whilst invalidating and discarding all the folios in the middle. We
need to force all the folios covering the range to be reloaded, but we
mustn't lose dirty data in them that's not in the destination range.
Further, we shouldn't simply round out the range to PAGE_SIZE at each end
as cifs should move to support multipage folios.
Secondly, there's an issue whereby a write may have extended the file
locally, but not have been written back yet. This can leaves the local
idea of the EOF at a later point than the server's EOF. If a copy request
is issued, this will fail on the server with STATUS_INVALID_VIEW_SIZE
(which gets translated to -EIO locally) if the copy source extends past the
server's EOF.
Fix this by:
(0) Flush the source region (already done). The flush does nothing and
the EOF isn't moved if the source region has no dirty data.
(1) Move the EOF to the end of the source region if it isn't already at
least at this point. If we can't do this, for instance if the server
doesn't support it, just flush the entire source file.
(2) Find the folio (if present) at each end of the range, flushing it and
increasing the region-to-be-invalidated to cover those in their
entirety.
(3) Fully discard all the folios covering the range as we want them to be
reloaded.
(4) Then perform the copy.
Thirdly, set i_size after doing the copychunk_range operation as this value
may be used by various things internally. stat() hides the issue because
setting ->time to 0 causes cifs_getatr() to revalidate the attributes.
These were causing the generic/075 xfstest to fail.
Fixes: 620d8745b35d ("Introduce cifs_copy_file_range()")
Cc: stable@vger.kernel.org
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The new fuse init flag FUSE_DIRECT_IO_ALLOW_MMAP breaks assumptions made by
FOPEN_PARALLEL_DIRECT_WRITES and causes test generic/095 to hit
BUG_ON(fi->writectr < 0) assertions in fuse_set_nowrite():
generic/095 5s ...
kernel BUG at fs/fuse/dir.c:1756!
...
? fuse_set_nowrite+0x3d/0xdd
? do_raw_spin_unlock+0x88/0x8f
? _raw_spin_unlock+0x2d/0x43
? fuse_range_is_writeback+0x71/0x84
fuse_sync_writes+0xf/0x19
fuse_direct_io+0x167/0x5bd
fuse_direct_write_iter+0xf0/0x146
Auto disable FOPEN_PARALLEL_DIRECT_WRITES when server negotiated
FUSE_DIRECT_IO_ALLOW_MMAP.
Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_dax_conn_free() will be called when fuse_fill_super_common() fails
after fuse_dax_conn_alloc(). Then deactivate_locked_super() in
virtio_fs_get_tree() will call virtio_kill_sb() to release the discarded
superblock. This will call fuse_dax_conn_free() again in fuse_conn_put(),
resulting in a possible double free.
Fixes: 1dd539577c42 ("virtiofs: add a mount option to enable dax")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: <stable@vger.kernel.org> # v5.10
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Fuse submounts do not perform a lookup for the nodeid that they inherit
from their parent. Instead, the code decrements the nlookup on the
submount's fuse_inode when it is instantiated, and no forget is
performed when a submount root is evicted.
Trouble arises when the submount's parent is evicted despite the
submount itself being in use. In this author's case, the submount was
in a container and deatched from the initial mount namespace via a
MNT_DEATCH operation. When memory pressure triggered the shrinker, the
inode from the parent was evicted, which triggered enough forgets to
render the submount's nodeid invalid.
Since submounts should still function, even if their parent goes away,
solve this problem by sharing refcounted state between the parent and
its submount. When all of the references on this shared state reach
zero, it's safe to forget the final lookup of the fuse nodeid.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Cc: stable@vger.kernel.org
Fixes: 1866d779d5d2 ("fuse: Allow fuse_fill_super_common() for submounts")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Although DIRECT_IO_RELAX's initial usage is to allow shared mmap, its
description indicates a purpose of reducing memory footprint. This
may imply that it could be further used to relax other DIRECT_IO
operations in the future.
Replace it with a flag DIRECT_IO_ALLOW_MMAP which does only one thing,
allow shared mmap of DIRECT_IO files while still bypassing the cache
on regular reads and writes.
[Miklos] Also Keep DIRECT_IO_RELAX definition for backward compatibility.
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This reverts commit f4acfcd4deb1 ("debugfs: annotate debugfs handlers
vs. removal with lockdep"), it appears to have false positives and
really shouldn't have been in the -rc series with the fixes anyway.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20231202114936.fd55431ab160.I911aa53abeeca138126f690d383a89b13eb05667@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
There's no need to drop journal pins in our exit paths - the code was
trying to have everything cleaned up on any shutdown, but better to just
tweak the assertions a bit.
This fixes a bug where calling into journal reclaim in the exit path
would cass a null ptr deref.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull smb client fixes from Steve French:
- Two fallocate fixes
- Fix warnings from new gcc
- Two symlink fixes
* tag 'v6.7-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client, common: fix fortify warnings
cifs: Fix FALLOC_FL_INSERT_RANGE by setting i_size after EOF moved
cifs: Fix FALLOC_FL_ZERO_RANGE by setting i_size if EOF moved
smb: client: report correct st_size for SMB and NFS symlinks
smb: client: fix missing mode bits for SMB symlinks
|
|
fsverity provides fast and reliable hash of files, namely fsverity_digest.
The digest can be used by security solutions to verify file contents.
Add new kfunc bpf_get_fsverity_digest() so that we can access fsverity from
BPF LSM programs. This kfunc is added to fs/verity/measure.c because some
data structure used in the function is private to fsverity
(fs/verity/fsverity_private.h).
To avoid recursion, bpf_get_fsverity_digest is only allowed in BPF LSM
programs.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20231129234417.856536-3-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 fix from Jan Kara:
"Fix an ext2 bug introduced by changes in ext2 & iomap stepping on each
other toes (apparently ext2 driver does not get much testing in
linux-next)"
* tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: Fix ki_pos update for DIO buffered-io fallback case
|
|
Pull more bcachefs bugfixes from Kent Overstreet:
- bcache & bcachefs were broken with CFI enabled; patch for closures to
fix type punning
- mark erasure coding as extra-experimental; there are incompatible
disk space accounting changes coming for erasure coding, and I'm
still seeing checksum errors in some tests
- several fixes for durability-related issues (durability is a device
specific setting where we can tell bcachefs that data on a given
device should be counted as replicated x times)
- a fix for a rare livelock when a btree node merge then updates a
parent node that is almost full
- fix a race in the device removal path, where dropping a pointer in a
btree node to a device would be clobbered by an in flight btree write
updating the btree node key on completion
- fix one SRCU lock hold time warning in the btree gc code - ther's
still a bunch more of these to fix
- fix a rare race where we'd start copygc before initializing the "are
we rw" percpu refcount; copygc would think we were already ro and die
immediately
* tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits)
bcachefs: Extra kthread_should_stop() calls for copygc
bcachefs: Convert gc_alloc_start() to for_each_btree_key2()
bcachefs: Fix race between btree writes and metadata drop
bcachefs: move journal seq assertion
bcachefs: -EROFS doesn't count as move_extent_start_fail
bcachefs: trace_move_extent_start_fail() now includes errcode
bcachefs: Fix split_race livelock
bcachefs: Fix bucket data type for stripe buckets
bcachefs: Add missing validation for jset_entry_data_usage
bcachefs: Fix zstd compress workspace size
bcachefs: bpos is misaligned on big endian
bcachefs: Fix ec + durability calculation
bcachefs: Data update path won't accidentaly grow replicas
bcachefs: deallocate_extra_replicas()
bcachefs: Proper refcounting for journal_keys
bcachefs: preserve device path as device name
bcachefs: Fix an endianness conversion
bcachefs: Start gc, copygc, rebalance threads after initing writes ref
bcachefs: Don't stop copygc thread on device resize
bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops
...
|
|
Let's allow checkpoint=disable back for zoned block device. It's very risky
as the feature relies on fsck or runtime recovery which matches the write
pointers again if the device rebooted while disabling the checkpoint.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
While there, use struct_size() helper, instead of the open-coded
version, to calculate the size for the allocation of the whole
flexible structure, including of course, the flexible-array member.
This code was found with the help of Coccinelle, and audited and
fixed manually.
Signed-off-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/ZSVKwBmxQ1amv47E@work
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
The callers of do_splice_direct() hold file_start_write() on the output
file.
This may cause file permission hooks to be called indirectly on an
overlayfs lower layer, which is on the same filesystem of the output
file and could lead to deadlock with fanotify permission events.
To fix this potential deadlock, move file_start_write() from the callers
into the direct_splice_actor(), so file_start_write() will not be held
while splicing from the input file.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231128214258.GA2398475@perftesting/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In preparation of calling do_splice_direct() without file_start_write()
held, create a new helper splice_file_range(), to be called from context
of ->copy_file_range() methods instead of do_splice_direct().
Currently, the only difference is that splice_file_range() does not take
flags argument and that it asserts that file_start_write() is held, but
we factor out a common helper do_splice_direct_actor() that will be used
later.
Use the new helper from __ceph_copy_file_range(), that was incorrectly
passing to do_splice_direct() the copy flags argument as splice flags.
The value of copy flags in ceph is always 0, so it is a smenatic bug fix.
Move the declaration of both helpers to linux/splice.h.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-2-amir73il@gmail.com
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
So far, fanotify returns -ENODEV or -EXDEV when trying to set a mark
on a filesystem with a "weak" fsid, namely, zero fsid (e.g. fuse), or
non-uniform fsid (e.g. btrfs non-root subvol).
When group is watching inodes all from the same filesystem (or subvol),
allow adding inode marks with "weak" fsid, because there is no ambiguity
regarding which filesystem reports the event.
The first mark added to a group determines if this group is single or
multi filesystem, depending on the fsid at the path of the added mark.
If the first mark added has a "strong" fsid, marks with "weak" fsid
cannot be added and vice versa.
If the first mark added has a "weak" fsid, following marks must have
the same "weak" fsid and the same sb as the first mark.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231130165619.3386452-3-amir73il@gmail.com>
|
|
Some filesystems like fuse and nfs have zero or non-unique fsid.
We would like to avoid reporting ambiguous fsid in events, so we need
to avoid marking objects with same fsid and different sb.
To make this easier to enforce, store the fsid in the marks of the group
instead of in the shared conenctor.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20231130165619.3386452-2-amir73il@gmail.com>
|
|
The syzbot has reported that it can hit the warning in
ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
reproducer creates a race between DIO IO completion and truncate
expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
inode state where i_disksize is already updated but i_size is not
updated yet. Since we are careful when setting up DIO write and consider
it extending (and thus performing the IO synchronously with i_rwsem held
exclusively) whenever it goes past either of i_size or i_disksize, we
can use the same test during IO completion without risking entering
ext4_handle_inode_extension() without i_rwsem held. This way we make it
obvious both i_size and i_disksize are large enough when we report DIO
completion without relying on unreliable WARN_ON.
Reported-by: <syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com>
Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20231130095653.22679-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Current jbd2 only add REQ_SYNC for descriptor block, metadata log
buffer, commit buffer and superblock buffer, the submitted IO could be
throttled by writeback throttle in block layer, that could lead to
priority inversion in some cases. The log IO looks like a kind of high
priority metadata IO, so it should not be throttled by WBT like QOS
policies in block layer, let's add REQ_SYNC | REQ_IDLE to exempt from
writeback throttle, and also add REQ_META together indicates it's a
metadata IO.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231129114740.2686201-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The write_flags print in the trace of jbd2_write_superblock() is not
real, so move the modification before the trace.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231129114740.2686201-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
For files with logical blocks close to EXT_MAX_BLOCKS, the file size
predicted in ext4_mb_normalize_request() may exceed EXT_MAX_BLOCKS.
This can cause some blocks to be preallocated that will not be used.
And after [Fixes], the following issue may be triggered:
=========================================================
kernel BUG at fs/ext4/mballoc.c:4653!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 1 PID: 2357 Comm: xfs_io 6.7.0-rc2-00195-g0f5cc96c367f
Hardware name: linux,dummy-virt (DT)
pc : ext4_mb_use_inode_pa+0x148/0x208
lr : ext4_mb_use_inode_pa+0x98/0x208
Call trace:
ext4_mb_use_inode_pa+0x148/0x208
ext4_mb_new_inode_pa+0x240/0x4a8
ext4_mb_use_best_found+0x1d4/0x208
ext4_mb_try_best_found+0xc8/0x110
ext4_mb_regular_allocator+0x11c/0xf48
ext4_mb_new_blocks+0x790/0xaa8
ext4_ext_map_blocks+0x7cc/0xd20
ext4_map_blocks+0x170/0x600
ext4_iomap_begin+0x1c0/0x348
=========================================================
Here is a calculation when adjusting ac_b_ex in ext4_mb_new_inode_pa():
ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len);
if (ac->ac_o_ex.fe_logical >= ex.fe_logical)
goto adjust_bex;
The problem is that when orig_goal_end is subtracted from ac_b_ex.fe_len
it is still greater than EXT_MAX_BLOCKS, which causes ex.fe_logical to
overflow to a very small value, which ultimately triggers a BUG_ON in
ext4_mb_new_inode_pa() because pa->pa_free < len.
The last logical block of an actual write request does not exceed
EXT_MAX_BLOCKS, so in ext4_mb_normalize_request() also avoids normalizing
the last logical block to exceed EXT_MAX_BLOCKS to avoid the above issue.
The test case in [Link] can reproduce the above issue with 64k block size.
Link: https://patchwork.kernel.org/project/fstests/list/?series=804003
Cc: <stable@kernel.org> # 6.4
Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231127063313.3734294-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|