Age | Commit message (Collapse) | Author |
|
The LZMA algorithm support has been landed for more than one year since
Linux 5.16. Besides, the new XZ Utils 5.4 has been available in most
Linux distributions.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231021020137.1646959-1-hsiangkao@linux.alibaba.com
|
|
Add new mount options lowerdir+ and datadir+ that can be used to add
layers to lower layers stack one by one.
Unlike the legacy lowerdir mount option, special characters (i.e. colons
and cammas) are not unescaped with these new mount options.
The new mount options can be repeated to compose a large stack of lower
layers, but they may not be mixed with the lagacy lowerdir mount option,
because for displaying lower layers in mountinfo, we do not want to mix
escaped with unescaped lower layers path syntax.
Similar to data-only layer rules with the lowerdir mount option, the
datadir+ option must follow at least one lowerdir+ option and the
lowerdir+ option must not follow the datadir+ option.
If the legacy lowerdir mount option follows lowerdir+ and datadir+
mount options, it overrides them. Sepcifically, calling:
fsconfig(FSCONFIG_SET_STRING, "lowerdir", "", 0);
can be used to reset previously setup lower layers.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegt7VC94KkRtb1dfHG8+4OzwPBLYqhtc8=QFUxpFJE+=RQ@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
In preparation for new mount options to add lowerdirs one by one,
generalize ovl_parse_param_upperdir() into helper ovl_parse_layer()
that will be used for parsing a single lower layers.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegt7VC94KkRtb1dfHG8+4OzwPBLYqhtc8=QFUxpFJE+=RQ@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
We are about to add new mount options for adding lowerdir one by one,
but those mount options will not support escaping.
For the existing case, where lowerdir mount option is provided as a colon
separated list, store the user provided (possibly escaped) string and
display it as is when showing the lowerdir mount option.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Commit beae836e9c61 ("ovl: temporarily disable appending lowedirs")
removed the ability to append lowerdirs with syntax lowerdir=":<path>".
Remove leftover code and comments that are irrelevant with lowerdir
append mode disabled.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
An xattr whiteout (called "xwhiteout" in the code) is a reguar file of
zero size with the "overlay.whiteout" xattr set. A file like this in a
directory with the "overlay.whiteouts" xattrs set will be treated the
same way as a regular whiteout.
The "overlay.whiteouts" directory xattr is used in order to
efficiently handle overlay checks in readdir(), as we only need to
checks xattrs in affected directories.
The advantage of this kind of whiteout is that they can be escaped
using the standard overlay xattr escaping mechanism. So, a file with a
"overlay.overlay.whiteout" xattr would be unescaped to
"overlay.whiteout", which could then be consumed by another overlayfs
as a whiteout.
Overlayfs itself doesn't create whiteouts like this, but a userspace
mechanism could use this alternative mechanism to convert images that
may contain whiteouts to be used with overlayfs.
To work as a whiteout for both regular overlayfs mounts as well as
userxattr mounts both the "user.overlay.whiteout*" and the
"trusted.overlay.whiteout*" xattrs will need to be created.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
There are cases where you want to use an overlayfs mount as a lowerdir
for another overlayfs mount. For example, if the system rootfs is on
overlayfs due to composefs, or to make it volatile (via tmps), then
you cannot currently store a lowerdir on the rootfs. This means you
can't e.g. store on the rootfs a prepared container image for use
using overlayfs.
To work around this, we introduce an escapment mechanism for overlayfs
xattrs. Whenever the lower/upper dir has a xattr named
"overlay.overlay.XYZ", we list it as "overlay.XYZ" in listxattrs, and
when the user calls getxattr or setxattr on "overlay.XYZ", we apply to
"overlay.overlay.XYZ" in the backing directories.
This allows storing any kind of overlay xattrs in a overlayfs mount
that can be used as a lowerdir in another mount. It is possible to
stack this mechanism multiple times, such that
"overlay.overlay.overlay.XYZ" will survive two levels of overlay mounts,
however this is not all that useful in practice because of stack depth
limitations of overlayfs mounts.
Note: These escaped xattrs are copied to upper during copy-up.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
These match the ones for e.g. XATTR_TRUSTED_PREFIX_LEN.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
This moves the code from super.c and inode.c, and makes ovl_xattr_get/set()
static.
This is in preparation for doing more work on xattrs support.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
When lower fs is a nested overlayfs, calling encode_fh() on a lower
directory dentry may trigger copy up and take sb_writers on the upper fs
of the lower nested overlayfs.
The lower nested overlayfs may have the same upper fs as this overlayfs,
so nested sb_writers lock is illegal.
Move all the callers that encode lower fh to before ovl_want_write().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
overlayfs file open (ovl_maybe_lookup_lowerdata) and overlay file llseek
take the ovl_inode_lock, without holding upper sb_writers.
In case of nested lower overlay that uses same upper fs as this overlay,
lockdep will warn about (possibly false positive) circular lock
dependency when doing open/llseek of lower ovl file during copy up with
our upper sb_writers held, because the locking ordering seems reverse to
the locking order in ovl_copy_up_start():
- lower ovl_inode_lock
- upper sb_writers
Let the copy up "transaction" keeps an elevated mnt write count on upper
mnt, but leaves taking upper sb_writers to lower level helpers only when
they actually need it. This allows to avoid holding upper sb_writers
during lower file open/llseek and prevents the lockdep warning.
Minimizing the scope of upper sb_writers during copy up is also needed
for fixing another possible deadlocks by a following patch.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Make the locking order of ovl_inode_lock() strictly between the two
vfs stacked layers, i.e.:
- ovl vfs locks: sb_writers, inode_lock, ...
- ovl_inode_lock
- upper vfs locks: sb_writers, inode_lock, ...
To that effect, move ovl_want_write() into the helpers ovl_nlink_start()
and ovl_copy_up_start which currently take the ovl_inode_lock() after
ovl_want_write().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
ovl_get_write_access() gets write access to upper mnt without taking
freeze protection on upper sb and ovl_start_write() only takes freeze
protection on upper sb.
These helpers will be used to breakup the large ovl_want_write() scope
during copy up into finer grained freeze protection scopes.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
A simple wrapper for updating ovl inode size/mtime, to conform
with ovl_file_accessed().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
ovl_copyattr() may be called concurrently from aio completion context
without any lock and that could lead to overlay inode attributes getting
permanently out of sync with real inode attributes.
Use ovl inode spinlock to protect ovl_copyattr().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
We want to protect concurrent updates of ovl inode size and mtime
(i.e. ovl_copyattr()) from aio completion context.
Punt write aio completion to a workqueue so that we can protect
ovl_copyattr() with a spinlock.
Export sb_init_dio_done_wq(), so that overlayfs can use its own
dio workqueue to punt aio completions.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/8620dfd3-372d-4ae0-aa3f-2fe97dda1bca@kernel.dk/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
If ovl file is opened O_APPEND, the underlying realfile is also
opened O_APPEND, so it makes sense to propagate the IOCB_APPEND flags
on sync writes to realfile, just as we do with aio writes.
Effectively, because sync ovl writes are protected by inode lock,
this change only makes a difference if the realfile is written to (size
extending writes) from underneath overlayfs. The behavior in this case
is undefined, so it is ok if we change the behavior (to fail the ovl
IOCB_APPEND write).
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Overlayfs implements its own function to translate iocb flags into rw
flags, so that they can be passed into another vfs call.
With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
Jens created a 1:1 matching between the iocb flags and rw flags,
simplifying the conversion.
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Pull initial bcachefs updates from Kent Overstreet:
"Here's the bcachefs filesystem pull request.
One new patch since last week: the exportfs constants ended up
conflicting with other filesystems that are also getting added to the
global enum, so switched to new constants picked by Amir.
The only new non fs/bcachefs/ patch is the objtool patch that adds
bcachefs functions to the list of noreturns. The patch that exports
osq_lock() has been dropped for now, per Ingo"
* tag 'bcachefs-2023-10-30' of https://evilpiepirate.org/git/bcachefs: (2781 commits)
exportfs: Change bcachefs fid_type enum to avoid conflicts
bcachefs: Refactor memcpy into direct assignment
bcachefs: Fix drop_alloc_keys()
bcachefs: snapshot_create_lock
bcachefs: Fix snapshot skiplists during snapshot deletion
bcachefs: bch2_sb_field_get() refactoring
bcachefs: KEY_TYPE_error now counts towards i_sectors
bcachefs: Fix handling of unknown bkey types
bcachefs: Switch to unsafe_memcpy() in a few places
bcachefs: Use struct_size()
bcachefs: Correctly initialize new buckets on device resize
bcachefs: Fix another smatch complaint
bcachefs: Use strsep() in split_devs()
bcachefs: Add iops fields to bch_member
bcachefs: Rename bch_sb_field_members -> bch_sb_field_members_v1
bcachefs: New superblock section members_v2
bcachefs: Add new helper to retrieve bch_member from sb
bcachefs: bucket_lock() is now a sleepable lock
bcachefs: fix crc32c checksum merge byte order problem
bcachefs: Fix bch2_inode_delete_keys()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space on the initial
extent owners (a subvolume), the snapshots are then cheap to create
and delete. The deletion of snapshots in fully accounting qgroups
is a known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary, a
new temporary fsid can be generated on mount e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (with enabled free space
tree by factor of 4), on a sample workload on file with many
extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed, this was a
debugging feature and removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
|
|
Pull fscrypt updates from Eric Biggers:
"This update adds support for configuring the crypto data unit size
(i.e. the granularity of file contents encryption) to be less than the
filesystem block size. This can allow users to use inline encryption
hardware in some cases when it wouldn't otherwise be possible.
In addition, there are two commits that are prerequisites for the
extent-based encryption support that the btrfs folks are working on"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: track master key presence separately from secret
fscrypt: rename fscrypt_info => fscrypt_inode_info
fscrypt: support crypto data unit size less than filesystem block size
fscrypt: replace get_ino_and_lblk_bits with just has_32bit_inodes
fscrypt: compute max_lblk_bits from s_maxbytes and block size
fscrypt: make the bounce page pool opt-in instead of opt-out
fscrypt: make it clearer that key_prefix is deprecated
|
|
Pull nfsd updates from Chuck Lever:
"This release completes the SunRPC thread scheduler work that was begun
in v6.6. The scheduler can now find an svc thread to wake in constant
time and without a list walk. Thanks again to Neil Brown for this
overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based NFSD
control plane. The long-term plan is to provide the same functionality
as found in /proc/fs/nfsd, plus some interesting additions, and then
migrate the NFSD user space utilities to netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was applied
in this release. The goals are to bring this family of encoding
functions in line with the matching NFSv4 decoding functions and with
the NFSv2 and NFSv3 XDR functions, preparing the way for better memory
safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback, enabling the
server to retrieve cached size and mtime data from clients holding
write delegations. If the server can retrieve this information, it
does not have to recall the delegation in some cases.
The usual panoply of bug fixes and minor improvements round out this
release. As always I am grateful to all contributors, reviewers, and
testers"
* tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (127 commits)
svcrdma: Fix tracepoint printk format
svcrdma: Drop connection after an RDMA Read error
NFSD: clean up alloc_init_deleg()
NFSD: Fix frame size warning in svc_export_parse()
NFSD: Rewrite synopsis of nfsd_percpu_counters_init()
nfsd: Clean up errors in nfs3proc.c
nfsd: Clean up errors in nfs4state.c
NFSD: Clean up errors in stats.c
NFSD: simplify error paths in nfsd_svc()
NFSD: Clean up nfsd4_encode_seek()
NFSD: Clean up nfsd4_encode_offset_status()
NFSD: Clean up nfsd4_encode_copy_notify()
NFSD: Clean up nfsd4_encode_copy()
NFSD: Clean up nfsd4_encode_test_stateid()
NFSD: Clean up nfsd4_encode_exchange_id()
NFSD: Clean up nfsd4_do_encode_secinfo()
NFSD: Clean up nfsd4_encode_access()
NFSD: Clean up nfsd4_encode_readdir()
NFSD: Clean up nfsd4_encode_entry4()
NFSD: Add an nfsd4_encode_nfs_cookie4() helper
...
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner:
"The 's_xattr' field of 'struct super_block' currently requires a
mutable table of 'struct xattr_handler' entries (although each handler
itself is const). However, no code in vfs actually modifies the
tables.
This changes the type of 's_xattr' to allow const tables, and modifies
existing file systems to move their tables to .rodata. This is
desirable because these tables contain entries with function pointers
in them; moving them to .rodata makes it considerably less likely to
be modified accidentally or maliciously at runtime"
* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
const_structs.checkpatch: add xattr_handler
net: move sockfs_xattr_handlers to .rodata
shmem: move shmem_xattr_handlers to .rodata
overlayfs: move xattr tables to .rodata
xfs: move xfs_xattr_handlers to .rodata
ubifs: move ubifs_xattr_handlers to .rodata
squashfs: move squashfs_xattr_handlers to .rodata
smb: move cifs_xattr_handlers to .rodata
reiserfs: move reiserfs_xattr_handlers to .rodata
orangefs: move orangefs_xattr_handlers to .rodata
ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
ntfs3: move ntfs_xattr_handlers to .rodata
nfs: move nfs4_xattr_handlers to .rodata
kernfs: move kernfs_xattr_handlers to .rodata
jfs: move jfs_xattr_handlers to .rodata
jffs2: move jffs2_xattr_handlers to .rodata
hfsplus: move hfsplus_xattr_handlers to .rodata
hfs: move hfs_xattr_handlers to .rodata
gfs2: move gfs2_xattr_handlers_max to .rodata
fuse: move fuse_xattr_handlers to .rodata
...
|
|
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.
This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.
Is it far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.
This change already makes file_dentry() moot, but for now we did
not change this helper just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.
After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().
- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.
Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
files since a very long time because of the nasty interactions
between the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.
I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:
https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Cleanups:
- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().
- Annotate struct watch_filter with the new __counted_by attribute.
- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.
- Simplify file cleanup when the file has never been opened.
- Use module helper instead of open-coding it.
- Predict error unlikely for stale retry.
- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.
Fixes:
- Fix readahead on block devices.
- Fix writeback when layztime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.
- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull autofs mount api updates from Christian Brauner:
"This ports autofs to the new mount api. The patchset has existed for
quite a while but never made it upstream. Ian picked it back up.
This also fixes a bug where fs_param_is_fd() was passed a garbage
param->dirfd but it expected it to be set to the fd that was used to
set param->file otherwise result->uint_32 contains nonsense. So make
sure it's set.
One less filesystem using the old mount api. We're getting there,
albeit rather slow. The last remaining major filesystem that hasn't
converted is btrfs. Patches exist - I even wrote them - but so far
they haven't made it upstream"
* tag 'vfs-6.7.autofs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
autofs: fix add autofs_parse_fd()
fsconfig: ensure that dirfd is set to aux
autofs: fix protocol sub version setting
autofs: convert autofs to use the new mount api
autofs: validate protocol version
autofs: refactor parse_options()
autofs: reformat 0pt enum declaration
autofs: refactor super block info init
autofs: add autofs_parse_fd()
autofs: refactor autofs_prepare_pipe()
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs superblock updates from Christian Brauner:
"This contains the work to make block device opening functions return a
struct bdev_handle instead of just a struct block_device. The same
struct bdev_handle is then also passed to block device closing
functions.
This allows us to propagate context from opening to closing a block
device without having to modify all users everytime.
Sidenote, in the future we might even want to try and have block
device opening functions return a struct file directly but that's a
series on top of this.
These are further preparatory changes to be able to count writable
opens and blocking writes to mounted block devices. That's a separate
piece of work for next cycle and for that we absolutely need the
changes to btrfs that have been quietly dropped somehow.
Originally the series contained a patch that removed the old
blkdev_*() helpers. But since this would've caused needles churn in
-next for bcachefs we ended up delaying it.
The second piece of work addresses one of the major annoyances about
the work last cycle, namely that we required dropping s_umount
whenever we used the superblock and fs_holder_ops for a block device.
The reason for that requirement had been that in some codepaths
s_umount could've been taken under disk->open_mutex (that's always
been the case, at least theoretically). For example, on surprise block
device removal or media change. And opening and closing block devices
required grabbing disk->open_mutex as well.
So we did the work and went through the block layer and fixed all
those places so that s_umount is never taken under disk->open_mutex.
This means no more brittle games where we yield and reacquire s_umount
during block device opening and closing and no more requirements where
block devices need to be closed. Filesystems don't need to care about
this.
There's a bunch of other follow-up work such as moving block device
freezing and thawing to holder operations which makes it work for all
block devices and not just the main block device just as we did for
surprise removal. But that is for next cycle.
Tested with fstests for all major fses, blktests, LTP"
* tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
porting: update locking requirements
fs: assert that open_mutex isn't held over holder ops
block: assert that we're not holding open_mutex over blk_report_disk_dead
block: move bdev_mark_dead out of disk_check_media_change
block: WARN_ON_ONCE() when we remove active partitions
block: simplify bdev_del_partition()
fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
jfs: fix log->bdev_handle null ptr deref in lbmStartIO
bcache: Fixup error handling in register_cache()
xfs: Convert to bdev_open_by_path()
reiserfs: Convert to bdev_open_by_dev/path()
ocfs2: Convert to use bdev_open_by_dev()
nfs/blocklayout: Convert to use bdev_open_by_dev/path()
jfs: Convert to bdev_open_by_dev()
f2fs: Convert to bdev_open_by_dev/path()
ext4: Convert to bdev_open_by_dev()
erofs: Convert to use bdev_open_by_path()
btrfs: Convert to bdev_open_by_path()
fs: Convert to bdev_open_by_dev()
mm/swap: Convert to use bdev_open_by_dev()
...
|
|
Add structs and defines for new SMB3.1.1 command, server to client notification.
See MS-SMB2 section 2.2.44
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Update comments to show flags which should be not set (zero).
See MS-SMB2 section 2.2.13
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The length of dentry name is calculated after the raw name is encrypted,
except for ubifs_link(), which could make the size of dir underflow.
Here is a reproducer:
touch $TMP/file
mkdir $TMP/dir
stat $TMP/dir
for i in $(seq 1 8)
do
ln $TMP/file $TMP/dir/$i
unlink $TMP/dir/$i
done
stat $TMP/dir
The size of dir will be underflow(-96).
Fix it by calculating dentry name's length after the name is encrypted.
Fixes: f4f61d2cc6d8 ("ubifs: Implement encrypted filenames")
Reported-by: Roland Ruckerbauer <roland.ruckerbauer@robart.cc>
Link: https://lore.kernel.org/linux-mtd/1638777819.2925845.1695222544742.JavaMail.zimbra@robart.cc/T/#u
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
'old_idx' could be dereferenced after free via 'rb_link_node' function
call.
Fixes: b5fda08ef213 ("ubifs: Fix memleak when insert_old_idx() failed")
Co-developed-by: Ivanov Mikhail <ivanov.mikhail1@huawei-partners.com>
Signed-off-by: Konstantin Meskhidze <konstantin.meskhidze@huawei.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Fix smatch warning:
fs/ubifs/journal.c:1610 ubifs_jnl_truncate() warn: missing error code
'err'
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Ensure that the allocated bud->log_hash (if any) is freed in all cases
when the bud itself is freed, to fix this leak caught by kmemleak:
# keyctl add logon foo:bar data @s
# echo clear > /sys/kernel/debug/kmemleak
# mount -t ubifs /dev/ubi0_0 mnt -o auth_hash_name=sha256,auth_key=foo:bar
# echo a > mnt/x
# umount mnt
# mount -t ubifs /dev/ubi0_0 mnt -o auth_hash_name=sha256,auth_key=foo:bar
# umount mnt
# sleep 5
# echo scan > /sys/kernel/debug/kmemleak
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xff... (size 128):
comm "mount"
backtrace:
__kmalloc
__ubifs_hash_get_desc+0x5d/0xe0 ubifs
ubifs_replay_journal
ubifs_mount
...
Fixes: da8ef65f9573 ("ubifs: Authenticate replayed journal")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Add description of @time and @flags in ubifs_update_time().
to silence the warnings:
fs/ubifs/file.c:1383: warning: Function parameter or member 'time' not described in 'ubifs_update_time'
fs/ubifs/file.c:1383: warning: Function parameter or member 'flags' not described in 'ubifs_update_time'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5848
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Many of the filesystems that call the generic exportfs helpers do not
select the EXPORTFS config.
Move generic_encode_ino32_fh() to libfs.c, same as generic_fh_to_*()
to avoid having to fix all those config dependencies.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202310262151.renqMvme-lkp@intel.com/
Fixes: dfaf653dc415 ("exportfs: make ->encode_fh() a mandatory method for NFS export")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231026204540.143217-1-amir73il@gmail.com
Tested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The majority of blockdev filesystems, which do not have a UUID in their
on-disk format, derive f_fsid of statfs(2) from bdev->bd_dev.
Use the same practice for freevxfs.
This will allow reporting fanotify events with fanotify_event_info_fid.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231024121457.3014063-1-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There are many "simple" filesystems (*) that report null f_fsid in
statfs(2). Those "simple" filesystems report sb->s_dev as the st_dev
field of the stat syscalls for all inodes of the filesystem (**).
In order to enable fanotify reporting of events with fsid on those
"simple" filesystems, report the sb->s_dev number in f_fsid field of
statfs(2).
(*) For most of the "simple" filesystem refered to in this commit, the
->statfs() operation is simple_statfs(). Some of those fs assign the
simple_statfs() method directly in their ->s_op struct and some assign it
indirectly via a call to simple_fill_super() or to pseudo_fs_fill_super()
with either custom or "simple" s_op.
We also make the same change to efivarfs and hugetlbfs, although they do
not use simple_statfs(), because they use the simple_* inode opreations
(e.g. simple_lookup()).
(**) For most of the "simple" filesystems, the ->getattr() method is not
assigned, so stat() is implemented by generic_fillattr(). A few "simple"
filesystem use the simple_getattr() method which also calls
generic_fillattr() to fill most of the stat struct.
The two exceptions are procfs and 9p. procfs implements several different
->getattr() methods, but they all end up calling generic_fillattr() to
fill the st_dev field from sb->s_dev.
9p has more complicated ->getattr() methods, but they too, end up calling
generic_fillattr() to fill the st_dev field from sb->s_dev.
Note that 9p and kernfs also call simple_statfs() from custom ->statfs()
methods which already fill the f_fsid field, but v9fs_statfs() calls
simple_statfs() only in case f_fsid was not filled and kenrfs_statfs()
overwrites f_fsid after calling simple_statfs().
Link: https://lore.kernel.org/r/20230919094820.g5bwharbmy2dq46w@quack3/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023143049.2944970-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
AT_HANDLE_FID was added as an API for name_to_handle_at() that request
the encoding of a file id, which is not intended to be decoded.
This file id is used by fanotify to describe objects in events.
So far, overlayfs is the only filesystem that supports encoding
non-decodeable file ids, by providing export_operations with an
->encode_fh() method and without a ->decode_fh() method.
Add support for encoding non-decodeable file ids to all the filesystems
that do not provide export_operations, by encoding a file id of type
FILEID_INO64_GEN from { i_ino, i_generation }.
A filesystem may that does not support NFS export, can opt-out of
encoding non-decodeable file ids for fanotify by defining an empty
export_operations struct (i.e. with a NULL ->encode_fh() method).
This allows the use of fanotify events with file ids on filesystems
like 9p which do not support NFS export to bring fanotify in feature
parity with inotify on those filesystems.
Note that fanotify also requires that the filesystems report a non-null
fsid. Currently, many simple filesystems that have support for inotify
(e.g. debugfs, tracefs, sysfs) report a null fsid, so can still not be
used with fanotify in file id reporting mode.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-5-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Similar to the common FILEID_INO32* file handle types, define common
FILEID_INO64* file handle types.
The type values of FILEID_INO64_GEN and FILEID_INO64_GEN_PARENT are the
values returned by fuse and xfs for 64bit ino encoded file handle types.
Note that these type value are filesystem specific and they do not define
a universal file handle format, for example:
fuse encodes FILEID_INO64_GEN as [ino-hi32,ino-lo32,gen] and xfs encodes
FILEID_INO64_GEN as [hostr-order-ino64,gen] (a.k.a xfs_fid64).
The FILEID_INO64_GEN fhandle type is going to be used for file ids for
fanotify from filesystems that do not support NFS export.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Rename the default helper for encoding FILEID_INO32_GEN* file handles to
generic_encode_ino32_fh() and convert the filesystems that used the
default implementation to use the generic helper explicitly.
After this change, exportfs_encode_inode_fh() no longer has a default
implementation to encode FILEID_INO32_GEN* file handles.
This is a step towards allowing filesystems to encode non-decodeable
file handles for fanotify without having to implement any
export_operations.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-3-amir73il@gmail.com
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With recent block level changes we should never be in a situation where
we hold disk->open_mutex when calling into these helpers. So assert that
in the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231017184823.1383356-6-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The implementation of bdev holder operations such as fs_bdev_mark_dead()
and fs_bdev_sync() grab sb->s_umount semaphore under
bdev->bd_holder_lock. This is problematic because it leads to
disk->open_mutex -> sb->s_umount lock ordering which is counterintuitive
(usually we grab higher level (e.g. filesystem) locks first and lower
level (e.g. block layer) locks later) and indeed makes lockdep complain
about possible locking cycles whenever we open a block device while
holding sb->s_umount semaphore. Implement a function
bdev_super_lock_shared() which safely transitions from holding
bdev->bd_holder_lock to holding sb->s_umount on alive superblock without
introducing the problematic lock dependency. We use this function
fs_bdev_sync() and fs_bdev_mark_dead().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz
Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When sbi->flag is JFS_NOINTEGRITY in lmLogOpen(), log->bdev_handle can't
be inited, so it value will be NULL.
Therefore, add the "log ->no_integrity=1" judgment in lbmStartIO() to avoid such
problems.
Reported-and-tested-by: syzbot+23bc20037854bb335d59@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Link: https://lore.kernel.org/r/20231009094557.1398920-1-lizhi.xu@windriver.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert xfs to use bdev_open_by_path() and pass the handle around.
CC: "Darrick J. Wong" <djwong@kernel.org>
CC: linux-xfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert reiserfs to use bdev_open_by_dev/path() and pass the handle
around.
CC: reiserfs-devel@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-27-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert ocfs2 heartbeat code to use bdev_open_by_dev() and pass the
handle around.
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-26-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert block device handling to use bdev_open_by_dev/path() and pass
the handle around.
CC: linux-nfs@vger.kernel.org
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-25-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert jfs to use bdev_open_by_dev() and pass the handle around.
CC: Dave Kleikamp <shaggy@kernel.org>
CC: jfs-discussion@lists.sourceforge.net
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-24-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert f2fs to use bdev_open_by_dev/path() and pass the handle around.
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: Chao Yu <chao@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-23-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert ext4 to use bdev_open_by_dev() and pass the handle around.
CC: linux-ext4@vger.kernel.org
CC: Ted Tso <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-22-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|