summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-02-25fs/super.c: don't drop ->s_user_ns until we free struct super_block itselfAl Viro
Avoids fun races in RCU pathwalk... Same goes for freeing LSM shite hanging off super_block's arse. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-24bcachefs: Fix check_snapshot() memcpyKent Overstreet
check_snapshot() copies the bch_snapshot to a temporary to easily handle older versions that don't have all the fields of the current version, but it lacked a min() to correctly handle keys newer and larger than the current version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24bcachefs: Fix bch2_journal_flush_device_pins()Kent Overstreet
If a journal write errored, the list of devices it was written to could be empty - we're not supposed to mark an empty replicas list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24bcachefs: fix iov_iter count underflow on sub-block dio readBrian Foster
bch2_direct_IO_read() checks the request offset and size for sector alignment and then falls through to a couple calculations to shrink the size of the request based on the inode size. The problem is that these checks round up to the fs block size, which runs the risk of underflowing iter->count if the block size happens to be large enough. This is triggered by fstest generic/361 with a 4k block size, which subsequently leads to a crash. To avoid this crash, check that the shorten length doesn't exceed the overall length of the iter. Fixes: Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btreeKent Overstreet
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the keyspace where no keys are visible in the current snapshot, we have a problem - we'll scan for a very long time before scanning terminates. Awhile back, this was fixed for most cases with peek_upto() (and assertions that enforce that it's being used). But the fix missed the fact that the inodes btree is different - every key offset is in a different snapshot tree, not just the inode field. Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24bcachefs: Kill __GFP_NOFAIL in buffered read pathKent Overstreet
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the easy one in read_single_folio() (where wa can return an error) was missed - oops. Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24bcachefs: fix backpointer_to_text() when dev does not existKent Overstreet
Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-23fscrypt: shrink the size of struct fscrypt_inode_info slightlyEric Biggers
Shrink the size of struct fscrypt_inode_info by 8 bytes by packing the small fields into the 64 bits after ci_enc_key. Link: https://lore.kernel.org/r/20240224060103.91037-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-02-23fscrypt: write CBC-CTS instead of CTS-CBCEric Biggers
Calling CBC with ciphertext stealing "CBC-CTS" seems to be more common than calling it "CTS-CBC". E.g., CBC-CTS is used by OpenSSL, Crypto++, RFC3962, and RFC6803. The NIST SP800-38A addendum uses CBC-CS1, CBC-CS2, and CBC-CS3, distinguishing between different CTS conventions but similarly putting the CBC part first. In the interest of avoiding any idiosyncratic terminology, update the fscrypt documentation and the fscrypt_mode "friendly names" to align with the more common convention. Changing the "friendly names" only affects some log messages. The actual mode constants in the API are unchanged; those call it simply "CTS". Add a note to the documentation that clarifies that "CBC" and "CTS" in the API really mean CBC-ESSIV and CBC-CTS, respectively. Link: https://lore.kernel.org/r/20240224053550.44659-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-02-24xfs: fix log recovery erroring out on refcount recovery failureDarrick J. Wong
Per the comment in the error case of xfs_reflink_recover_cow, zero out any error (after shutting down the log) so that we actually kill any new intent items that might have gotten logged by later recovery steps. Discovered by xfs/434, which few people actually seem to run. Fixes: 2c1e31ed5c88 ("xfs: place intent recovery under NOFS allocation context") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-23crash: split vmcoreinfo exporting code out from crash_core.cBaoquan He
Now move the relevant codes into separate files: kernel/crash_reserve.c, include/linux/crash_reserve.h. And add config item CRASH_RESERVE to control its enabling. And also update the old ifdeffery of CONFIG_CRASH_CORE, including of <linux/crash_core.h> and config item dependency on CRASH_CORE accordingly. And also do renaming as follows: - arch/xxx/kernel/{crash_core.c => vmcore_info.c} because they are only related to vmcoreinfo exporting on x86, arm64, riscv. And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to decide if build in crash_core.c. [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c] Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23exec: Delete unnecessary statements in remove_arg_zero()Li kunyu
'ret=0; ' In actual operation, the ret was not modified, so this sentence can be removed. Signed-off-by: Li kunyu <kunyu@nfschina.com> Link: https://lore.kernel.org/r/20240220052426.62018-1-kunyu@nfschina.com Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-23fuse: introduce FUSE_PASSTHROUGH capabilityAmir Goldstein
FUSE_PASSTHROUGH capability to passthrough FUSE operations to backing files will be made available with kernel config CONFIG_FUSE_PASSTHROUGH. When requesting FUSE_PASSTHROUGH, userspace needs to specify the max_stack_depth that is allowed for FUSE on top of backing files. Introduce the flag FOPEN_PASSTHROUGH and backing_id to fuse_open_out argument that can be used when replying to OPEN request, to setup passthrough of io operations on the fuse inode to a backing file. Introduce a refcounted fuse_backing object that will be used to associate an open backing file with a fuse inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: factor out helper for FUSE_DEV_IOC_CLONEAmir Goldstein
In preparation to adding more fuse dev ioctls. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAPAmir Goldstein
Instead of denying caching mode on parallel dio open, deny caching open only while parallel dio are in-progress and wait for in-progress parallel dio writes before entering inode caching io mode. This allows executing parallel dio when inode is not in caching mode even if shared mmap is allowed, but no mmaps have been performed on the inode in question. An mmap on direct_io file now waits for all in-progress parallel dio writes to complete, so parallel dio writes together with FUSE_DIRECT_IO_ALLOW_MMAP is enabled by this commit. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: introduce inode io modesAmir Goldstein
The fuse inode io mode is determined by the mode of its open files/mmaps and parallel dio opens and expressed in the value of fi->iocachectr: > 0 - caching io: files open in caching mode or mmap on direct_io file < 0 - parallel dio: direct io mode with parallel dio writes enabled == 0 - direct io: no files open in caching mode and no files mmaped Note that iocachectr value of 0 might become positive or negative, while non-parallel dio is getting processed. direct_io mmap uses page cache, so first mmap will mark the file as ff->io_opened and increment fi->iocachectr to enter the caching io mode. If the server opens the file in caching mode while it is already open for parallel dio or vice versa the open fails. This allows executing parallel dio when inode is not in caching mode and no mmaps have been performed on the inode in question. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: prepare for failing open responseAmir Goldstein
In preparation for inode io modes, a server open response could fail due to conflicting inode io modes. Allow returning an error from fuse_finish_open() and handle the error in the callers. fuse_finish_open() is used as the callback of finish_open(), so that FMODE_OPENED will not be set if fuse_finish_open() fails. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: break up fuse_open_common()Amir Goldstein
fuse_open_common() has a lot of code relevant only for regular files and O_TRUNC in particular. Copy the little bit of remaining code into fuse_dir_open() and stop using this common helper for directory open. Also split out fuse_dir_finish_open() from fuse_finish_open() before we add inode io modes to fuse_finish_open(). Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: allocate ff->release_args only if release is neededAmir Goldstein
This removed the need to pass isdir argument to fuse_put_file(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: factor out helper fuse_truncate_update_attr()Amir Goldstein
fuse_finish_open() is called from fuse_open_common() and from fuse_create_open(). In the latter case, the O_TRUNC flag is always cleared in finish_open()m before calling into fuse_finish_open(). Move the bits that update attribute cache post O_TRUNC open into a helper and call this helper from fuse_open_common() directly. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: add fuse_dio_lock/unlock helper functionsBernd Schubert
So far this is just a helper to remove complex locking logic out of fuse_direct_write_iter. Especially needed by the next patch in the series to that adds the fuse inode cache IO mode and adds in even more locking complexity. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: create helper function if DIO write needs exclusive lockBernd Schubert
This makes the code a bit easier to read and allows to more easily add more conditions when an exclusive lock is needed. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23fuse: fix VM_MAYSHARE and direct_io_allow_mmapBernd Schubert
There were multiple issues with direct_io_allow_mmap: - fuse_link_write_file() was missing, resulting in warnings in fuse_write_file_get() and EIO from msync() - "vma->vm_ops = &fuse_file_vm_ops" was not set, but especially fuse_page_mkwrite is needed. The semantics of invalidate_inode_pages2() is so far not clearly defined in fuse_file_mmap. It dates back to commit 3121bfe76311 ("fuse: fix "direct_io" private mmap") Though, as direct_io_allow_mmap is a new feature, that was for MAP_PRIVATE only. As invalidate_inode_pages2() is calling into fuse_launder_folio() and writes out dirty pages, it should be safe to call invalidate_inode_pages2 for MAP_PRIVATE and MAP_SHARED as well. Cc: Hao Xu <howeyxu@tencent.com> Cc: stable@vger.kernel.org Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode") Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23virtiofs: emit uevents on filesystem eventsStefan Hajnoczi
Alyssa Ross <hi@alyssa.is> requested that virtiofs notifies userspace when filesytems become available. This can be used to detect when a filesystem with a given tag is hotplugged, for example. uevents allow userspace to detect changes without resorting to polling. The tag is included as a uevent property so it's easy for userspace to identify the filesystem in question even when the sysfs directory goes away during removal. Here are example uevents: # udevadm monitor -k -p KERNEL[111.113221] add /fs/virtiofs/2 (virtiofs) ACTION=add DEVPATH=/fs/virtiofs/2 SUBSYSTEM=virtiofs TAG=test KERNEL[165.527167] remove /fs/virtiofs/2 (virtiofs) ACTION=remove DEVPATH=/fs/virtiofs/2 SUBSYSTEM=virtiofs TAG=test Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23virtiofs: export filesystem tags through sysfsStefan Hajnoczi
The virtiofs filesystem is mounted using a "tag" which is exported by the virtiofs device: # mount -t virtiofs <tag> /mnt The virtiofs driver knows about all the available tags but these are currently not exported to user space. People have asked for these tags to be exported to user space. Most recently Lennart Poettering has asked for it as he wants to scan the tags and mount virtiofs automatically in certain cases. https://gitlab.com/virtio-fs/virtiofsd/-/issues/128 This patch exports tags at /sys/fs/virtiofs/<N>/tag where N is the id of the virtiofs device. The filesystem tag can be obtained by reading this "tag" file. There is also a symlink at /sys/fs/virtiofs/<N>/device that points to the virtiofs device that exports this tag. This patch converts the existing struct virtio_fs into a full kobject. It already had a refcount so it's an easy change. The virtio_fs objects can then be exposed in a kset at /sys/fs/virtiofs/. Note that virtio_fs objects may live slightly longer than we wish for them to be exposed to userspace, so kobject_del() is called explicitly when the underlying virtio_device is removed. The virtio_fs object is freed when all references are dropped (e.g. active mounts) but disappears as soon as the virtiofs device is gone. Originally-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23virtiofs: forbid newlines in tagsStefan Hajnoczi
Newlines in virtiofs tags are awkward for users and potential vectors for string injection attacks. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-02-23sysfs: Fix crash on empty group attributes arrayDan Williams
It turns out that arch/x86/events/intel/core.c makes use of "empty" attributes. static struct attribute *empty_attrs; __init int intel_pmu_init(void) { struct attribute **extra_skl_attr = &empty_attrs; struct attribute **extra_attr = &empty_attrs; struct attribute **td_attr = &empty_attrs; struct attribute **mem_attr = &empty_attrs; struct attribute **tsx_attr = &empty_attrs; ... That breaks the assumption __first_visible() that expects that if grp->attrs is set then grp->attrs[0] must also be set and results in backtraces like: BUG: kernel NULL pointer dereference, address: 00rnel mode #PF: error_code(0x0000) - not-present ] PREEMPT SMP NOPTI CPU: 1 PID: 1 Comm: swapper/IP: 0010:exra_is_visible+0x14/0x20 ? exc_page_fault+0x68/0x190 internal_create_groups+0x42/0xa0 pmu_dev_alloc+0xc0/0xe0 perf_event_sysfs_init+0x580000000000 ]--- RIP: 0010:exra_is_visible+0x14/0 Check for non-empty attributes array before calling is_visible(). Reported-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Closes: https://github.com/thesofproject/linux/pull/4799#issuecomment-1958537212 Fixes: 70317fd24b41 ("sysfs: Introduce a mechanism to hide static attribute_groups") Cc: Marc Herbert <marc.herbert@intel.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Marc Herbert <marc.herbert@intel.com> Link: https://lore.kernel.org/r/170863445442.1479840.1818801787239831650.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-22fat: fix uninitialized field in nostale filehandlesJan Kara
When fat_encode_fh_nostale() encodes file handle without a parent it stores only first 10 bytes of the file handle. However the length of the file handle must be a multiple of 4 so the file handle is actually 12 bytes long and the last two bytes remain uninitialized. This is not great at we potentially leak uninitialized information with the handle to userspace. Properly initialize the full handle length. Link: https://lkml.kernel.org/r/20240205122626.13701-1-jack@suse.cz Reported-by: syzbot+3ce5dea5b1539ff36769@syzkaller.appspotmail.com Fixes: ea3983ace6b7 ("fat: restructure export_operations") Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Amir Goldstein <amir73il@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert cpfile to use kmap_localRyusuke Konishi
Convert all remaining usages of kmap_atomic in cpfile to kmap_local. Link: https://lkml.kernel.org/r/20240122140202.6950-16-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: remove nilfs_cpfile_{get,put}_checkpoint()Ryusuke Konishi
All calls to nilfs_cpfile_get_checkpoint() and nilfs_cpfile_put_checkpoint() that call kmap() and kunmap() separately are now gone, so remove these methods. Link: https://lkml.kernel.org/r/20240122140202.6950-15-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: localize highmem mapping for checkpoint reading within cpfileRyusuke Konishi
Move the code for reading from a checkpoint entry that is performed in nilfs_attach_checkpoint() to the cpfile side, and make the page mapping local and temporary. And use kmap_local instead of kmap to access the checkpoint entry page. In order to load the ifile inode information included in the checkpoint entry within the inode lock section of nilfs_ifile_read(), the newly added checkpoint reading method nilfs_cpfile_read_checkpoint() is called indirectly via nilfs_ifile_read() instead of from nilfs_attach_checkpoint(). Link: https://lkml.kernel.org/r/20240122140202.6950-14-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: localize highmem mapping for checkpoint finalization within cpfileRyusuke Konishi
Move the checkpoint finalization routine to the cpfile side, and make the page mapping local and temporary. And use kmap_local instead of kmap to access the checkpoint entry page when finalizing a checkpoint. In this conversion, some of the information on the checkpoint entry being rewritten is passed through the arguments of the newly added method nilfs_cpfile_finalize_checkpoint(). Link: https://lkml.kernel.org/r/20240122140202.6950-13-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: localize highmem mapping for checkpoint creation within cpfileRyusuke Konishi
In order to convert kmap() used in cpfile to kmap_local, first move the checkpoint creation routine, which is one of the places where kmap is used, to the cpfile side and make the page mapping local and temporary. And use kmap_local instead of kmap to access the checkpoint entry page (and header block page) when generating a checkpoint. Link: https://lkml.kernel.org/r/20240122140202.6950-12-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert ifile to use kmap_localRyusuke Konishi
Convert deprecated kmap() and kmap_atomic() to use kmap_local for the ifile metadata file used to manage disk inodes. In some usages, calls to kmap_local and kunmap_local are split into different helpers, but those usages can be safely changed to local thread kmap. Link: https://lkml.kernel.org/r/20240122140202.6950-11-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: do not acquire rwsem in nilfs_bmap_write()Ryusuke Konishi
It is now clear that nilfs_bmap_write() is only used to finalize logs written to disk. Concurrent bmap modification operations are not performed on bmaps in this context. Additionally, this function does not modify data used in read-only operations such as bmap lookups. Therefore, there is no need to acquire bmap->b_sem in nilfs_bmap_write(), so delete it. Link: https://lkml.kernel.org/r/20240122140202.6950-10-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: move nilfs_bmap_write call out of nilfs_write_inode_commonRyusuke Konishi
Before converting the disk inode management metadata file ifile, the call to nilfs_bmap_write(), the i_device_code setting, and the zero-fill code for inodes on the super root block are moved from nilfs_write_inode_common() to its callers. This cleanup simplifies the role and arguments of nilfs_write_inode_common() and collects calls to nilfs_bmap_write() to the log writing code. Also, add and use a new helper nilfs_write_root_mdt_inode() to avoid code duplication in the data export routine nilfs_segctor_fill_in_super_root() to the super root block's buffer. Link: https://lkml.kernel.org/r/20240122140202.6950-9-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert DAT to use kmap_localRyusuke Konishi
Concerning the code of the metadata file DAT for disk address translation, convert all parts that use the deprecated kmap_atomic to use kmap_local. All transformations are directly possible. Link: https://lkml.kernel.org/r/20240122140202.6950-8-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert persistent object allocator to use kmap_localRyusuke Konishi
Regarding the allocator code that is commonly used in the ondisk inode metadata file ifile and the disk address translation metadata file DAT, convert the parts that use the deprecated kmap_atomic() and kmap() to use kmap_local. Most can be converted directly, but only nilfs_palloc_prepare_alloc_entry() needs to be rewritten to change mapping sections so that multiple kmap_local/kunmap_local calls are nested and disk I/O can be avoided within the mapping sections. Link: https://lkml.kernel.org/r/20240122140202.6950-7-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert sufile to use kmap_localRyusuke Konishi
Concerning the code of the metadata file sufile for segment management, convert all parts that uses the deprecated kmap_atomic() to use kmap_local. All transformations are directly possible here. Link: https://lkml.kernel.org/r/20240122140202.6950-6-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert metadata file common code to use kmap_localRyusuke Konishi
In the common code of metadata files, the new block creation routine nilfs_mdt_insert_new_block() still uses the deprecated kmap_atomic(), so convert it to use kmap_local. Link: https://lkml.kernel.org/r/20240122140202.6950-5-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert nilfs_copy_buffer() to use kmap_localRyusuke Konishi
The routine nilfs_copy_buffer() that copies a block buffer still uses the deprecated kmap_atomic(), so convert it to use kmap_local. Link: https://lkml.kernel.org/r/20240122140202.6950-4-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert segment buffer to use kmap_localRyusuke Konishi
In the segment buffer code used for log writing, a CRC calculation routine uses the deprecated kmap_atomic(), so convert it to use kmap_local. Link: https://lkml.kernel.org/r/20240122140202.6950-3-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22nilfs2: convert recovery logic to use kmap_localRyusuke Konishi
Patch series "nilfs2: eliminate kmap and kmap_atomic calls". This series converts remaining kmap and kmap_atomic calls to use kmap_local, mainly in metadata files, and eliminates calls to these deprecated kmap functions from nilfs2. This series does not include converting metadata files to use folios, but it is a step in that direction. Most conversions are straightforward, but some are not: the checkpoint file, the inode file, and the persistent object allocator. These have been adjusted or rewritten to avoid multiple kmap_local calls or nest them if necessary, and to eliminate long waits like block I/O within the highmem mapping sections. This series has been tested in both 32-bit and 64-bit environments with varying block sizes. This patch (of 15): In the recovery function when mounting, nilfs_recovery_copy_block() uses the deprecated kmap_atomic(), so convert it to use kmap_local. Link: https://lkml.kernel.org/r/20240122140202.6950-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240122140202.6950-2-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22ocfs2: spelling fixYongzhen Zhang
Modify reques to request in the comment. Link: https://lkml.kernel.org/r/20240108015604.38377-1-zhangyongzhen@kylinos.cn Signed-off-by: Yongzhen Zhang <zhangyongzhen@kylinos.cn> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" flag") 56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)") Adjacent changes: net/unix/garbage.c aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.") 11498715f266 ("af_unix: Remove io_uring code for GC.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22userfaultfd: use per-vma locks in userfaultfd operationsLokesh Gidra
All userfaultfd operations, except write-protect, opportunistically use per-vma locks to lock vmas. On failure, attempt again inside mmap_lock critical section. Write-protect operation requires mmap_lock as it iterates over multiple vmas. Link: https://lkml.kernel.org/r/20240215182756.3448972-5-lokeshgidra@google.com Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctxLokesh Gidra
Increments and loads to mmap_changing are always in mmap_lock critical section. This ensures that if userspace requests event notification for non-cooperative operations (e.g. mremap), userfaultfd operations don't occur concurrently. This can be achieved by using a separate read-write semaphore in userfaultfd_ctx such that increments are done in write-mode and loads in read-mode, thereby eliminating the dependency on mmap_lock for this purpose. This is a preparatory step before we replace mmap_lock usage with per-vma locks in fill/move ioctls. Link: https://lkml.kernel.org/r/20240215182756.3448972-3-lokeshgidra@google.com Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22userfaultfd: move userfaultfd_ctx struct to header fileLokesh Gidra
Patch series "per-vma locks in userfaultfd", v7. Performing userfaultfd operations (like copy/move etc.) in critical section of mmap_lock (read-mode) causes significant contention on the lock when operations requiring the lock in write-mode are taking place concurrently. We can use per-vma locks instead to significantly reduce the contention issue. Android runtime's Garbage Collector uses userfaultfd for concurrent compaction. mmap-lock contention during compaction potentially causes jittery experience for the user. During one such reproducible scenario, we observed the following improvements with this patch-set: - Wall clock time of compaction phase came down from ~3s to <500ms - Uninterruptible sleep time (across all threads in the process) was ~10ms (none in mmap_lock) during compaction, instead of >20s This patch (of 4): Move the struct to userfaultfd_k.h to be accessible from mm/userfaultfd.c. There are no other changes in the struct. This is required to prepare for using per-vma locks in userfaultfd operations. Link: https://lkml.kernel.org/r/20240215182756.3448972-1-lokeshgidra@google.com Link: https://lkml.kernel.org/r/20240215182756.3448972-2-lokeshgidra@google.com Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22dax: check for data cache aliasing at runtimeMathieu Desnoyers
Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Link: https://lkml.kernel.org/r/20240215144633.96437-8-mathieu.desnoyers@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Alasdair Kergon <agk@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: Michael Sclafani <dm-devel@lists.linux.dev> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatalMathieu Desnoyers
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Link: https://lkml.kernel.org/r/20240215144633.96437-7-mathieu.desnoyers@efficios.com Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Cc: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>