summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2023-11-06gfs2: Get rid of gfs2_alloc_blocks generation parameterAndreas Gruenbacher
Get rid of the generation parameter of gfs2_alloc_blocks(): we only ever set the generation of the current inode while creating it, so do so directly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-11-05Merge tag 'ubifs-for-linus-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBI and UBIFS updates from Richard Weinberger: - UBI Fastmap improvements - Minor issues found by static analysis bots in both UBI and UBIFS - Fix for wrong dentry length UBIFS in fscrypt mode * tag 'ubifs-for-linus-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: ubifs_link: Fix wrong name len calculating when UBIFS is encrypted ubi: block: Fix use-after-free in ubiblock_cleanup ubifs: fix possible dereference after free ubi: fastmap: Add control in 'UBI_IOCATT' ioctl to reserve PEBs for filling pools ubi: fastmap: Add module parameter to control reserving filling pool PEBs ubi: fastmap: Fix lapsed wear leveling for first 64 PEBs ubi: fastmap: Get wl PEB even ec beyonds the 'max' if free PEBs are run out ubi: fastmap: may_reserve_for_fm: Don't reserve PEB if fm_anchor exists ubi: fastmap: Remove unneeded break condition while filling pools ubi: fastmap: Wait until there are enough free PEBs before filling pools ubi: fastmap: Use free pebs reserved for bad block handling ubi: Replace erase_block() with sync_erase() ubi: fastmap: Allocate memory with GFP_NOFS in ubi_update_fastmap ubi: fastmap: erase_block: Get erase counter from wl_entry rather than flash ubi: fastmap: Fix missed ec updating after erasing old fastmap data block ubifs: Fix missing error code err ubifs: Fix memory leak of bud->log_hash ubifs: Fix some kernel-doc comments
2023-11-05bcachefs: Improve stripe checksum error messageKent Overstreet
We now include the name of the device in the error message - and also increment the number of checksum errors on that device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: Simplify, fix bch2_backpointer_get_key()Kent Overstreet
- backpointer_not_found() checks backpointers_no_use_write_buffer, no need to do it inbackpointer_get_key(). - always use backpointer_get_node() for pointers to nodes: backpointer_get_key() was sometimes returning the key from the root node unlocked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: kill thing_it_points_to arg to backpointer_not_found()Kent Overstreet
This can be calculated locally. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: bch2_ec_read_extent() now takes btree_transKent Overstreet
We're not supposed to have more than one btree_trans at a time in a given thread - that causes recursive locking deadlocks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: bch2_stripe_to_text() now prints ptr gensKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: Don't iterate over journal entries just for btree rootsKent Overstreet
Small performance optimization, and a bit of a code cleanup too. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: Break up bch2_journal_write()Kent Overstreet
Split up bch2_journal_write() to simplify locking: - bch2_journal_write_pick_flush(), which needs j->lock - bch2_journal_write_prep, which operates on the journal buffer to be written and will need the upcoming buf_lock for synchronization with the btree write buffer flush path Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: Replace ERANGE with private error codesKent Overstreet
We avoid using standard error codes: private, per-callsite error codes make debugging easier. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: bkey_copy() is no longer a macroKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: x-macro-ify inode flags enumKent Overstreet
This lets us use bch2_prt_bitflags to print them out. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: Convert bch2_fs_open() to darrayKent Overstreet
Open coded dynamic arrays are deprecated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: Move __bch2_members_v2_get_mut to sb-members.hKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05bcachefs: bch2_prt_datetime()Kent Overstreet
Improved, better named version of pr_time(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: CONFIG_BCACHEFS_DEBUG_TRANSACTIONS no longer defaults to yKent Overstreet
BCACHEFS_DEBUG_TRANSACTIONS is useful, but it's too expensive to have on by default - and it hasn't been coming up in bug reports. Turn it off by default until we figure out a way to make it cheaper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Add a comment for BTREE_INSERT_NOJOURNAL usageKent Overstreet
BTREE_INSERT_NOJOURNAL is primarily used for a performance optimization related to inode updates and fsync - document it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: rebalance_work btree is not a snapshots btreeKent Overstreet
rebalance_work entries may refer to entries in the extents btree, which is a snapshots btree, or they may also refer to entries in the reflink btree, which is not. Hence rebalance_work keys may use the snapshot field but it's not required to be nonzero - add a new btree flag to reflect this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Add missing printk newlinesKent Overstreet
This was causing error messages in -tools to not get printed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Fix recovery when forced to use JSET_NO_FLUSH journal entryKent Overstreet
When we didn't find anything in the journal that we'd like to use, and we're forced to use whatever we can find - that entry will have been a JSET_NO_FLUSH entry with a garbage last_seq value, since it's not normally used. Initialize it to something sane, for bch2_fs_journal_start(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: .get_parent() should return an error pointerKent Overstreet
Delete the useless check for inum == 0; we'll return -ENOENT without it, which is what we want. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Fix bch2_delete_dead_inodes()Kent Overstreet
- the fsck_err() check for the filesystem being clean was incorrect, causing us to always fail to delete unlinked inodes - if a snapshot had been taken, the unlinked inode needs to be propagated to snapshot leaves so the unlink can happen there - fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: use swab40 for bch_backpointer.bucket_offset bitfieldBrian Foster
The bucket_offset field of bch_backpointer is a 40-bit bitfield, but the bch2_backpointer_swab() helper uses swab32. This leads to inconsistency when an on-disk fs is accessed from an opposite endian machine. As it turns out, we already have an internal swab40() helper that is used from the bch_alloc_v4 swab callback. Lift it into the backpointers header file and use it consistently in both places. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: byte order swap bch_alloc_v4.fragmentation_lru fieldBrian Foster
A simple test to populate a filesystem on one CPU architecture and fsck on an arch of the opposite byte order produces errors related to the fragmentation LRU. This occurs because the 64-bit fragmentation_lru field is not byte-order swapped when reads detect that the on-disk/bset key values were written in opposite byte-order of the current CPU. Update the bch2_alloc_v4 swab callback to handle fragmentation_lru as is done for other multi-byte fields. This doesn't affect existing filesystems when accessed by CPUs of the same endianness because the ->swab() callback is only called when the bset flags indicate an endianness mismatch between the CPU and on-disk data. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: allow writeback to fill bio completelyBrian Foster
The bcachefs folio writeback code includes a bio full check as well as a fixed size check to determine when to split off and submit writeback I/O. The inclusive check of the latter against the limit means that writeback can submit slightly prematurely. This is not a functional problem, but results in unnecessarily split I/Os and extent merging. This can be observed with a buffered write sized exactly to the current maximum value (1MB) and with key_merging_disabled=1. The latter prevents the merge from the second write such that a subsequent check of the extent list shows a 1020k extent followed by a contiguous 4k extent. The purpose for the fixed size check is also undocumented and somewhat obscure. Lift this check into a new helper that wraps the bio check, fix the comparison logic, and add a comment to document the purpose and how we might improve on this in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: fix odebug warn and lockdep splat due to on-stack rhashtableBrian Foster
Guenter Roeck reports a lockdep splat and DEBUG_OBJECTS_WORK related warning when bch2_copygc_thread() initializes its rhashtable. The lockdep splat relates to a warning print caused by the fact that the rhashtable exists on the stack but is not annotated as so. This is something that could be addressed by INIT_WORK_ONSTACK(), but rhashtable doesn't expose that control and probably isnt worth the churn for just one user. Instead, dynamically allocate the buckets_in_flight structure and avoid the splat that way. Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: update alloc cursor in early bucket allocatorBrian Foster
A recent bug report uncovered a scenario where a filesystem never runs with freespace_initialized, and therefore the user observes significantly degraded write performance by virtue of running the early bucket allocator. The associated bug aside, the primary cause of the performance drop in this particular instance is that the early bucket allocator does not update the allocation cursor. This means that every allocation walks the alloc btree from the first bucket of the associated device looking for a bucket marked as free space. Update the early allocator code to set the alloc cursor to the last processed position in the tree, similar to how the freelist allocator behaves. With the alloc_cursor being updated, the retry logic also needs to be updated to restart from the beginning of the device when a free bucket is not available between the cursor and the end of the device. Track the restart position in a first_bucket variable to make the code a bit more easily readable and consistent with the freelist allocator. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: serialize on cached key in early bucket allocatorBrian Foster
bcachefs had a transient bug where freespace_initialized was not properly being set, which lead to unexpected use of the early bucket allocator at runtime. This issue has been fixed, but the existence of it uncovered a coherency issue in the early bucket allocation code that is somewhat related to how uncached iterators deal with the key cache. The problem itself manifests as occasional failure of generic/113 due to corruption, often seen as a duplicate backpointer or multiple data types per-bucket error. The immediate cause of the error is a racing bucket allocation along the lines of the following sequence: - Task 1 selects key A in bch2_bucket_alloc_early() and schedules. - Task 2 selects the same key A, but proceeds to complete the allocation and associated I/O, after which it releases the open_bucket. - Task 1 resumes with key A, but does not recognize the bucket is now allocated because the open_bucket has been removed from the hash when it was released in the previous step. This generally shouldn't happen because the allocating task updates the alloc btree key before releasing the bucket. This is not sufficient in this particular instance, however, because an uncached iterator for a cached btree doesn't actually lock the key cache slot when no key exists for a given slot in the cache. Thus the fact that the allocation side updates the cached key means that multiple uncached iters can stumble across the same alloc key and duplicate the bucket allocation as described above. This is something that probably needs a longer term fix in the iterator code. As a short term fix, close the race through explicit use of a cached iterator for likely allocation candidates. We don't want to scan the btree with a cached iterator because that would unnecessarily pollute the cache. This mitigates cache pollution by primarily scanning the tree with an uncached iterator, but closes the race by creating a key cache entry for any prospective slot prior to the bucket allocation attempt (also similar to how _alloc_freelist() works via try_alloc_bucket()). This survives many iterations of generic/113 on a kernel hacked to always use the early bucket allocator. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Data move path now uses bch2_trans_unlock_long()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04Merge tag 'f2fs-for-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this cycle, we introduce a bigger page size support by changing the internal f2fs's block size aligned to the page size. We also continue to improve zoned block device support regarding the power off recovery. As usual, there are some bug fixes regarding the error handling routines in compression and ioctl. Enhancements: - Support Block Size == Page Size - let f2fs_precache_extents() traverses in file range - stop iterating f2fs_map_block if hole exists - preload extent_cache for POSIX_FADV_WILLNEED - compress: fix to avoid fragment w/ OPU during f2fs_ioc_compress_file() Bug fixes: - do not return EFSCORRUPTED, but try to run online repair - finish previous checkpoints before returning from remount - fix error handling of __get_node_page and __f2fs_build_free_nids - clean up zones when not successfully unmounted - fix to initialize map.m_pblk in f2fs_precache_extents() - fix to drop meta_inode's page cache in f2fs_put_super() - set the default compress_level on ioctl - fix to avoid use-after-free on dic - fix to avoid redundant compress extension - do sanity check on cluster when CONFIG_F2FS_CHECK_FS is on - fix deadloop in f2fs_write_cache_pages()" * tag 'f2fs-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: finish previous checkpoints before returning from remount f2fs: fix error handling of __get_node_page f2fs: do not return EFSCORRUPTED, but try to run online repair f2fs: fix error path of __f2fs_build_free_nids f2fs: Clean up errors in segment.h f2fs: clean up zones when not successfully unmounted f2fs: let f2fs_precache_extents() traverses in file range f2fs: avoid format-overflow warning f2fs: fix to initialize map.m_pblk in f2fs_precache_extents() f2fs: Support Block Size == Page Size f2fs: stop iterating f2fs_map_block if hole exists f2fs: preload extent_cache for POSIX_FADV_WILLNEED f2fs: set the default compress_level on ioctl f2fs: compress: fix to avoid fragment w/ OPU during f2fs_ioc_compress_file() f2fs: fix to drop meta_inode's page cache in f2fs_put_super() f2fs: split initial and dynamic conditions for extent_cache f2fs: compress: fix to avoid redundant compress extension f2fs: compress: do sanity check on cluster when CONFIG_F2FS_CHECK_FS is on f2fs: compress: fix to avoid use-after-free on dic f2fs: compress: fix deadloop in f2fs_write_cache_pages()
2023-11-04Merge tag '9p-for-6.7-rc1' of https://github.com/martinetd/linuxLinus Torvalds
Pull 9p updates from Dominique Martinet: A bunch of small fixes: - three W=1 warning fixes: the NULL -> "" replacement isn't trivial but is serialized identically by the protocol layer and has been tested - one syzbot/KCSAN datarace annotation where we don't care about users messing with the fd they passed to mount -t 9p - removing a declaration without implementation - yet another race fix for trans_fd around connection close: the 'err' field is also used in potentially racy calls and this isn't complete, but it's better than what we had - and finally a theorical memory leak fix on serialization failure" * tag '9p-for-6.7-rc1' of https://github.com/martinetd/linux: 9p/net: fix possible memory leak in p9_check_errors() 9p/fs: add MODULE_DESCRIPTION 9p/net: xen: fix false positive printf format overflow warning 9p: v9fs_listxattr: fix %s null argument warning 9p/trans_fd: Annotate data-racy writes to file::f_flags fs/9p: Remove unused function declaration v9fs_inode2stat() 9p/trans_fd: avoid sending req to a cancelled conn
2023-11-04Merge tag '6.7-rc-smb3-client-fixes-part1' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull smb client updates from Steve French: - use after free fixes and deadlock fix - symlink timestamp fix - hashing perf improvement - multichannel fixes - minor debugging improvements - fix creating fifos when using "sfu" mounts - NTLMSSP authentication improvement - minor fixes to include some missing create flags and structures from recently updated protocol documentation * tag '6.7-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: cifs: force interface update before a fresh session setup cifs: do not reset chan_max if multichannel is not supported at mount cifs: reconnect helper should set reconnect for the right channel smb: client: fix use-after-free in smb2_query_info_compound() smb: client: remove extra @chan_count check in __cifs_put_smb_ses() cifs: add xid to query server interface call cifs: print server capabilities in DebugData smb: use crypto_shash_digest() in symlink_hash() smb: client: fix use-after-free bug in cifs_debug_data_proc_show() smb: client: fix potential deadlock when releasing mids smb3: fix creating FIFOs when mounting with "sfu" mount option Add definition for new smb3.1.1 command type SMB3: clarify some of the unused CreateOption flags cifs: Add client version details to NTLM authenticate message smb3: fix touch -h of symlink
2023-11-04Merge tag 'efi-next-for-v6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI update from Ard Biesheuvel: "This is the only remaining EFI change, as everything else was taken via -tip this cycle: - implement uid/gid mount options for efivarfs" * tag 'efi-next-for-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: Add uid/gid mount options
2023-11-04bcachefs: Ensure srcu lock is not held too longKent Overstreet
The SRCU read lock that btree_trans takes exists to make it safe for bch2_trans_relock() to deref pointers to btree nodes/key cache items we don't have locked, but as a side effect it blocks reclaim from freeing those items. Thus, it's important to not hold it for too long: we need to differentiate between bch2_trans_unlock() calls that will be only for a short duration, and ones that will be for an unbounded duration. This introduces bch2_trans_unlock_long(), to be used mainly by the data move paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Fix build errors with gcc 10Kent Overstreet
gcc 10 seems to complain about array bounds in situations where gcc 11 does not - curious. This unfortunately requires adding some casts for now; we may investigate getting rid of our __u64 _data[] VLA in a future patch so that our start[0] members can be VLAs. Reported-by: John Stoffel <john@stoffel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Fix MEAN_AND_VARIANCE kconfig optionsKent Overstreet
Fixes: https://lore.kernel.org/linux-bcachefs/CAMuHMdXpwMdLuoWsNGa8qacT_5Wv-vSTz0xoBR5n_fnD9cNOuQ@mail.gmail.com/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-04bcachefs: Ensure copygc does not spinKent Overstreet
If copygc does no work - finds no fragmented buckets - wait for a bit of IO to happen. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-03Merge tag 'driver-core-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core updates for 6.7-rc1. Nothing major in here at all, just a small number of changes including: - minor cleanups and updates from Andy Shevchenko - __counted_by addition - firmware_loader update for aborting loads cleaner - other minor changes, details in the shortlog - documentation update All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits) firmware_loader: Abort all upcoming firmware load request once reboot triggered firmware_loader: Refactor kill_pending_fw_fallback_reqs() Documentation: security-bugs.rst: linux-distros relaxed their rules driver core: Release all resources during unbind before updating device links driver core: class: remove boilerplate code driver core: platform: Annotate struct irq_affinity_devres with __counted_by resource: Constify resource crosscheck APIs resource: Unify next_resource() and next_resource_skip_children() resource: Reuse for_each_resource() macro PCI: Implement custom llseek for sysfs resource entries kernfs: sysfs: support custom llseek method for sysfs entries debugfs: Fix __rcu type comparison warning device property: Replace custom implementation of COUNT_ARGS() drivers: base: test: Make property entry API test modular driver core: Add missing parameter description to __fwnode_link_add() device property: Clarify usage scope of some struct fwnode_handle members devres: rename the first parameter of devm_add_action(_or_reset) driver core: platform: Unify the firmware node type check driver core: platform: Use temporary variable in platform_device_add() driver core: platform: Refactor error path in a couple places ...
2023-11-03ceph: allow idmapped mountsChristian Brauner
Now that we converted cephfs internally to account for idmapped mounts allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP flag. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: allow idmapped atomic_open inode opChristian Brauner
Enable ceph_atomic_open() to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. [ aleksandr.mikhalitsyn: adapted to 5fadbd9929 ("ceph: rely on vfs for setgid stripping") ] Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: allow idmapped set_acl inode opChristian Brauner
Enable ceph_set_acl() to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: allow idmapped setattr inode opChristian Brauner
Enable __ceph_setattr() to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. [ aleksandr.mikhalitsyn: adapted to b27c82e12965 ("attr: port attribute changes to new types") ] Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: pass idmap to __ceph_setattrAlexander Mikhalitsyn
Just pass down the mount's idmapping to __ceph_setattr, because we will need it later. Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: allow idmapped permission inode opChristian Brauner
Enable ceph_permission() to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: allow idmapped getattr inode opChristian Brauner
Enable ceph_getattr() to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: pass an idmapping to mknod/symlink/mkdirChristian Brauner
Enable mknod/symlink/mkdir iops to handle idmapped mounts. This is just a matter of passing down the mount's idmapping. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: add enable_unsafe_idmap module parameterAlexander Mikhalitsyn
This parameter is used to decide if we allow to perform IO on idmapped mount in case when MDS lacks support of CEPHFS_FEATURE_HAS_OWNER_UIDGID feature. In this case we can't properly handle MDS permission checks and if UID/GID-based restrictions are enabled on the MDS side then IO requests which go through an idmapped mount may fail with -EACCESS/-EPERM. Fortunately, for most of users it's not a case and everything should work fine. But we put work "unsafe" in the module parameter name to warn users about possible problems with this feature and encourage update of cephfs MDS. Suggested-by: Stéphane Graber <stgraber@ubuntu.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: handle idmapped mounts in create_request_message()Christian Brauner
Inode operations that create a new filesystem object such as ->mknod, ->create, ->mkdir() and others don't take a {g,u}id argument explicitly. Instead the caller's fs{g,u}id is used for the {g,u}id of the new filesystem object. In order to ensure that the correct {g,u}id is used map the caller's fs{g,u}id for creation requests. This doesn't require complex changes. It suffices to pass in the relevant idmapping recorded in the request message. If this request message was triggered from an inode operation that creates filesystem objects it will have passed down the relevant idmaping. If this is a request message that was triggered from an inode operation that doens't need to take idmappings into account the initial idmapping is passed down which is an identity mapping. This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID which adds two new fields (owner_{u,g}id) to the request head structure. So, we need to ensure that MDS supports it otherwise we need to fail any IO that comes through an idmapped mount because we can't process it in a proper way. MDS server without such an extension will use caller_{u,g}id fields to set a new inode owner UID/GID which is incorrect because caller_{u,g}id values are unmapped. At the same time we can't map these fields with an idmapping as it can break UID/GID-based permission checks logic on the MDS side. This problem was described with a lot of details at [1], [2]. [1] https://lore.kernel.org/lkml/CAEivzxfw1fHO2TFA4dx3u23ZKK6Q+EThfzuibrhA3RKM=ZOYLg@mail.gmail.com/ [2] https://lore.kernel.org/all/20220104140414.155198-3-brauner@kernel.org/ Link: https://github.com/ceph/ceph/pull/52575 Link: https://tracker.ceph.com/issues/62217 Co-Developed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03ceph: stash idmapping in mdsc requestChristian Brauner
When sending a mds request cephfs will send relevant data for the requested operation. For creation requests the caller's fs{g,u}id is used to set the ownership of the newly created filesystem object. For setattr requests the caller can pass in arbitrary {g,u}id values to which the relevant filesystem object is supposed to be changed. If the caller is performing the relevant operation via an idmapped mount cephfs simply needs to take the idmapping into account when it sends the relevant mds request. In order to support idmapped mounts for cephfs we stash the idmapping whenever they are relevant for the operation for the duration of the request. Since mds requests can be queued and performed asynchronously we make sure to keep the idmapping around and release it once the request has finished. In follow-up patches we will use this to send correct ownership information over the wire. This patch just adds the basic infrastructure to keep the idmapping around. The actual conversion patches are all fairly minimal. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-11-03fs: export mnt_idmap_get/mnt_idmap_putAlexander Mikhalitsyn
These helpers are required to support idmapped mounts in CephFS. Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>