summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2023-05-22Merge tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Stable Fix: - Don't change task->tk_status after the call to rpc_exit_task Other Bugfixes: - Convert kmap_atomic() to kmap_local_folio() - Fix a potential double free with READ_PLUS" * tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.2: Fix a potential double free with READ_PLUS SUNRPC: Don't change task->tk_status after the call to rpc_exit_task NFS: Convert kmap_atomic() to kmap_local_folio()
2023-05-22fs: dlm: fix cleanup pending ops when interruptedAlexander Aring
Immediately clean up a posix lock request if it is interrupted while waiting for a result from user space (dlm_controld.) This largely reverts the recent commit b92a4e3f86b1 ("fs: dlm: change posix lock sigint handling"). That previous commit attempted to defer lock cleanup to the point in time when a result from user space arrived. The deferred approach was not reliable because some dlm plock ops may not receive replies. Cc: stable@vger.kernel.org Fixes: b92a4e3f86b1 ("fs: dlm: change posix lock sigint handling") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22fs: dlm: return positive pid value for F_GETLKAlexander Aring
The GETLK pid values have all been negated since commit 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks"). Revert this for local pids, and leave in place negative pids for remote owners. Cc: stable@vger.kernel.org Fixes: 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22dlm: Replace all non-returning strlcpy with strscpyAzeem Shaikh
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22exportfs: add explicit flag to request non-decodeable file handlesAmir Goldstein
So far, all callers of exportfs_encode_inode_fh(), except for fsnotify's show_mark_fhandle(), check that filesystem can decode file handles, but we would like to add more callers that do not require a file handle that can be decoded. Introduce a flag to explicitly request a file handle that may not to be decoded later and a wrapper exportfs_encode_fid() that sets this flag and convert show_mark_fhandle() to use the new wrapper. This will be used to allow adding fanotify support to filesystems that do not support NFS export. Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230502124817.3070545-3-amir73il@gmail.com>
2023-05-22exportfs: change connectable argument to bit flagsAmir Goldstein
Convert the bool connectable arguemnt into a bit flags argument and define the EXPORT_FS_CONNECTABLE flag as a requested property of the file handle. We are going to add a flag for requesting non-decodeable file handles. Acked-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230502124817.3070545-2-amir73il@gmail.com>
2023-05-21Merge tag '6.4-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull ksmbd server fixes from Steve French: - two fixes for incorrect SMB3 message validation (one for client which uses 8 byte padding, and one for empty bcc) - two fixes for out of bounds bugs: one for username offset checks (in session setup) and the other for create context name length checks in open requests * tag '6.4-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: smb2: Allow messages padded to 8byte boundary ksmbd: allocate one more byte for implied bcc[0] ksmbd: fix wrong UserName check in session_user ksmbd: fix global-out-of-bounds in smb2_find_context_vals
2023-05-21Merge tag '6.4-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs client fixes from Steve French: "Two smb3 client fixes, both related to deferred close, and also for stable: - send close for deferred handles before not after lease break response to avoid possible sharing violations - check all opens on an inode (looking for deferred handles) when lease break is returned not just the handle the lease break came in on" * tag '6.4-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: SMB3: drop reference to cfile before sending oplock break SMB3: Close all deferred handles of inode in case of handle lease break
2023-05-19fs: remove the special !CONFIG_BLOCK def_blk_fopsChristoph Hellwig
def_blk_fops always returns -ENODEV, which dosn't match the return value of a non-existing block device with CONFIG_BLOCK, which is -ENXIO. Just remove the extra implementation and fall back to the default no_open_fops that always returns -ENXIO. Fixes: 9361401eb761 ("[PATCH] BLOCK: Make it possible to disable the block layer [try #6]") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230508144405.41792-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19NFSv4.2: Fix a potential double free with READ_PLUSAnna Schumaker
kfree()-ing the scratch page isn't enough, we also need to set the pointer back to NULL to avoid a double-free in the case of a resend. Fixes: fbd2a05f29a9 (NFSv4.2: Rework scratch handling for READ_PLUS) Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19NFS: Convert kmap_atomic() to kmap_local_folio()Fabio M. De Francesco
kmap_atomic() is deprecated in favor of kmap_local_{folio,page}(). Therefore, replace kmap_atomic() with kmap_local_folio() in nfs_readdir_folio_array_append(). kmap_atomic() disables page-faults and preemption (the latter only for !PREEMPT_RT kernels), However, the code within the mapping/un-mapping in nfs_readdir_folio_array_append() does not depend on the above-mentioned side effects. Therefore, a mere replacement of the old API with the new one is all that is required (i.e., there is no need to explicitly add any calls to pagefault_disable() and/or preempt_disable()). Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with HIGHMEM64GB enabled. Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Fixes: ec108d3cc766 ("NFS: Convert readdir page array functions to use a folio") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19Merge tag 'ceph-for-6.4-rc3' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A workaround for a just discovered bug in MClientSnap encoding which goes back to 2017 (marked for stable) and a fixup to quieten a static checker" * tag 'ceph-for-6.4-rc3' of https://github.com/ceph/ceph-client: ceph: force updating the msg pointer in non-split case ceph: silence smatch warning in reconnect_caps_cb()
2023-05-19Merge tag 's390-6.4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Alexander Gordeev: - Add check whether the required facilities are installed before using the s390-specific ChaCha20 implementation - Key blobs for s390 protected key interface IOCTLs commands PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key material. Zeroize copies of these keys in kernel memory after creating protected keys - Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra overhead of initializing all stack variables by default - Make sure that when a new channel-path is enabled all subchannels are evaluated: with and without any devices connected on it - When SMT thread CPUs are added to CPU topology masks the nr_cpu_ids limit is not checked and could be exceeded. Respect the nr_cpu_ids limit and avoid a warning when CONFIG_DEBUG_PER_CPU_MAPS is set - The pointer to IPL Parameter Information Block is stored in the absolute lowcore as a virtual address. Save it as the physical address for later use by dump tools - Fix a Queued Direct I/O (QDIO) problem on z/VM guests using QIOASSIST with dedicated (pass through) QDIO-based devices such as FCP, real OSA or HiperSockets - s390's struct statfs and struct statfs64 contain padding, which field-by-field copying does not set. Initialize the respective structures with zeros before filling them and copying to userspace - Grow s390 compat_statfs64, statfs and statfs64 structures f_spare array member to cover padding and simplify things - Remove obsolete SCHED_BOOK and SCHED_DRAWER configs - Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs * tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER} s390/uapi: cover statfs padding by growing f_spare statfs: enforce statfs[64] structure initialization s390/qdio: fix do_sqbs() inline assembly constraint s390/ipl: fix IPIB virtual vs physical address confusion s390/topology: honour nr_cpu_ids when adding CPUs s390/cio: include subchannels without devices also for evaluation s390/defconfigs: set CONFIG_INIT_STACK_NONE=y s390/pkey: zeroize key blobs s390/crypto: use vector instructions only if available for ChaCha20
2023-05-19ntfs: Remove unneeded semicolonShaomin Deng
Remove the unneeded semicolon after curly braces. Signed-off-by: Shaomin Deng <dengshaomin@cdjrlc.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Message-Id: <20221105153135.5975-1-dengshaomin@cdjrlc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19ntfs: Correct spellingDeming Wang
We should use this replace thie. Signed-off-by: Deming Wang <wangdeming@inspur.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Message-Id: <20230206091815.1687-1-wangdeming@inspur.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19ntfs: remove redundant initialization to pointer cb_sb_startColin Ian King
The pointer cb_sb_start is being initialized with a value that is never read, it is being re-assigned the same value later on when it is first being used. The initialization is redundant and can be removed. Cleans up clang scan build warning: fs/ntfs/compress.c:164:6: warning: Value stored to 'cb_sb_start' during its initialization is never read [deadcode.DeadStores] u8 *cb_sb_start = cb; /* Beginning of the current sb in the cb. */ Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Message-Id: <20230418153607.3125704-1-colin.i.king@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19fs: allow to mount beneath top mountChristian Brauner
Various distributions are adding or are in the process of adding support for system extensions and in the future configuration extensions through various tools. A more detailed explanation on system and configuration extensions can be found on the manpage which is listed below at [1]. System extension images may – dynamically at runtime — extend the /usr/ and /opt/ directory hierarchies with additional files. This is particularly useful on immutable system images where a /usr/ and/or /opt/ hierarchy residing on a read-only file system shall be extended temporarily at runtime without making any persistent modifications. When one or more system extension images are activated, their /usr/ and /opt/ hierarchies are combined via overlayfs with the same hierarchies of the host OS, and the host /usr/ and /opt/ overmounted with it ("merging"). When they are deactivated, the mount point is disassembled — again revealing the unmodified original host version of the hierarchy ("unmerging"). Merging thus makes the extension's resources suddenly appear below the /usr/ and /opt/ hierarchies as if they were included in the base OS image itself. Unmerging makes them disappear again, leaving in place only the files that were shipped with the base OS image itself. System configuration images are similar but operate on directories containing system or service configuration. On nearly all modern distributions mount propagation plays a crucial role and the rootfs of the OS is a shared mount in a peer group (usually with peer group id 1): TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID / / ext4 shared:1 29 1 On such systems all services and containers run in a separate mount namespace and are pivot_root()ed into their rootfs. A separate mount namespace is almost always used as it is the minimal isolation mechanism services have. But usually they are even much more isolated up to the point where they almost become indistinguishable from containers. Mount propagation again plays a crucial role here. The rootfs of all these services is a slave mount to the peer group of the host rootfs. This is done so the service will receive mount propagation events from the host when certain files or directories are updated. In addition, the rootfs of each service, container, and sandbox is also a shared mount in its separate peer group: TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID / / ext4 shared:24 master:1 71 47 For people not too familiar with mount propagation, the master:1 means that this is a slave mount to peer group 1. Which as one can see is the host rootfs as indicated by shared:1 above. The shared:24 indicates that the service rootfs is a shared mount in a separate peer group with peer group id 24. A service may run other services. Such nested services will also have a rootfs mount that is a slave to the peer group of the outer service rootfs mount. For containers things are just slighly different. A container's rootfs isn't a slave to the service's or host rootfs' peer group. The rootfs mount of a container is simply a shared mount in its own peer group: TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID /home/ubuntu/debian-tree / ext4 shared:99 61 60 So whereas services are isolated OS components a container is treated like a separate world and mount propagation into it is restricted to a single well known mount that is a slave to the peer group of the shared mount /run on the host: TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID /propagate/debian-tree /run/host/incoming tmpfs master:5 71 68 Here, the master:5 indicates that this mount is a slave to the peer group with peer group id 5. This allows to propagate mounts into the container and served as a workaround for not being able to insert mounts into mount namespaces directly. But the new mount api does support inserting mounts directly. For the interested reader the blogpost in [2] might be worth reading where I explain the old and the new approach to inserting mounts into mount namespaces. Containers of course, can themselves be run as services. They often run full systems themselves which means they again run services and containers with the exact same propagation settings explained above. The whole system is designed so that it can be easily updated, including all services in various fine-grained ways without having to enter every single service's mount namespace which would be prohibitively expensive. The mount propagation layout has been carefully chosen so it is possible to propagate updates for system extensions and configurations from the host into all services. The simplest model to update the whole system is to mount on top of /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc will then propagate into every service. This works cleanly the first time. However, when the system is updated multiple times it becomes necessary to unmount the first update on /opt, /usr, /etc and then propagate the new update. But this means, there's an interval where the old base system is accessible. This has to be avoided to protect against downgrade attacks. The vfs already exposes a mechanism to userspace whereby mounts can be mounted beneath an existing mount. Such mounts are internally referred to as "tucked". The patch series exposes the ability to mount beneath a top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount() system call. This allows userspace to seamlessly upgrade mounts. After this series the only thing that will have changed is that mounting beneath an existing mount can be done explicitly instead of just implicitly. Today, there are two scenarios where a mount can be mounted beneath an existing mount instead of on top of it: (1) When a service or container is started in a new mount namespace and pivot_root()s into its new rootfs. The way this is done is by mounting the new rootfs beneath the old rootfs: fd_newroot = open("/var/lib/machines/fedora", ...); fd_oldroot = open("/", ...); fchdir(fd_newroot); pivot_root(".", "."); After the pivot_root(".", ".") call the new rootfs is mounted beneath the old rootfs which can then be unmounted to reveal the underlying mount: fchdir(fd_oldroot); umount2(".", MNT_DETACH); Since pivot_root() moves the caller into a new rootfs no mounts must be propagated out of the new rootfs as a consequence of the pivot_root() call. Thus, the mounts cannot be shared. (2) When a mount is propagated to a mount that already has another mount mounted on the same dentry. The easiest example for this is to create a new mount namespace. The following commands will create a mount namespace where the rootfs mount / will be a slave to the peer group of the host rootfs / mount's peer group. IOW, it will receive propagation from the host: mount --make-shared / unshare --mount --propagation=slave Now a new mount on the /mnt dentry in that mount namespace is created. (As it can be confusing it should be spelled out that the tmpfs mount on the /mnt dentry that was just created doesn't propagate back to the host because the rootfs mount / of the mount namespace isn't a peer of the host rootfs.): mount -t tmpfs tmpfs /mnt TARGET SOURCE FSTYPE PROPAGATION └─/mnt tmpfs tmpfs Now another terminal in the host mount namespace can observe that the mount indeed hasn't propagated back to into the host mount namespace. A new mount can now be created on top of the /mnt dentry with the rootfs mount / as its parent: mount --bind /opt /mnt TARGET SOURCE FSTYPE PROPAGATION └─/mnt /dev/sda2[/opt] ext4 shared:1 The mount namespace that was created earlier can now observe that the bind mount created on the host has propagated into it: TARGET SOURCE FSTYPE PROPAGATION └─/mnt /dev/sda2[/opt] ext4 master:1 └─/mnt tmpfs tmpfs But instead of having been mounted on top of the tmpfs mount at the /mnt dentry the /opt mount has been mounted on top of the rootfs mount at the /mnt dentry. And the tmpfs mount has been remounted on top of the propagated /opt mount at the /opt dentry. So in other words, the propagated mount has been mounted beneath the preexisting mount in that mount namespace. Mount namespaces make this easy to illustrate but it's also easy to mount beneath an existing mount in the same mount namespace (The following example assumes a shared rootfs mount / with peer group id 1): mount --bind /opt /opt TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION └─/opt /dev/sda2[/opt] ext4 188 29 shared:1 If another mount is mounted on top of the /opt mount at the /opt dentry: mount --bind /tmp /opt The following clunky mount tree will result: TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION └─/opt /dev/sda2[/tmp] ext4 405 29 shared:1 └─/opt /dev/sda2[/opt] ext4 188 405 shared:1 └─/opt /dev/sda2[/tmp] ext4 404 188 shared:1 The /tmp mount is mounted beneath the /opt mount and another copy is mounted on top of the /opt mount. This happens because the rootfs / and the /opt mount are shared mounts in the same peer group. When the new /tmp mount is supposed to be mounted at the /opt dentry then the /tmp mount first propagates to the root mount at the /opt dentry. But there already is the /opt mount mounted at the /opt dentry. So the old /opt mount at the /opt dentry will be mounted on top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root is /opt which is what shows up as /opt under SOURCE). So again, a mount will be mounted beneath a preexisting mount. (Fwiw, a few iterations of mount --bind /opt /opt in a loop on a shared rootfs is a good example of what could be referred to as mount explosion.) The main point is that such mounts allows userspace to umount a top mount and reveal an underlying mount. So for example, umounting the tmpfs mount on /mnt that was created in example (1) using mount namespaces reveals the /opt mount which was mounted beneath it. In (2) where a mount was mounted beneath the top mount in the same mount namespace unmounting the top mount would unmount both the top mount and the mount beneath. In the process the original mount would be remounted on top of the rootfs mount / at the /opt dentry again. This again, is a result of mount propagation only this time it's umount propagation. However, this can be avoided by simply making the parent mount / of the @opt mount a private or slave mount. Then the top mount and the original mount can be unmounted to reveal the mount beneath. These two examples are fairly arcane and are merely added to make it clear how mount propagation has effects on current and future features. More common use-cases will just be things like: mount -t btrfs /dev/sdA /mnt mount -t xfs /dev/sdB --beneath /mnt umount /mnt after which we'll have updated from a btrfs filesystem to a xfs filesystem without ever revealing the underlying mountpoint. The crux is that the proposed mechanism already exists and that it is so powerful as to cover cases where mounts are supposed to be updated with new versions. Crucially, it offers an important flexibility. Namely that updates to a system may either be forced or can be delayed and the umount of the top mount be left to a service if it is a cooperative one. This adds a new flag to move_mount() that allows to explicitly move a beneath the top mount adhering to the following semantics: * Mounts cannot be mounted beneath the rootfs. This restriction encompasses the rootfs but also chroots via chroot() and pivot_root(). To mount a mount beneath the rootfs or a chroot, pivot_root() can be used as illustrated above. * The source mount must be a private mount to force the kernel to allocate a new, unused peer group id. This isn't a required restriction but a voluntary one. It avoids repeating a semantical quirk that already exists today. If bind mounts which already have a peer group id are inserted into mount trees that have the same peer group id this can cause a lot of mount propagation events to be generated (For example, consider running mount --bind /opt /opt in a loop where the parent mount is a shared mount.). * Avoid getting rid of the top mount in the kernel. Cooperative services need to be able to unmount the top mount themselves. This also avoids a good deal of additional complexity. The umount would have to be propagated which would be another rather expensive operation. So namespace_lock() and lock_mount_hash() would potentially have to be held for a long time for both a mount and umount propagation. That should be avoided. * The path to mount beneath must be mounted and attached. * The top mount and its parent must be in the caller's mount namespace and the caller must be able to mount in that mount namespace. * The caller must be able to unmount the top mount to prove that they could reveal the underlying mount. * The propagation tree is calculated based on the destination mount's parent mount and the destination mount's mountpoint on the parent mount. Of course, if the parent of the destination mount and the destination mount are shared mounts in the same peer group and the mountpoint of the new mount to be mounted is a subdir of their ->mnt_root then both will receive a mount of /opt. That's probably easier to understand with an example. Assuming a standard shared rootfs /: mount --bind /opt /opt mount --bind /tmp /opt will cause the same mount tree as: mount --bind /opt /opt mount --beneath /tmp /opt because both / and /opt are shared mounts/peers in the same peer group and the /opt dentry is a subdirectory of both the parent's and the child's ->mnt_root. If a mount tree like that is created it almost always is an accident or abuse of mount propagation. Realistically what most people probably mean in this scenarios is: mount --bind /opt /opt mount --make-private /opt mount --make-shared /opt This forces the allocation of a new separate peer group for the /opt mount. Aferwards a mount --bind or mount --beneath actually makes sense as the / and /opt mount belong to different peer groups. Before that it's likely just confusion about what the user wanted to achieve. * Refuse MOVE_MOUNT_BENEATH if: (1) the @mnt_from has been overmounted in between path resolution and acquiring @namespace_sem when locking @mnt_to. This avoids the proliferation of shadow mounts. (2) if @to_mnt is moved to a different mountpoint while acquiring @namespace_sem to lock @to_mnt. (3) if @to_mnt is unmounted while acquiring @namespace_sem to lock @to_mnt. (4) if the parent of the target mount propagates to the target mount at the same mountpoint. This would mean mounting @mnt_from on @mnt_to->mnt_parent and then propagating a copy @c of @mnt_from onto @mnt_to. This defeats the whole purpose of mounting @mnt_from beneath @mnt_to. (5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at the same mountpoint. If @mnt_to->mnt_parent propagates to @mnt_from this would mean propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards @mnt_from would be mounted on top of @mnt_to->mnt_parent and @mnt_to would be unmounted from @mnt->mnt_parent and remounted on @mnt_from. But since @c is already mounted on @mnt_from, @mnt_to would ultimately be remounted on top of @c. Afterwards, @mnt_from would be covered by a copy @c of @mnt_from and @c would be covered by @mnt_from itself. This defeats the whole purpose of mounting @mnt_from beneath @mnt_to. Cases (1) to (3) are required as they deal with races that would cause bugs or unexpected behavior for users. Cases (4) and (5) refuse semantical quirks that would not be a bug but would cause weird mount trees to be created. While they can already be created via other means (mount --bind /opt /opt x n) there's no reason to repeat past mistakes in new features. Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1] Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2] Link: https://github.com/flatcar/sysext-bakery Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1 Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2 Link: https://github.com/systemd/systemd/pull/26013 Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19fs: use a for loop when locking a mountChristian Brauner
Currently, lock_mount() uses a goto to retry the lookup until it succeeded in acquiring the namespace_lock() preventing the top mount from being overmounted. While that's perfectly fine we want to lookup the mountpoint on the parent of the top mount in later patches. So adapt the code to make this easier to implement. Also, the for loop is arguably a little cleaner and makes the code easier to follow. No functional changes intended. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-3-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19fs: properly document __lookup_mnt()Christian Brauner
The comment on top of __lookup_mnt() states that it finds the first mount implying that there could be multiple mounts mounted at the same dentry with the same parent. On older kernels "shadow mounts" could be created during mount propagation. So if a mount @m in the destination propagation tree already had a child mount @p mounted at @mp then any mount @n we propagated to @m at the same @mp would be appended after the preexisting mount @p in @mount_hashtable. This was a completely direct way of creating shadow mounts. That direct way is gone but there are still subtle ways to create shadow mounts. For example, when attaching a source mnt @mnt to a shared mount. The root of the source mnt @mnt might be overmounted by a mount @o after we finished path lookup but before we acquired the namespace semaphore to copy the source mount tree @mnt. After we acquired the namespace lock @mnt is copied including @o covering it. After we attach @mnt to a shared mount @dest_mnt we end up propagation it to all it's peer and slaves @d. If @d already has a mount @n mounted on top of it we tuck @mnt beneath @n. This means, we mount @mnt at @d and mount @n on @mnt. Now we have both @o and @n mounted on the same mountpoint at @mnt. Explain this in the documentation as this is pretty subtle. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-2-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19fs: add path_mounted()Christian Brauner
Add a small helper to check whether a path refers to the root of the mount instead of open-coding this everywhere. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-1-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-18Merge tag 'mm-hotfixes-stable-2023-05-18-15-52' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Eight hotfixes. Four are cc:stable, the other four are for post-6.4 issues, or aren't considered suitable for backporting" * tag 'mm-hotfixes-stable-2023-05-18-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: Cleanup Arm Display IP maintainers MAINTAINERS: repair pattern in DIALOG SEMICONDUCTOR DRIVERS nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode() mm: fix zswap writeback race condition mm: kfence: fix false positives on big endian zsmalloc: move LRU update from zs_map_object() to zs_malloc() mm: shrinkers: fix race condition on debugfs cleanup maple_tree: make maple state reusable after mas_empty_area()
2023-05-18ceph: force updating the msg pointer in non-split caseXiubo Li
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the request may still contain a list of 'split_realms', and we need to skip it anyway. Or it will be parsed as a corrupt snaptrace. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61200 Reported-by: Frank Schilder <frans@dtu.dk> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-18ceph: silence smatch warning in reconnect_caps_cb()Xiubo Li
Smatch static checker warning: fs/ceph/mds_client.c:3968 reconnect_caps_cb() warn: missing error code here? '__get_cap_for_mds()' failed. 'err' = '0' [ idryomov: Dan says that Smatch considers it intentional only if the "ret = 0;" assignment is within 4 or 5 lines of the goto. ] Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-17nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode()Ryusuke Konishi
During unmount process of nilfs2, nothing holds nilfs_root structure after nilfs2 detaches its writer in nilfs_detach_log_writer(). However, since nilfs_evict_inode() uses nilfs_root for some cleanup operations, it may cause use-after-free read if inodes are left in "garbage_list" and released by nilfs_dispose_list() at the end of nilfs_detach_log_writer(). Fix this issue by modifying nilfs_evict_inode() to only clear inode without additional metadata changes that use nilfs_root if the file system is degraded to read-only or the writer is detached. Link: https://lkml.kernel.org/r/20230509152956.8313-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+78d4495558999f55d1da@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/00000000000099e5ac05fb1c3b85@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-17SMB3: drop reference to cfile before sending oplock breakBharath SM
In cifs_oplock_break function we drop reference to a cfile at the end of function, due to which close command goes on wire after lease break acknowledgment even if file is already closed by application but we had deferred the handle close. If other client with limited file shareaccess waiting on lease break ack proceeds operation on that file as soon as first client sends ack, then we may encounter status sharing violation error because of open handle. Solution is to put reference to cfile(send close on wire if last ref) and then send oplock acknowledgment to server. Fixes: 9e31678fb403 ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.") Cc: stable@kernel.org Signed-off-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-17SMB3: Close all deferred handles of inode in case of handle lease breakBharath SM
Oplock break may occur for different file handle than the deferred handle. Check for inode deferred closes list, if it's not empty then close all the deferred handles of inode because we should not cache handles if we dont have handle lease. Eg: If openfilelist has one deferred file handle and another open file handle from app for a same file, then on a lease break we choose the first handle in openfile list. The first handle in list can be deferred handle or actual open file handle from app. In case if it is actual open handle then today, we don't close deferred handles if we lose handle lease on a file. Problem with this is, later if app decides to close the existing open handle then we still be caching deferred handles until deferred close timeout. Leaving open handle may result in sharing violation when windows client tries to open a file with limited file share access. So we should check for deferred list of inode and walk through the list of deferred files in inode and close all deferred files. Fixes: 9e31678fb403 ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.") Cc: stable@kernel.org Signed-off-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-17Merge tag 'nfsd-6.4-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - A collection of minor bug fixes * tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Remove open coding of string copy SUNRPC: Fix trace_svc_register() call site SUNRPC: always free ctxt when freeing deferred request SUNRPC: double free xprt_ctxt while still in use SUNRPC: Fix error handling in svc_setup_socket() SUNRPC: Fix encoding of accepted but unsuccessful RPC replies lockd: define nlm_port_min,max with CONFIG_SYSCTL nfsd: define exports_proc_ops with CONFIG_PROC_FS SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-17efivarfs: expose used and total sizeAnisse Astier
When writing EFI variables, one might get errors with no other message on why it fails. Being able to see how much is used by EFI variables helps analyzing such issues. Since this is not a conventional filesystem, block size is intentionally set to 1 instead of PAGE_SIZE. x86 quirks of reserved size are taken into account; so that available and free size can be different, further helping debugging space issues. With this patch, one can see the remaining space in EFI variable storage via efivarfs, like this: $ df -h /sys/firmware/efi/efivars/ Filesystem Size Used Avail Use% Mounted on efivarfs 176K 106K 66K 62% /sys/firmware/efi/efivars Signed-off-by: Anisse Astier <an.astier@criteo.com> [ardb: - rename efi_reserved_space() to efivar_reserved_space() - whitespace/coding style tweaks] Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-05-17fs: don't call posix_acl_listxattr in generic_listxattrJeff Layton
Commit f2620f166e2a caused the kernel to start emitting POSIX ACL xattrs for NFSv4 inodes, which it doesn't support. The only other user of generic_listxattr is HFS (classic) and it doesn't support POSIX ACLs either. Fixes: f2620f166e2a xattr: simplify listxattr helpers Reported-by: Ondrej Valousek <ondrej.valousek.xm@renesas.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230516124655.82283-1-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-17statfs: enforce statfs[64] structure initializationIlya Leoshkevich
s390's struct statfs and struct statfs64 contain padding, which field-by-field copying does not set. Initialize the respective structs with zeros before filling them and copying them to userspace, like it's already done for the compat versions of these structs. Found by KMSAN. [agordeev@linux.ibm.com: fixed typo in patch description] Acked-by: Heiko Carstens <hca@linux.ibm.com> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/20230504144021.808932-2-iii@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-17btrfs: use nofs when cleaning up aborted transactionsJosef Bacik
Our CI system caught a lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.3.0-rc7+ #1167 Not tainted ------------------------------------------------------ kswapd0/46 is trying to acquire lock: ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120 but task is already holding lock: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0xa5/0xe0 kmem_cache_alloc+0x31/0x2c0 alloc_extent_state+0x1d/0xd0 __clear_extent_bit+0x2e0/0x4f0 try_release_extent_mapping+0x216/0x280 btrfs_release_folio+0x2e/0x90 invalidate_inode_pages2_range+0x397/0x470 btrfs_cleanup_dirty_bgs+0x9e/0x210 btrfs_cleanup_one_transaction+0x22/0x760 btrfs_commit_transaction+0x3b7/0x13a0 create_subvol+0x59b/0x970 btrfs_mksubvol+0x435/0x4f0 __btrfs_ioctl_snap_create+0x11e/0x1b0 btrfs_ioctl_snap_create_v2+0xbf/0x140 btrfs_ioctl+0xa45/0x28f0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc -> #0 (sb_internal#2){++++}-{0:0}: __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 start_transaction+0x401/0x730 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 kthread+0xf0/0x120 ret_from_fork+0x29/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sb_internal#2); lock(fs_reclaim); lock(sb_internal#2); *** DEADLOCK *** 3 locks held by kswapd0/46: #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290 #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0 stack backtrace: CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x58/0x90 check_noncircular+0xd6/0x100 ? save_trace+0x3f/0x310 ? add_lock_to_list+0x97/0x120 __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 start_transaction+0x401/0x730 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 ? lock_release+0x134/0x270 ? __pfx_wake_bit_function+0x10/0x10 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 ? __pfx_inode_lru_isolate+0x10/0x10 ? __pfx_inode_lru_isolate+0x10/0x10 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_kswapd+0x10/0x10 kthread+0xf0/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> This happens because when we abort the transaction in the transaction commit path we call invalidate_inode_pages2_range on our block group cache inodes (if we have space cache v1) and any delalloc inodes we may have. The plain invalidate_inode_pages2_range() call passes through GFP_KERNEL, which makes sense in most cases, but not here. Wrap these two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to make sure we don't end up with the fs reclaim dependency under the transaction dependency. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17btrfs: handle memory allocation failure in btrfs_csum_one_bioJohannes Thumshirn
Since f8a53bb58ec7 ("btrfs: handle checksum generation in the storage layer") the failures of btrfs_csum_one_bio() are handled via bio_end_io(). This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in case the allocation of the ordered sums fails. This also fixes a syzkaller report, where injecting a failure into the kvzalloc() call results in a BUG_ON(). Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17btrfs: scrub: try harder to mark RAID56 block groups read-onlyQu Wenruo
Currently we allow a block group not to be marked read-only for scrub. But for RAID56 block groups if we require the block group to be read-only, then we're allowed to use cached content from scrub stripe to reduce unnecessary RAID56 reads. So this patch would: - Make btrfs_inc_block_group_ro() try harder During my tests, for cases like btrfs/061 and btrfs/064, we can hit ENOSPC from btrfs_inc_block_group_ro() calls during scrub. The reason is if we only have one single data chunk, and trying to scrub it, we won't have any space left for any newer data writes. But this check should be done by the caller, especially for scrub cases we only temporarily mark the chunk read-only. And newer data writes would always try to allocate a new data chunk when needed. - Return error for scrub if we failed to mark a RAID56 chunk read-only Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17fs: d_path: include internal.hArnd Bergmann
make W=1 warns about a missing prototype that is defined but not visible at point where simple_dname() is defined: fs/d_path.c:317:7: error: no previous prototype for 'simple_dname' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Message-Id: <20230516195444.551461-1-arnd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-17coredump: require O_WRONLY instead of O_RDWRVladimir Sementsov-Ogievskiy
The motivation for this patch has been to enable using a stricter apparmor profile to prevent programs from reading any coredump in the system. However, this became something else. The following details are based on Christian's and Linus' archeology into the history of the number "2" in the coredump handling code. To make sure we're not accidently introducing some subtle behavioral change into the coredump code we set out on a voyage into the depths of history.git to figure out why this was O_RDWR in the first place. Coredump handling was introduced over 30 years ago in commit ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)"). The original code used O_WRONLY: open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL) However, this changed in 1993 and starting with commit 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code suddenly used the constant "2": open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL) This was curious as in the same commit the kernel switched from constants to proper defines in other places such as KERNEL_DS and USER_DS and O_RDWR did already exist. So why was "2" used? It turns out that open_namei() - an early version of what later turned into filp_open() - didn't accept O_RDWR. A semantic quirk of the open() uapi is the definition of the O_RDONLY flag. It would seem natural to define: #define O_RDWR (O_RDONLY | O_WRONLY) but that isn't possible because: #define O_RDONLY 0 This makes O_RDONLY effectively meaningless when passed to the kernel. In other words, there has never been a way - until O_PATH at least - to open a file without any permission; O_RDONLY was always implied on the uapi side while the kernel does in fact allow opening files without permissions. The trouble comes when trying to map the uapi flags onto the corresponding file mode flags FMODE_{READ,WRITE}. This mapping still happens today and is causing issues to this day (We ran into this during additions for openat2() for example.). So the special value "3" was used to indicate that the file was opened for special access: f->f_flags = flag = flags; f->f_mode = (flag+1) & O_ACCMODE; if (f->f_mode) flag++; This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The special access then required read-write permissions and 0 was used to access symlinks. But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an argument. So the coredump handling introduced in ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on accident. So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which was correct. The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2" value was passed. Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code was changed to: - dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600); ... + file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600); At this point the raw "2" should have become O_WRONLY again as filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}. Another 17 years later, the code was changed again cementing the mistake and making it almost impossible to detect when commit 378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories") replaced the raw "2" with O_RDWR. And now, here we are with this patch that sent us on a quest to answer the big questions in life such as "Why are coredump files opened with O_RDWR?" and "Is it safe to just use O_WRONLY?". So with this commit we're reintroducing O_WRONLY again and bringing this code back to its original state when it was first introduced in commit ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Message-Id: <20230420120409.602576-1-vsementsov@yandex-team.ru> [brauner@kernel.org: completely rewritten commit message] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-16coredump, vmcore: Set p_align to 4 for PT_NOTEFangrui Song
Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program header as an array of 4-byte entries or 8-byte entries. Currently, there are workarounds[1] in place for Linux to treat p_align==0 as 4. However, it would be more appropriate to set the correct alignment so that tools do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as well. [1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6 [2]: https://reviews.llvm.org/D150022 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com
2023-05-16ksmbd: smb2: Allow messages padded to 8byte boundaryGustav Johansson
clc length is now accepted to <= 8 less than length, rather than < 8. Solve issues on some of Axis's smb clients which send messages where clc length is 8 bytes less than length. The specific client was running kernel 4.19.217 with smb dialect 3.0.2 on armv7l. Cc: stable@vger.kernel.org Signed-off-by: Gustav Johansson <gustajo@axis.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16ksmbd: allocate one more byte for implied bcc[0]Chih-Yen Chang
ksmbd_smb2_check_message allows client to return one byte more, so we need to allocate additional memory in ksmbd_conn_handler_loop to avoid out-of-bound access. Cc: stable@vger.kernel.org Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16ksmbd: fix wrong UserName check in session_userChih-Yen Chang
The offset of UserName is related to the address of security buffer. To ensure the validaty of UserName, we need to compare name_off + name_len with secbuf_len instead of auth_msg_len. [ 27.096243] ================================================================== [ 27.096890] BUG: KASAN: slab-out-of-bounds in smb_strndup_from_utf16+0x188/0x350 [ 27.097609] Read of size 2 at addr ffff888005e3b542 by task kworker/0:0/7 ... [ 27.099950] Call Trace: [ 27.100194] <TASK> [ 27.100397] dump_stack_lvl+0x33/0x50 [ 27.100752] print_report+0xcc/0x620 [ 27.102305] kasan_report+0xae/0xe0 [ 27.103072] kasan_check_range+0x35/0x1b0 [ 27.103757] smb_strndup_from_utf16+0x188/0x350 [ 27.105474] smb2_sess_setup+0xaf8/0x19c0 [ 27.107935] handle_ksmbd_work+0x274/0x810 [ 27.108315] process_one_work+0x419/0x760 [ 27.108689] worker_thread+0x2a2/0x6f0 [ 27.109385] kthread+0x160/0x190 [ 27.110129] ret_from_fork+0x1f/0x30 [ 27.110454] </TASK> Cc: stable@vger.kernel.org Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16ksmbd: fix global-out-of-bounds in smb2_find_context_valsChih-Yen Chang
Add tag_len argument in smb2_find_context_vals() to avoid out-of-bound read when create_context's name_len is larger than tag length. [ 7.995411] ================================================================== [ 7.995866] BUG: KASAN: global-out-of-bounds in memcmp+0x83/0xa0 [ 7.996248] Read of size 8 at addr ffffffff8258d940 by task kworker/0:0/7 ... [ 7.998191] Call Trace: [ 7.998358] <TASK> [ 7.998503] dump_stack_lvl+0x33/0x50 [ 7.998743] print_report+0xcc/0x620 [ 7.999458] kasan_report+0xae/0xe0 [ 7.999895] kasan_check_range+0x35/0x1b0 [ 8.000152] memcmp+0x83/0xa0 [ 8.000347] smb2_find_context_vals+0xf7/0x1e0 [ 8.000635] smb2_open+0x1df2/0x43a0 [ 8.006398] handle_ksmbd_work+0x274/0x810 [ 8.006666] process_one_work+0x419/0x760 [ 8.006922] worker_thread+0x2a2/0x6f0 [ 8.007429] kthread+0x160/0x190 [ 8.007946] ret_from_fork+0x1f/0x30 [ 8.008181] </TASK> Cc: stable@vger.kernel.org Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16ext2: Add direct-io trace pointsRitesh Harjani (IBM)
This patch adds the trace point to ext2 direct-io apis in fs/ext2/file.c Here is how the output looks like a.out-467865 [006] 6758.170968: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0 a.out-467865 [006] 6758.171061: ext2_dio_write_end: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT|WRITE aio 1 ret -529 kworker/3:153-444162 [003] 6758.171252: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0 a.out-468222 [001] 6761.628924: ext2_dio_read_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 1 ret 0 a.out-468222 [001] 6761.629063: ext2_dio_read_end: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT aio 1 ret -529 a.out-468428 [005] 6763.937454: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0 a.out-468428 [005] 6763.937829: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0 a.out-468428 [005] 6763.937847: ext2_dio_write_end: dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096 a.out-468609 [000] 6765.702878: ext2_dio_read_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0 a.out-468609 [000] 6765.703243: ext2_dio_read_end: dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096 Reported-and-tested-by: Disha Goel <disgoel@linux.ibm.com> [Need to add CFLAGS_trace for fixing unable to find trace file problem] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <b8b0897fa2b273a448d7b4ba7317357ac73c08bc.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext2: Move direct-io to use iomapRitesh Harjani (IBM)
This patch converts ext2 direct-io path to iomap interface. - This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK from ext2_iomap_begin(), in case if the write is done on a hole. - This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of a partial write or if any error is detected in ext2_iomap_end(). We try to return -ENOTBLK in such cases. - For any unaligned or extending DIO writes, we pass IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes. - For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because otherwise with dsync writes on devices that support FUA, generic_write_sync won't be called and we might miss inode metadata updates. - Since ext2 already now uses _nolock vartiant of sync write. Hence there is no inode lock problem with iomap in this patch. - ext2_iomap_ops are now being shared by DIO, DAX & fiemap path Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <610b672a52f2a7ff6dc550fd14d0f995806232a5.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext2: Use generic_buffers_fsync() implementationRitesh Harjani (IBM)
Next patch converts ext2 to use iomap interface for DIO. iomap layer can call generic_write_sync() -> ext2_fsync() from iomap_dio_complete while still holding the inode_lock(). Now writeback from other paths doesn't need inode_lock(). It seems there is also no need of an inode_lock() for sync_mapping_buffers(). It uses it's own mapping->private_lock for it's buffer list handling. Hence this patch is in preparation to move ext2 to iomap. This uses generic_buffers_fsync() which does not take any inode_lock() in ext2_fsync(). Tested-by: Disha Goel <disgoel@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <76d206a464574ff91db25bc9e43479b51ca7e307.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext4: Use generic_buffers_fsync_noflush() implementationRitesh Harjani (IBM)
ext4 when got converted to iomap for dio, it copied __generic_file_fsync implementation to avoid taking inode_lock in order to avoid any deadlock (since iomap takes an inode_lock while calling generic_write_sync()). The previous patch already added generic_buffers_fsync*() which does not take any inode_lock(). Hence kill the redundant code and use generic_buffers_fsync_noflush() function instead. Tested-by: Disha Goel <disgoel@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <b43d4bb4403061ed86510c9587673e30a461ba14.1682069716.git.ritesh.list@gmail.com>
2023-05-16fs/buffer.c: Add generic_buffers_fsync*() implementationRitesh Harjani (IBM)
Some of the higher layers like iomap takes inode_lock() when calling generic_write_sync(). Also writeback already happens from other paths without inode lock, so it's difficult to say that we really need sync_mapping_buffers() to take any inode locking here. Having said that, let's add generic_buffers_fsync/_noflush() implementation in buffer.c with no inode_lock/unlock() for now so that filesystems like ext2 and ext4's nojournal mode can use it. Ext4 when got converted to iomap for direct-io already copied it's own variant of __generic_file_fsync() without lock. This patch adds generic_buffers_fsync() & generic_buffers_fsync_noflush() implementations for use in filesystems like ext2 & ext4 respectively. Later we can review other filesystems as well to see if we can make generic_buffers_fsync/_noflush() which does not take any inode_lock() as the default path. Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <d573408ac8408627d23a3d2d166e748c172c4c9e.1682069716.git.ritesh.list@gmail.com>
2023-05-16ext2/dax: Fix ext2_setsize when len is page alignedRitesh Harjani (IBM)
PAGE_ALIGN(x) macro gives the next highest value which is multiple of pagesize. But if x is already page aligned then it simply returns x. So, if x passed is 0 in dax_zero_range() function, that means the length gets passed as 0 to ->iomap_begin(). In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on here in ext2_get_blocks(). BUG_ON(maxblocks == 0); Instead we should be calling dax_truncate_page() here which takes care of it. i.e. it only calls dax_zero_range if the offset is not page/block aligned. This can be easily triggered with following on fsdax mounted pmem device. dd if=/dev/zero of=file count=1 bs=512 truncate -s 0 file [79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff) [93.793207] ------------[ cut here ]------------ [93.795102] kernel BUG at fs/ext2/inode.c:637! [93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI [93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139 [93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610 <...> [93.835298] Call Trace: [93.836253] <TASK> [93.837103] ? lock_acquire+0xf8/0x110 [93.838479] ? d_lookup+0x69/0xd0 [93.839779] ext2_iomap_begin+0xa7/0x1c0 [93.841154] iomap_iter+0xc7/0x150 [93.842425] dax_zero_range+0x6e/0xa0 [93.843813] ext2_setsize+0x176/0x1b0 [93.845164] ext2_setattr+0x151/0x200 [93.846467] notify_change+0x341/0x4e0 [93.847805] ? lock_acquire+0xf8/0x110 [93.849143] ? do_truncate+0x74/0xe0 [93.850452] ? do_truncate+0x84/0xe0 [93.851739] do_truncate+0x84/0xe0 [93.852974] do_sys_ftruncate+0x2b4/0x2f0 [93.854404] do_syscall_64+0x3f/0x90 [93.855789] entry_SYSCALL_64_after_hwframe+0x72/0xdc CC: stable@vger.kernel.org Fixes: 2aa3048e03d3 ("iomap: switch iomap_zero_range to use iomap_iter") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <046a58317f29d9603d1068b2bbae47c2332c17ae.1682069716.git.ritesh.list@gmail.com>
2023-05-15NFSD: Remove open coding of string copyAzeem Shaikh
Instead of open coding a __dynamic_array(), use the __string() and __assign_str() helper macros that exist for this kind of use case. Part of an effort to remove deprecated strlcpy() [1] completely from the kernel[2]. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Fixes: 3c92fba557c6 ("NFSD: Enhance the nfsd_cb_setup tracepoint") Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-15fs: fix incorrect fmode_t castsMin-Hua Chen
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes the following sparce warnings: fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to integer fs/overlayfs/file.c:128:13: sparse: warning: restricted fmode_t degrades to integer fs/open.c:1159:21: sparse: warning: restricted fmode_t degrades to integer Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Message-Id: <20230502232210.119063-1-minhuadotchen@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15jffs2: reduce stack usage in jffs2_build_xattr_subsystem()Fabian Frederick
Use kcalloc() for allocation/flush of 128 pointers table to reduce stack usage. Function now returns -ENOMEM or 0 on success. stackusage Before: ./fs/jffs2/xattr.c:775 jffs2_build_xattr_subsystem 1208 dynamic,bounded After: ./fs/jffs2/xattr.c:775 jffs2_build_xattr_subsystem 192 dynamic,bounded Also update definition when CONFIG_JFFS2_FS_XATTR is not enabled Tested with an MTD mount point and some user set/getfattr. Many current target on OpenWRT also suffer from a compilation warning (that become an error with CONFIG_WERROR) with the following output: fs/jffs2/xattr.c: In function 'jffs2_build_xattr_subsystem': fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] 887 | } | ^ Using dynamic allocation fix this compilation warning. Fixes: c9f700f840bd ("[JFFS2][XATTR] using 'delete marker' for xdatum/xref deletion") Reported-by: Tim Gardner <tim.gardner@canonical.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Ron Economos <re@w6rz.net> Reported-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Cc: stable@vger.kernel.org Message-Id: <20230506045612.16616-1-ansuelsmth@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15fs/open.c: Fix W=1 kernel doc warningsAnuradha Weeraman
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid': warning: Function parameter or member 'attr' not described - Fix warning by removing kernel-doc for these as they are static inline functions and not required to be exposed via kernel-doc. fs/open.c: warning: Excess function parameter 'opened' description in 'finish_open' warning: Excess function parameter 'cred' description in 'vfs_open' - Fix by removing the parameters from the kernel-doc as they are no longer required by the function. Signed-off-by: Anuradha Weeraman <anuradha@debian.org> Message-Id: <20230506182928.384105-1-anuradha@debian.org> Signed-off-by: Christian Brauner <brauner@kernel.org>