Age | Commit message (Collapse) | Author |
|
Remove redundant and unnecessary code.
ep_alloc uses kzalloc to create struct eventpoll, so there is no need to
set fields to defaults of 0. This was accidentally introduced in commit
85455c795c07 ("eventpoll: support busy poll per epoll instance") and
expanded on in follow-up commits.
Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/20240807105231.179158-1-jdamato@fastly.com
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org
The comment on EXPORT_SYMBOL(close_fd) was added in commit 2ca2a09d6215
("fs: add ksys_close() wrapper; remove in-kernel calls to sys_close()"),
before commit 8760c909f54a ("file: Rename __close_fd to close_fd and remove
the files parameter") gave the function its current name, however commit
1572bfdf21d4 ("file: Replace ksys_close with close_fd") removes the
referenced caller entirely, obsoleting this comment.
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/r/20240803025455.239276-1-jsavitz@redhat.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
replace 'permanetly' with 'permanently' in the comment &
replace 'propogated' with 'propagated' in the comment
Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Link: https://lore.kernel.org/r/20240806034710.2807788-1-liyuesong@vivo.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Both i_mode and noexec checks wrapped in WARN_ON stem from an artifact
of the previous implementation. They used to legitimately check for the
condition, but that got moved up in two commits:
633fb6ac3980 ("exec: move S_ISREG() check earlier")
0fd338b2d2cd ("exec: move path_noexec() check earlier")
Instead of being removed said checks are WARN_ON'ed instead, which
has some debug value.
However, the spurious path_noexec check is racy, resulting in
unwarranted warnings should someone race with setting the noexec flag.
One can note there is more to perm-checking whether execve is allowed
and none of the conditions are guaranteed to still hold after they were
tested for.
Additionally this does not validate whether the code path did any perm
checking to begin with -- it will pass if the inode happens to be
regular.
Keep the redundant path_noexec() check even though it's mindless
nonsense checking for guarantee that isn't given so drop the WARN.
Reword the commentary and do small tidy ups while here.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240805131721.765484-1-mjguzik@gmail.com
[brauner: keep redundant path_noexec() check]
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The lookup_fast helper in fs/namei.c has some subtlety in how dentries
are returned. Document them.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240802-openfast-v1-2-a1cff2a33063@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This function no longer exists.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240802-openfast-v1-1-a1cff2a33063@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Commit 2eea9ce4310d ("mounts: keep list of mounts in an rbtree")
removed the implementation but leave declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240803115000.589872-1-yuehaibing@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
It is currently possible to mount on top of various ephemeral entities
in procfs. This specifically includes magic links. To recap, magic links
are links of the form /proc/<pid>/fd/<nr>. They serve as references to
a target file and during path lookup they cause a jump to the target
path. Such magic links disappear if the corresponding file descriptor is
closed.
Currently it is possible to overmount such magic links:
int fd = open("/mnt/foo", O_RDONLY);
sprintf(path, "/proc/%d/fd/%d", getpid(), fd);
int fd2 = openat(AT_FDCWD, path, O_PATH | O_NOFOLLOW);
mount("/mnt/bar", path, "", MS_BIND, 0);
Arguably, this is nonsensical and is mostly interesting for an attacker
that wants to somehow trick a process into e.g., reopening something
that they didn't intend to reopen or to hide a malicious file
descriptor.
But also it risks leaking mounts for long-running processes. When
overmounting a magic link like above, the mount will not be detached
when the file descriptor is closed. Only the target mountpoint will
disappear. Which has the consequence of making it impossible to unmount
that mount afterwards. So the mount will stick around until the process
exits and the /proc/<pid>/ directory is cleaned up during
proc_flush_pid() when the dentries are pruned and invalidated.
That in turn means it's possible for a program to accidentally leak
mounts and it's also possible to make a task leak mounts without it's
knowledge if the attacker just keeps overmounting things under
/proc/<pid>/fd/<nr>.
I think it's wrong to try and fix this by us starting to play games with
close() or somewhere else to undo these mounts when the file descriptor
is closed. The fact that we allow overmounting of such magic links is
simply a bug and one that we need to fix.
Similar things can be said about entries under fdinfo/ and map_files/ so
those are restricted as well.
I have a further more aggressive patch that gets out the big hammer and
makes everything under /proc/<pid>/*, as well as immediate symlinks such
as /proc/self, /proc/thread-self, /proc/mounts, /proc/net that point
into /proc/<pid>/ not overmountable. Imho, all of this should be blocked
if we can get away with it. It's only useful to hide exploits such as in [1].
And again, overmounting of any global procfs files remains unaffected
and is an existing and supported use-case.
Link: https://righteousit.com/2024/07/24/hiding-linux-processes-with-bind-mounts [1]
// Note that repro uses the traditional way of just mounting over
// /proc/<pid>/fd/<nr>. This could also all be achieved just based on
// file descriptors using move_mount(). So /proc/<pid>/fd/<nr> isn't the
// only entry vector here. It's also possible to e.g., mount directly
// onto /proc/<pid>/map_files/* without going over /proc/<pid>/fd/<nr>.
int main(int argc, char *argv[])
{
char path[PATH_MAX];
creat("/mnt/foo", 0777);
creat("/mnt/bar", 0777);
/*
* For illustration use a bunch of file descriptors in the upper
* range that are unused.
*/
for (int i = 10000; i >= 256; i--) {
printf("I'm: /proc/%d/\n", getpid());
int fd2 = open("/mnt/foo", O_RDONLY);
if (fd2 < 0) {
printf("%m - Failed to open\n");
_exit(1);
}
int newfd = dup2(fd2, i);
if (newfd < 0) {
printf("%m - Failed to dup\n");
_exit(1);
}
close(fd2);
sprintf(path, "/proc/%d/fd/%d", getpid(), newfd);
int fd = openat(AT_FDCWD, path, O_PATH | O_NOFOLLOW);
if (fd < 0) {
printf("%m - Failed to open\n");
_exit(3);
}
sprintf(path, "/proc/%d/fd/%d", getpid(), fd);
printf("Mounting on top of %s\n", path);
if (mount("/mnt/bar", path, "", MS_BIND, 0)) {
printf("%m - Failed to mount\n");
_exit(4);
}
close(newfd);
close(fd2);
}
/*
* Give some time to look at things. The mounts now linger until
* the process exits.
*/
sleep(10000);
_exit(0);
}
* patches from https://lore.kernel.org/r/20240806-work-procfs-v1-0-fb04e1d09f0c@kernel.org:
proc: block mounting on top of /proc/<pid>/fdinfo/*
proc: block mounting on top of /proc/<pid>/fd/*
proc: block mounting on top of /proc/<pid>/map_files/*
proc: add proc_splice_unmountable()
proc: proc_readfdinfo() -> proc_fdinfo_iterate()
proc: proc_readfd() -> proc_fd_iterate()
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-0-fb04e1d09f0c@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Entries under /proc/<pid>/fdinfo/* are ephemeral and may go away before
the process dies. As such allowing them to be used as mount points
creates the ability to leak mounts that linger until the process dies
with no ability to unmount them until then. Don't allow using them as
mountpoints.
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-6-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Entries under /proc/<pid>/fd/* are ephemeral and may go away before the
process dies. As such allowing them to be used as mount points creates
the ability to leak mounts that linger until the process dies with no
ability to unmount them until then. Don't allow using them as
mountpoints.
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-5-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Entries under /proc/<pid>/map_files/* are ephemeral and may go away
before the process dies. As such allowing them to be used as mount
points creates the ability to leak mounts that linger until the process
dies with no ability to unmount them until then. Don't allow using them
as mountpoints.
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-4-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a tiny procfs helper to splice a dentry that cannot be mounted upon.
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-3-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Give the method to iterate through the fdinfo directory a better name.
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-2-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Give the method to iterate through the fd directory a better name.
Link: https://lore.kernel.org/r/20240806-work-procfs-v1-1-fb04e1d09f0c@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This adds a Kconfig option and boot param to allow removing
the FOLL_FORCE flag from /proc/pid/mem write calls because
it can be abused.
The traditional forcing behavior is kept as default because
it can break GDB and some other use cases.
Previously we tried a more sophisticated approach allowing
distributions to fine-tune /proc/pid/mem behavior, however
that got NAK-ed by Linus [1], who prefers this simpler
approach with semantics also easier to understand for users.
Link: https://lore.kernel.org/lkml/CAHk-=wiGWLChxYmUA5HrT5aopZrB7_2VTa0NLZcxORgkUe5tEQ@mail.gmail.com/ [1]
Cc: Doug Anderson <dianders@chromium.org>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
Link: https://lore.kernel.org/r/20240802080225.89408-1-adrian.ratiu@collabora.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Unlock before returning an error code if this allocation fails.
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Steve French reported null pointer dereference error from sha256 lib.
cifs.ko can send session setup requests on reused connection.
If reused connection is used for binding session, conn->binding can
still remain true and generate_preauth_hash() will not set
sess->Preauth_HashValue and it will be NULL.
It is used as a material to create an encryption key in
ksmbd_gen_smb311_encryptionkey. ->Preauth_HashValue cause null pointer
dereference error from crypto_shash_update().
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 8 PID: 429254 Comm: kworker/8:39
Hardware name: LENOVO 20MAS08500/20MAS08500, BIOS N2CET69W (1.52 )
Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
RIP: 0010:lib_sha256_base_do_update.isra.0+0x11e/0x1d0 [sha256_ssse3]
<TASK>
? show_regs+0x6d/0x80
? __die+0x24/0x80
? page_fault_oops+0x99/0x1b0
? do_user_addr_fault+0x2ee/0x6b0
? exc_page_fault+0x83/0x1b0
? asm_exc_page_fault+0x27/0x30
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
? lib_sha256_base_do_update.isra.0+0x11e/0x1d0 [sha256_ssse3]
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
? __pfx_sha256_transform_rorx+0x10/0x10 [sha256_ssse3]
_sha256_update+0x77/0xa0 [sha256_ssse3]
sha256_avx2_update+0x15/0x30 [sha256_ssse3]
crypto_shash_update+0x1e/0x40
hmac_update+0x12/0x20
crypto_shash_update+0x1e/0x40
generate_key+0x234/0x380 [ksmbd]
generate_smb3encryptionkey+0x40/0x1c0 [ksmbd]
ksmbd_gen_smb311_encryptionkey+0x72/0xa0 [ksmbd]
ntlm_authenticate.isra.0+0x423/0x5d0 [ksmbd]
smb2_sess_setup+0x952/0xaa0 [ksmbd]
__process_request+0xa3/0x1d0 [ksmbd]
__handle_ksmbd_work+0x1c4/0x2f0 [ksmbd]
handle_ksmbd_work+0x2d/0xa0 [ksmbd]
process_one_work+0x16c/0x350
worker_thread+0x306/0x440
? __pfx_worker_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Fixes: f5a544e3bab7 ("ksmbd: add support for SMB3 multichannel")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add the __counted_by compiler attribute to the flexible array member
entries to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fix from Kees Cook:
- binfmt_elf_fdpic: fix AUXV size with ELF_HWCAP2 (Max Filippov)
* tag 'execve-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined
|
|
The runtime constant feature removes all the users of these variables,
allowing the compiler to optimize them away. It's quite difficult to
extract their values from the kernel text, and the memory saved by
removing them is tiny, and it was never the point of this optimization.
Since the dentry_hashtable is a core data structure, it's valuable for
debugging tools to be able to read it easily. For instance, scripts
built on drgn, like the dentrycache script[1], rely on it to be able to
perform diagnostics on the contents of the dcache. Annotate it as used,
so the compiler doesn't discard it.
Link: https://github.com/oracle-samples/drgn-tools/blob/3afc56146f54d09dfd1f6d3c1b7436eda7e638be/drgn_tools/dentry.py#L325-L355 [1]
Fixes: e3c92e81711d ("runtime constants: add x86 architecture support")
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Switch to the new kmem_cache_create_rcu() helper which allows us to use
a custom free pointer offset avoiding the need to have an external free
pointer which would grow struct file behind our backs.
Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-3-5460bc1f09f6@kernel.org
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.
Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.
To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
fuse_mount_list doesn't exist, use fuse_conn_list.
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
I've been timing various fuse operations and it's quite annoying to do
with kprobes. Add two tracepoints for sending and ending fuse requests
to make it easier to debug and time various operations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_writepage_locked()
This change refactors the shared logic in fuse_writepages_fill() and
fuse_writepages_locked() into two separate helper functions,
fuse_writepage_args_page_fill() and fuse_writepage_args_setup().
No functional changes added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Before this change, wpa->ia.ff is initialized with an acquired reference
on the fuse file right before it submits the writeback request. If there
are auxiliary writebacks, then the initialization and reference
acquisition needs to also be set before we submit the auxiliary writeback
request.
To make the logic simpler and to pave the way for a subsequent
refactoring of fuse_writepages_fill() and fuse_writepage_locked(), this
change initializes and acquires wpa->ia.ff when the wpa is allocated.
No functional changes added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
To pave the way for refactoring out the shared logic in
fuse_writepages_fill() and fuse_writepage_locked(), this change converts
the temporary page in fuse_writepages_fill() to use the folio API.
This is similar to the change in commit e0887e095a80 ("fuse: Convert
fuse_writepage_locked to take a folio"), which converted the tmp page in
fuse_writepage_locked() to use the folio API.
inc_node_page_state() is intentionally preserved here instead of
converting to node_stat_add_folio() since it is updating the stat of the
underlying page and to better maintain API symmetry with
dec_node_page_stat() in fuse_writepage_finish_stat().
No functional changes added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
callback
Prior to this change, data->ff is checked and if not initialized then
initialized in the fuse_writepages_fill() callback, which gets called
for every dirty page in the address space mapping.
This logic is better placed in the main fuse_writepages() caller where
data.ff is initialized before walking the dirty pages.
No functional changes added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move the logic for updating the bdi and page stats for a finished
writeback into a separate helper function, where it can be called from
both fuse_writepage_finish() and fuse_writepage_add() (in the case
where there is already an auxiliary write request for the page).
No functional changes added.
Suggested by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Drop the unused "struct fuse_mount *fm" arg in
fuse_writepage_finish().
No functional changes added.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In some cases, the fi->writepages may be empty. And there is no need
to check fi->writepages with spin_lock, which may have an impact on
performance due to lock contention. For example, in scenarios where
multiple readers read the same file without any writers, or where
the page cache is not enabled.
Also remove the outdated comment since commit 6b2fb79963fb ("fuse:
optimize writepages search") has optimize the situation by replacing
list with rb-tree.
Signed-off-by: yangyun <yangyun50@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Virtiofs has its own queuing mechanism, but still requests are first queued
on fiq->pending to be immediately dequeued and queued onto the virtio
queue.
The queuing on fiq->pending is unnecessary and might even have some
performance impact due to being a contention point.
Forget requests are handled similarly.
Move the queuing of requests and forgets into the fiq->ops->*.
fuse_iqueue_ops are renamed to reflect the new semantics.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Fixed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Tested-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Current design and handling of passthrough is without fuse
caching and with that FUSE_WRITEBACK_CACHE is conflicting.
Fixes: 7dc4e97a4f9a ("fuse: introduce FUSE_PASSTHROUGH capability")
Cc: stable@kernel.org # v6.9
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Since commit 5679897eb104 ("vfs: make sync_filesystem return errors from
->sync_fs"), the return value from sync_fs callback can be seen in
sync_filesystem(). Thus the errseq_set opreation can be removed here.
Depends-on: commit 5679897eb104 ("vfs: make sync_filesystem return errors from ->sync_fs")
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
|
|
Under certain conditions, the range to be cleared by FALLOC_FL_ZERO_RANGE
may only be buffered locally and not yet have been flushed to the server.
For example:
xfs_io -f -t -c "pwrite -S 0x41 0 4k" \
-c "pwrite -S 0x42 4k 4k" \
-c "fzero 0 4k" \
-c "pread -v 0 8k" /xfstest.test/foo
will write two 4KiB blocks of data, which get buffered in the pagecache,
and then fallocate() is used to clear the first 4KiB block on the server -
but we don't flush the data first, which means the EOF position on the
server is wrong, and so the FSCTL_SET_ZERO_DATA RPC fails (and xfs_io
ignores the error), but then when we try to read it, we see the old data.
Fix this by preflushing any part of the target region that above the
server's idea of the EOF position to force the server to update its EOF
position.
Note, however, that we don't want to simply expand the file by moving the
EOF before doing the FSCTL_SET_ZERO_DATA[*] because someone else might see
the zeroed region or if the RPC fails we then have to try to clean it up or
risk getting corruption.
[*] And we have to move the EOF first otherwise FSCTL_SET_ZERO_DATA won't
do what we want.
This fixes the generic/008 xfstest.
[!] Note: A better way to do this might be to split the operation into two
parts: we only do FSCTL_SET_ZERO_DATA for the part of the range below the
server's EOF and then, if that worked, invalidate the buffered pages for the
part above the range.
Fixes: 6b69040247e1 ("cifs/smb3: Fix data inconsistent when zero file range")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <stfrench@microsoft.com>
cc: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
cc: Pavel Shilovsky <pshilov@microsoft.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a number of crashers
- Update email address for an NFSD reviewer
* tag 'nfsd-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
fs/nfsd: fix update of inode attrs in CB_GETATTR
nfsd: fix potential UAF in nfsd4_cb_getattr_release
nfsd: hold reference to delegation when updating it for cb_getattr
MAINTAINERS: Update Olga Kornievskaia's email address
nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open
nfsd: ensure that nfsd4_fattr_args.context is zeroed out
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix use-after-free when submitting bios for read, after an error and
partially submitted bio the original one is freed while it can be
still be accessed again
- fix fstests case btrfs/301, with enabled quotas wait for delayed
iputs when flushing delalloc
- fix periodic block group reclaim, an unitialized value can be
returned if there are no block groups to reclaim
- fix build warning (-Wmaybe-uninitialized)
* tag 'for-6.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix uninitialized return value from btrfs_reclaim_sweep()
btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk()
btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap()
btrfs: run delayed iputs when flushing delalloc
|
|
In the case where the aux writeback list is dropped (e.g. the pages
have been truncated or the connection is broken), the stats for
its pages and backing device info need to be updated as well.
Fixes: e2653bd53a98 ("fuse: fix leaked aux requests")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: <stable@vger.kernel.org> # v5.1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Originally when a stolen page was inserted into fuse's page cache by
fuse_try_move_page(), it would be marked uptodate. Then
fuse_readpages_end() would call SetPageUptodate() again on the already
uptodate page.
Commit 413e8f014c8b ("fuse: Convert fuse_readpages_end() to use
folio_end_read()") changed that by replacing the SetPageUptodate() +
unlock_page() combination with folio_end_read(), which does mostly the
same, except it sets the uptodate flag with an xor operation, which in the
above scenario resulted in the uptodate flag being cleared, which in turn
resulted in EIO being returned on the read.
Fix by clearing PG_uptodate instead of setting it in fuse_try_move_page(),
conforming to the expectation of folio_end_read().
Reported-by: Jürg Billeter <j@bitron.ch>
Debugged-by: Matthew Wilcox <willy@infradead.org>
Fixes: 413e8f014c8b ("fuse: Convert fuse_readpages_end() to use folio_end_read()")
Cc: <stable@vger.kernel.org> # v6.10
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The memory of struct fuse_file is allocated but not freed
when get_create_ext return error.
Fixes: 3e2b6fdbdc9a ("fuse: send security context of inode on file")
Cc: stable@vger.kernel.org # v5.17
Signed-off-by: yangyun <yangyun50@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
resending
There is a race condition where inflight requests will not be aborted if
they are in the middle of being re-sent when the connection is aborted.
If fuse_resend has already moved all the requests in the fpq->processing
lists to its private queue ("to_queue") and then the connection starts
and finishes aborting, these requests will be added to the pending queue
and remain on it indefinitely.
Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: <stable@vger.kernel.org> # v6.9
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The existing code uses min_t(ssize_t, outarg.size, XATTR_LIST_MAX) when
parsing the FUSE daemon's response to a zero-length getxattr/listxattr
request.
On 32-bit kernels, where ssize_t and outarg.size are the same size, this is
wrong: The min_t() will pass through any size values that are negative when
interpreted as signed.
fuse_listxattr() will then return this userspace-supplied negative value,
which callers will treat as an error value.
This kind of bug pattern can lead to fairly bad security bugs because of
how error codes are used in the Linux kernel. If a caller were to convert
the numeric error into an error pointer, like so:
struct foo *func(...) {
int len = fuse_getxattr(..., NULL, 0);
if (len < 0)
return ERR_PTR(len);
...
}
then it would end up returning this userspace-supplied negative value cast
to a pointer - but the caller of this function wouldn't recognize it as an
error pointer (IS_ERR_VALUE() only detects values in the narrow range in
which legitimate errno values are), and so it would just be treated as a
kernel pointer.
I think there is at least one theoretical codepath where this could happen,
but that path would involve virtio-fs with submounts plus some weird
SELinux configuration, so I think it's probably not a concern in practice.
Cc: stable@vger.kernel.org # v4.9
Fixes: 63401ccdb2ca ("fuse: limit xattr returned size")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Refactor xfs_file_fallocate into separate helpers for each mode,
two factors for i_size handling and a single switch statement over the
supported modes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-7-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move the xfs_is_always_cow_inode check from the caller into
xfs_alloc_file_space to prepare for refactoring of xfs_file_fallocate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-6-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Call xfs_flush_unmap_range from xfs_free_file_space so that
xfs_file_fallocate doesn't have to predict which mode will call it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-5-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The fallocate system call takes a mode argument, but that argument
contains a wild mix of exclusive modes and an optional flags.
Replace FALLOC_FL_SUPPORTED_MASK with FALLOC_FL_MODE_MASK, which excludes
the optional flag bit, so that we can use switch statement on the value
to easily enumerate the cases while getting the check for duplicate modes
for free.
To make this (and in the future the file system implementations) more
readable also add a symbolic name for the 0 mode used to allocate blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240827065123.1762168-4-hch@lst.de
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix cifs_file_copychunk_range() to flush the destination region before
invalidating it to avoid potential loss of data should the copy fail, in
whole or in part, in some way.
Fixes: 7b2404a886f8 ("cifs: Fix flushing, invalidation and file size with copy_file_range()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <stfrench@microsoft.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Short DIO reads, particularly in relation to cifs, are not being handled
correctly by cifs and netfslib. This can be tested by doing a DIO read of
a file where the size of read is larger than the size of the file. When it
crosses the EOF, it gets a short read and this gets retried, and in the
case of cifs, the retry read fails, with the failure being translated to
ENODATA.
Fix this by the following means:
(1) Add a flag, NETFS_SREQ_HIT_EOF, for the filesystem to set when it
detects that the read did hit the EOF.
(2) Make the netfslib read assessment stop processing subrequests when it
encounters one with that flag set.
(3) Return rreq->transferred, the accumulated contiguous amount read to
that point, to userspace for a DIO read.
(4) Make cifs set the flag and clear the error if the read RPC returned
ENODATA.
(5) Make cifs set the flag and clear the error if a short read occurred
without error and the read-to file position is now at the remote inode
size.
Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When netfslib asks cifs to issue a read operation, it prefaces this with a
call to ->clamp_length() which cifs uses to negotiate credits, providing
receive capacity on the server; however, in the event that a read op needs
reissuing, netfslib doesn't call ->clamp_length() again as that could
shorten the subrequest, leaving a gap.
This causes the retried read to be done with zero credits which causes the
server to reject it with STATUS_INVALID_PARAMETER. This is a problem for a
DIO read that is requested that would go over the EOF. The short read will
be retried, causing EINVAL to be returned to the user when it fails.
Fix this by making cifs_req_issue_read() negotiate new credits if retrying
(NETFS_SREQ_RETRYING now gets set in the read side as well as the write
side in this instance).
This isn't sufficient, however: the new credits might not be sufficient to
complete the remainder of the read, so also add an additional field,
rreq->actual_len, that holds the actual size of the op we want to perform
without having to alter subreq->len.
We then rely on repeated short reads being retried until we finish the read
or reach the end of file and make a zero-length read.
Also fix a couple of places where the subrequest start and length need to
be altered by the amount so far transferred when being used.
Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We do embedd struct fown_struct into struct file letting it take up 32
bytes in total. We could tweak struct fown_struct to be more compact but
really it shouldn't even be embedded in struct file in the first place.
Instead, actual users of struct fown_struct should allocate the struct
on demand. This frees up 24 bytes in struct file.
That will have some potentially user-visible changes for the ownership
fcntl()s. Some of them can now fail due to allocation failures.
Practically, that probably will almost never happen as the allocations
are small and they only happen once per file.
The fown_struct is used during kill_fasync() which is used by e.g.,
pipes to generate a SIGIO signal. Sending of such signals is conditional
on userspace having set an owner for the file using one of the F_OWNER
fcntl()s. Such users will be unaffected if struct fown_struct is
allocated during the fcntl() call.
There are a few subsystems that call __f_setown() expecting
file->f_owner to be allocated:
(1) tun devices
file->f_op->fasync::tun_chr_fasync()
-> __f_setown()
There are no callers of tun_chr_fasync().
(2) tty devices
file->f_op->fasync::tty_fasync()
-> __tty_fasync()
-> __f_setown()
tty_fasync() has no additional callers but __tty_fasync() has. Note
that __tty_fasync() only calls __f_setown() if the @on argument is
true. It's called from:
file->f_op->release::tty_release()
-> tty_release()
-> __tty_fasync()
-> __f_setown()
tty_release() calls __tty_fasync() with @on false
=> __f_setown() is never called from tty_release().
=> All callers of tty_release() are safe as well.
file->f_op->release::tty_open()
-> tty_release()
-> __tty_fasync()
-> __f_setown()
__tty_hangup() calls __tty_fasync() with @on false
=> __f_setown() is never called from tty_release().
=> All callers of __tty_hangup() are safe as well.
From the callchains it's obvious that (1) and (2) end up getting called
via file->f_op->fasync(). That can happen either through the F_SETFL
fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If
FASYNC is requested and the file isn't already FASYNC then
file->f_op->fasync() is called with @on true which ends up causing both
(1) and (2) to call __f_setown().
(1) and (2) are the only subsystems that call __f_setown() from the
file->f_op->fasync() handler. So both (1) and (2) have been updated to
allocate a struct fown_struct prior to calling fasync_helper() to
register with the fasync infrastructure. That's safe as they both call
fasync_helper() which also does allocations if @on is true.
The other interesting case are file leases:
(3) file leases
lease_manager_ops->lm_setup::lease_setup()
-> __f_setown()
Which in turn is called from:
generic_add_lease()
-> lease_manager_ops->lm_setup::lease_setup()
-> __f_setown()
So here again we can simply make generic_add_lease() allocate struct
fown_struct prior to the lease_manager_ops->lm_setup::lease_setup()
which happens under a spinlock.
With that the two remaining subsystems that call __f_setown() are:
(4) dnotify
(5) sockets
Both have their own custom ioctls to set struct fown_struct and both
have been converted to allocate a struct fown_struct on demand from
their respective ioctls.
Interactions with O_PATH are fine as well e.g., when opening a /dev/tty
as O_PATH then no file->f_op->open() happens thus no file->f_owner is
allocated. That's fine as no file operation will be set for those and
the device has never been opened. fcntl()s called on such things will
just allocate a ->f_owner on demand. Although I have zero idea why'd you
care about f_owner on an O_PATH fd.
Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|