summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-01-24bcachefs: discard path uses unlock_long()Kent Overstreet
Some (bad) devices can have really terrible discard latency; we don't want them blocking memory reclaim and causing warnings. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-24Merge tag 'execve-v6.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fixes from Kees Cook: - Fix error handling in begin_new_exec() (Bernd Edlinger) - MAINTAINERS: specifically mention ELF (Alexey Dobriyan) - Various cleanups related to earlier open() (Askar Safin, Kees Cook) * tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Distinguish in_execve from in_exec exec: Fix error handling in begin_new_exec() exec: Add do_close_execat() helper exec: remove useless comment ELF, MAINTAINERS: specifically mention ELF
2024-01-24uselib: remove use of __FMODE_EXECLinus Torvalds
Jann Horn points out that uselib() really shouldn't trigger the new FMODE_EXEC logic introduced by commit 4759ff71f23e ("exec: __FMODE_EXEC instead of in_execve for LSMs"). In fact, it shouldn't even have ever triggered the old pre-existing logic for __FMODE_EXEC (like the NFS code that makes executables not need read permissions). Unlike a real execve(), that can work even with files that are purely executable by the user (not readable), uselib() has that MAY_READ requirement becasue it's really just a convenience wrapper around mmap() for legacy shared libraries. The whole FMODE_EXEC bit was originally introduced by commit b500531e6f5f ("[PATCH] Introduce FMODE_EXEC file flag"), primarily to give ETXTBUSY error returns for distributed filesystems. It has since grown a few other warts (like that NFS thing), but there really isn't any reason to use it for uselib(), and now that we are trying to use it to replace the horrid 'tsk->in_execve' flag, it's actively wrong. Of course, as Jann Horn also points out, nobody should be enabling CONFIG_USELIB in the first place in this day and age, but that's a different discussion entirely. Reported-by: Jann Horn <jannh@google.com> Fixes: 4759ff71f23e ("exec: __FMODE_EXEC instead of in_execve for LSMs") Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-24exec: Distinguish in_execve from in_execKees Cook
Just to help distinguish the fs->in_exec flag from the current->in_execve flag, add comments in check_unsafe_exec() and copy_fs() for more context. Also note that in_execve is only used by TOMOYO now. Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-24fsnotify: optimize the case of no parent watcherAmir Goldstein
If parent inode is not watching, check for the event in masks of sb/mount/inode masks early to optimize out most of the code in __fsnotify_parent() and avoid calling fsnotify(). Jens has reported that this optimization improves BW and IOPS in an io_uring benchmark by more than 10% and reduces perf reported CPU usage. before: + 4.51% io_uring [kernel.vmlinux] [k] fsnotify + 3.67% io_uring [kernel.vmlinux] [k] __fsnotify_parent after: + 2.37% io_uring [kernel.vmlinux] [k] __fsnotify_parent Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20240116113247.758848-1-amir73il@gmail.com>
2024-01-24nfsd: fix RELEASE_LOCKOWNERNeilBrown
The test on so_count in nfsd4_release_lockowner() is nonsense and harmful. Revert to using check_for_locks(), changing that to not sleep. First: harmful. As is documented in the kdoc comment for nfsd4_release_lockowner(), the test on so_count can transiently return a false positive resulting in a return of NFS4ERR_LOCKS_HELD when in fact no locks are held. This is clearly a protocol violation and with the Linux NFS client it can cause incorrect behaviour. If RELEASE_LOCKOWNER is sent while some other thread is still processing a LOCK request which failed because, at the time that request was received, the given owner held a conflicting lock, then the nfsd thread processing that LOCK request can hold a reference (conflock) to the lock owner that causes nfsd4_release_lockowner() to return an incorrect error. The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so it knows that the error is impossible. It assumes the lock owner was in fact released so it feels free to use the same lock owner identifier in some later locking request. When it does reuse a lock owner identifier for which a previous RELEASE failed, it will naturally use a lock_seqid of zero. However the server, which didn't release the lock owner, will expect a larger lock_seqid and so will respond with NFS4ERR_BAD_SEQID. So clearly it is harmful to allow a false positive, which testing so_count allows. The test is nonsense because ... well... it doesn't mean anything. so_count is the sum of three different counts. 1/ the set of states listed on so_stateids 2/ the set of active vfs locks owned by any of those states 3/ various transient counts such as for conflicting locks. When it is tested against '2' it is clear that one of these is the transient reference obtained by find_lockowner_str_locked(). It is not clear what the other one is expected to be. In practice, the count is often 2 because there is precisely one state on so_stateids. If there were more, this would fail. In my testing I see two circumstances when RELEASE_LOCKOWNER is called. In one case, CLOSE is called before RELEASE_LOCKOWNER. That results in all the lock states being removed, and so the lockowner being discarded (it is removed when there are no more references which usually happens when the lock state is discarded). When nfsd4_release_lockowner() finds that the lock owner doesn't exist, it returns success. The other case shows an so_count of '2' and precisely one state listed in so_stateid. It appears that the Linux client uses a separate lock owner for each file resulting in one lock state per lock owner, so this test on '2' is safe. For another client it might not be safe. So this patch changes check_for_locks() to use the (newish) find_any_file_locked() so that it doesn't take a reference on the nfs4_file and so never calls nfsd_file_put(), and so never sleeps. With this check is it safe to restore the use of check_for_locks() rather than testing so_count against the mysterious '2'. Fixes: ce3c4ad7f4ce ("NFSD: Fix possible sleep during nfsd4_release_lockowner()") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org # v6.2+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-24fs: Remove NTFS classicMatthew Wilcox (Oracle)
The replacement, NTFS3, was merged over two years ago. It is now time to remove the original from the tree as it is the last user of several APIs, and it is not worth changing. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240115072025.2071931-1-willy@infradead.org Acked-by: Namjae Jeon <linkinjeon@kernel.org> Acked-by: Dave Chinner <david@fromorbit.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-23cifs: fix stray unlock in cifs_chan_skip_or_disableShyam Prasad N
A recent change moved the code that decides to skip a channel or disable multichannel entirely, into a helper function. During this, a mutex_unlock of the session_mutex should have been removed. Doing that here. Fixes: f591062bdbf4 ("cifs: handle servers that still advertise multichannel after disabling") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23cifs: set replay flag for retries of write commandShyam Prasad N
Similar to the rest of the commands, this is a change to add replay flags on retry. This one does not add a back-off, considering that we may want to flush a write ASAP to the server. Considering that this will be a flush of cached pages, the retrans value is also not honoured. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23cifs: commands that are retried should have replay flag setShyam Prasad N
MS-SMB2 states that the header flag SMB2_FLAGS_REPLAY_OPERATION needs to be set when a command needs to be retried, so that the server is aware that this is a replay for an operation that appeared before. This can be very important, for example, for state changing operations and opens which get retried following a reconnect; since the client maybe unaware of the status of the previous open. This is particularly important for multichannel scenario, since disconnection of one connection does not mean that the session is lost. The requests can be replayed on another channel. This change also makes use of exponential back-off before replays and also limits the number of retries to "retrans" mount option value. Also, this change does not modify the read/write codepath. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23cifs: helper function to check replayable error codesShyam Prasad N
The code to check for replay is not just -EAGAIN. In some cases, the send request or receive response may result in network errors, which we're now mapping to -ECONNABORTED. This change introduces a helper function which checks if the error returned in one of the above two errors. And all checks for replays will now use this helper. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23cifs: translate network errors on send to -ECONNABORTEDShyam Prasad N
When the network stack returns various errors, we today bubble up the error to the user (in case of soft mounts). This change translates all network errors except -EINTR and -EAGAIN to -ECONNABORTED. A similar approach is taken when we receive network errors when reading from the socket. The change also forces the cifsd thread to reconnect during it's next activity. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23cifs: cifs_pick_channel should try selecting active channelsShyam Prasad N
cifs_pick_channel today just selects a channel based on the policy of least loaded channel. However, it does not take into account if the channel needs reconnect. As a result, we can have failures in send that can be completely avoided. This change doesn't make a channel a candidate for this selection if it needs reconnect. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23cifs: Share server EOF pos with netfslibDavid Howells
Use cifsi->netfs_ctx.remote_i_size instead of cifsi->server_eof so that netfslib can refer to it to. Signed-off-by: David Howells <dhowells@redhat.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23smb: Work around Clang __bdos() type confusionKees Cook
Recent versions of Clang gets confused about the possible size of the "user" allocation, and CONFIG_FORTIFY_SOURCE ends up emitting a warning[1]: repro.c:126:4: warning: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] 126 | __write_overflow_field(p_size_field, size); | ^ for this memset(): int len; __le16 *user; ... len = ses->user_name ? strlen(ses->user_name) : 0; user = kmalloc(2 + (len * 2), GFP_KERNEL); ... if (len) { ... } else { memset(user, '\0', 2); } While Clang works on this bug[2], switch to using a direct assignment, which avoids memset() entirely which both simplifies the code and silences the false positive warning. (Making "len" size_t also silences the warning, but the direct assignment seems better.) Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://github.com/ClangBuiltLinux/linux/issues/1966 [1] Link: https://github.com/llvm/llvm-project/issues/77813 [2] Cc: Steve French <sfrench@samba.org> Cc: Paulo Alcantara <pc@manguebit.com> Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> Cc: Shyam Prasad N <sprasad@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: llvm@lists.linux.dev Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23Merge tag 'trace-v6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and eventfs fixes from Steven Rostedt: - Fix histogram tracing_map insertion. The tracing_map_insert copies the value into the elt variable and then assigns the elt to the entry value. But it is possible that the entry value becomes visible on other CPUs before the elt is fully initialized. This is fixed by adding a wmb() between the initialization of the elt variable and assigning it. - Have eventfs directory have unique inode numbers. Having them be all the same proved to be a failure as the 'find' application will think that the directories are causing loops, as it checks for directory loops via their inodes. Have the evenfs dir entries get their inodes assigned when they are referenced and then save them in the eventfs_inode structure. * tag 'trace-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Save directory inodes in the eventfs_inode structure tracing: Ensure visibility when inserting an element into tracing_map
2024-01-23smb: client: delete "true", "false" definesAlexey Dobriyan
Kernel has its own official true/false definitions. The defines aren't even used in this file. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-01-23quota: Drop GFP_NOFS instances under dquot->dq_lock and dqio_semJan Kara
Quota code acquires dquot->dq_lock whenever reading / writing dquot. When reading / writing quota info we hold dqio_sem. Since these locks can be acquired during inode reclaim (through dquot_drop() -> dqput() -> dquot_release()) we are setting nofs allocation context whenever acquiring these locks. Hence there's no need to use GFP_NOFS allocations in quota code doing IO. Just switch it to GFP_KERNEL. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23quota: Set nofs allocation context when acquiring dqio_semJan Kara
dqio_sem can be acquired during inode reclaim through dquot_drop() -> dqput() -> dquot_release() -> write_file_info() context. In some cases (most notably through dquot_get_next_id()) it can be acquired without holding dquot->dq_lock (which already sets nofs allocation context). So we need to set nofs allocation context when acquiring it as well. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23ext2: Remove GFP_NOFS use in ext2_xattr_cache_insert()Jan Kara
ext2_xattr_cache_insert() calls mb_cache_entry_create() with GFP_NOFS because it is called under EXT2_I(inode)->xattr_sem. However xattr_sem or any higher ranking lock is not acquired on fs reclaim path for ext2 at least since we don't do page writeback from direct reclaim. Thus GFP_NOFS is not needed. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23ext2: Drop GFP_NOFS use in ext2_get_blocks()Jan Kara
ext2_get_blocks() calls sb_issue_zeroout() with GFP_NOFS flag. However the call is performed under inode->i_rwsem and EXT2_I(inode)->i_truncate_mutex neither of which is acquired during direct fs reclaim. So it is safe to change the gfp mask to GFP_KERNEL. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23ext2: Drop GFP_NOFS allocation from ext2_init_block_alloc_info()Jan Kara
The allocation happens under inode->i_rwsem and EXT2_I(inode)->i_truncate_mutex. Neither of them is acquired during direct fs reclaim so the allocation can be changed to GFP_KERNEL. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23udf: Remove GFP_NOFS allocation in udf_expand_file_adinicb()Jan Kara
udf_expand_file_adinicb() is called under inode->i_rwsem and mapping->invalidate_lock. i_rwsem is safe wrt fs reclaim, invalidate_lock on this inode is safe as well (we hold inode reference so reclaim will not touch it, furthermore even lockdep should not complain as invalidate_lock is acquired from udf_evict_inode() only when truncating inode which should not happen from fs reclaim). Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23udf: Avoid GFP_NOFS allocation in udf_load_pvoldesc()Jan Kara
udf_load_pvoldesc() is called only during mount when it is safe to enter fs reclaim (we hold only s_umount semaphore). Change GFP_NOFS to GFP_KERNEL allocation. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23udf: Avoid GFP_NOFS allocation in udf_symlink()Jan Kara
The GFP_NOFS allocation in udf_symlink() is called only under inode->i_rwsem and UDF_I(inode)->i_data_sem. The first is safe wrt reclaim, the second should be as well but allocating unde this lock is actually unnecessary. Move the allocation from under i_data_sem and change it to GFP_KERNEL. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23udf: Remove GFP_NOFS from dir iteration codeJan Kara
Directory iteration code was using GFP_NOFS allocations in two places. However the code is called only under inode->i_rwsem which is generally safe wrt reclaim. So we can do the allocations with GFP_KERNEL instead. Signed-off-by: Jan Kara <jack@suse.cz>
2024-01-23Merge tag 'exportfs-6.9' of ↵Christian Brauner
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux Merge exportfs fixes from Chuck Lever: * tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux: fs: Create a generic is_dot_dotdot() utility exportfs: fix the fallback implementation of the get_name export operation Link: https://lore.kernel.org/r/BDC2AEB4-7085-4A7C-8DE8-A659FE1DBA6A@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-23eventfd: move 'eventfd-count' printing out of spinlockWen Yang
When printing eventfd->count, interrupts will be disabled and a spinlock will be obtained, competing with eventfd_write(). By moving the "eventfd-count" print out of the spinlock and merging multiple seq_printf() into one, it could improve a bit, just like timerfd_show(). Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Link: https://lore.kernel.org/r/tencent_B0B3D2BD9861FD009E03AB18A81783322709@qq.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dylan Yudaken <dylany@fb.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-kernel@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-23fs: Create a generic is_dot_dotdot() utilityChuck Lever
De-duplicate the same functionality in several places by hoisting the is_dot_dotdot() utility function into linux/fs.h. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-23exportfs: fix the fallback implementation of the get_name export operationTrond Myklebust
The fallback implementation for the get_name export operation uses readdir() to try to match the inode number to a filename. That filename is then used together with lookup_one() to produce a dentry. A problem arises when we match the '.' or '..' entries, since that causes lookup_one() to fail. This has sometimes been seen to occur for filesystems that violate POSIX requirements around uniqueness of inode numbers, something that is common for snapshot directories. This patch just ensures that we skip '.' and '..' rather than allowing a match. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/linux-nfs/CAOQ4uxiOZobN76OKB-VBNXWeFKVwLW_eK5QtthGyYzWU9mjb7Q@mail.gmail.com/ Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-23eventfs: Save directory inodes in the eventfs_inode structureSteven Rostedt (Google)
The eventfs inodes and directories are allocated when referenced. But this leaves the issue of keeping consistent inode numbers and the number is only saved in the inode structure itself. When the inode is no longer referenced, it can be freed. When the file that the inode was representing is referenced again, the inode is once again created, but the inode number needs to be the same as it was before. Just making the inode numbers the same for all files is fine, but that does not work with directories. The find command will check for loops via the inode number and having the same inode number for directories triggers: # find /sys/kernel/tracing find: File system loop detected; '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as '/sys/kernel/debug/tracing/events/initcall'. [..] Linus pointed out that the eventfs_inode structure ends with a single 32bit int, and on 64 bit machines, there's likely a 4 byte hole due to alignment. We can use this hole to store the inode number for the eventfs_inode. All directories in eventfs are represented by an eventfs_inode and that data structure can hold its inode number. That last int was also purposely placed at the end of the structure to prevent holes from within. Now that there's a 4 byte number to hold the inode, both the inode number and the last integer can be moved up in the structure for better cache locality, where the llist and rcu fields can be moved to the end as they are only used when the eventfs_inode is being deleted. Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories all be the same") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Kees Cook <keescook@chromium.org>
2024-01-23ovl: mark xwhiteouts directory with overlay.opaque='x'Amir Goldstein
An opaque directory cannot have xwhiteouts, so instead of marking an xwhiteouts directory with a new xattr, overload overlay.opaque xattr for marking both opaque dir ('y') and xwhiteouts dir ('x'). This is more efficient as the overlay.opaque xattr is checked during lookup of directory anyway. This also prevents unnecessary checking the xattr when reading a directory without xwhiteouts, i.e. most of the time. Note that the xwhiteouts marker is not checked on the upper layer and on the last layer in lowerstack, where xwhiteouts are not expected. Fixes: bc8df7a3dc03 ("ovl: Add an alternative type of whiteout") Cc: <stable@vger.kernel.org> # v6.7 Reviewed-by: Alexander Larsson <alexl@redhat.com> Tested-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-01-22Revert "btrfs: zstd: fix and simplify the inline extent decompression"Linus Torvalds
This reverts commit 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b. It causes my machine to not even boot, and Klara Modin reports that the cause is that small zstd-compressed files return garbage when read. Reported-by: Klara Modin <klarasmodin@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/ Reported-and-bisected-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Sterba <dsterba@suse.com> Cc: Qu Wenruo <wqu@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-22afs: Fix missing/incorrect unlocking of RCU read lockDavid Howells
In afs_proc_addr_prefs_show(), we need to unlock the RCU read lock in both places before returning (and not lock it again). Fixes: f94f70d39cc2 ("afs: Provide a way to configure address priorities") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202401172243.cd53d5f6-oliver.sang@intel.com Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org
2024-01-22afs: Remove afs_dynroot_d_revalidate() as it is redundantDavid Howells
Remove afs_dynroot_d_revalidate() as it is redundant as all it does is return 1 and the caller assumes that if the op is not given. Suggested-by: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org
2024-01-22afs: Fix error handling with lookup via FS.InlineBulkStatusDavid Howells
When afs does a lookup, it tries to use FS.InlineBulkStatus to preemptively look up a bunch of files in the parent directory and cache this locally, on the basis that we might want to look at them too (for example if someone does an ls on a directory, they may want want to then stat every file listed). FS.InlineBulkStatus can be considered a compound op with the normal abort code applying to the compound as a whole. Each status fetch within the compound is then given its own individual abort code - but assuming no error that prevents the bulk fetch from returning the compound result will be 0, even if all the constituent status fetches failed. At the conclusion of afs_do_lookup(), we should use the abort code from the appropriate status to determine the error to return, if any - but instead it is assumed that we were successful if the op as a whole succeeded and we return an incompletely initialised inode, resulting in ENOENT, no matter the actual reason. In the particular instance reported, a vnode with no permission granted to be accessed is being given a UAEACCES abort code which should be reported as EACCES, but is instead being reported as ENOENT. Fix this by abandoning the inode (which will be cleaned up with the op) if file[1] has an abort code indicated and turn that abort code into an error instead. Whilst we're at it, add a tracepoint so that the abort codes of the individual subrequests of FS.InlineBulkStatus can be logged. At the moment only the container abort code can be 0. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2024-01-22afs: Hide silly-rename files from userspaceDavid Howells
There appears to be a race between silly-rename files being created/removed and various userspace tools iterating over the contents of a directory, leading to such errors as: find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory tar: ./include/linux/greybus/.__afs3C95: File removed before we read it when building a kernel. Fix afs_readdir() so that it doesn't return .__afsXXXX silly-rename files to userspace. This doesn't stop them being looked up directly by name as we need to be able to look them up from within the kernel as part of the silly-rename algorithm. Fixes: 79ddbfa500b3 ("afs: Implement sillyrename for unlink and rename") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
2024-01-22cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-modeDavid Howells
cachefiles_ondemand_init_object() as called from cachefiles_open_file() and cachefiles_create_tmpfile() does not check if object->ondemand is set before dereferencing it, leading to an oops something like: RIP: 0010:cachefiles_ondemand_init_object+0x9/0x41 ... Call Trace: <TASK> cachefiles_open_file+0xc9/0x187 cachefiles_lookup_cookie+0x122/0x2be fscache_cookie_state_machine+0xbe/0x32b fscache_cookie_worker+0x1f/0x2d process_one_work+0x136/0x208 process_scheduled_works+0x3a/0x41 worker_thread+0x1a2/0x1f6 kthread+0xca/0xd2 ret_from_fork+0x21/0x33 Fix this by making cachefiles_ondemand_init_object() return immediately if cachefiles->ondemand is NULL. Fixes: 3c5ecfe16e76 ("cachefiles: extract ondemand info field from cachefiles_object") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Gao Xiang <xiang@kernel.org> cc: Chao Yu <chao@kernel.org> cc: Yue Hu <huyue2@coolpad.com> cc: Jeffle Xu <jefflexu@linux.alibaba.com> cc: linux-erofs@lists.ozlabs.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org
2024-01-22netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write()Dan Carpenter
The netfs_grab_folio_for_write() function doesn't return NULL, it returns error pointers. Update the check accordingly. Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/29fb1310-8e2d-47ba-b68d-40354eb7b896@moroto.mountain/
2024-01-22netfs, fscache: Prevent Oops in fscache_put_cache()Dan Carpenter
This function dereferences "cache" and then checks if it's IS_ERR_OR_NULL(). Check first, then dereference. Fixes: 9549332df4ed ("fscache: Implement cache registration") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/e84bc740-3502-4f16-982a-a40d5676615c@moroto.mountain/ # v2
2024-01-22cifs: Don't use certain unnecessary folio_*() functionsDavid Howells
Filesystems should use folio->index and folio->mapping, instead of folio_index(folio), folio_mapping() and folio_file_mapping() since they know that it's in the pagecache. Change this automagically with: perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/smb/client/*.c perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/smb/client/*.c perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/smb/client/*.c Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Ronnie Sahlberg <lsahlber@redhat.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org
2024-01-22afs: Don't use certain unnecessary folio_*() functionsDavid Howells
Filesystems should use folio->index and folio->mapping, instead of folio_index(folio), folio_mapping() and folio_file_mapping() since they know that it's in the pagecache. Change this automagically with: perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/afs/*.c perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/afs/*.c perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/afs/*.c Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org
2024-01-22netfs: Don't use certain unnecessary folio_*() functionsDavid Howells
Filesystems should use folio->index and folio->mapping, instead of folio_index(folio), folio_mapping() and folio_file_mapping() since they know that it's in the pagecache. Change this automagically with: perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/netfs/*.c perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/netfs/*.c perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/netfs/*.c Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-cifs@vger.kernel.org cc: linux-erofs@lists.ozlabs.org cc: linux-fsdevel@vger.kernel.org
2024-01-22Merge tag 'for-6.8-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned mode fixes: - fix slowdown when writing large file sequentially by looking up block groups with enough space faster - locking fixes when activating a zone - new mount API fixes: - preserve mount options for a ro/rw mount of the same subvolume - scrub fixes: - fix use-after-free in case the chunk length is not aligned to 64K, this does not happen normally but has been reported on images converted from ext4 - similar alignment check was missing with raid-stripe-tree - subvolume deletion fixes: - prevent calling ioctl on already deleted subvolume - properly track flag tracking a deleted subvolume - in subpage mode, fix decompression of an inline extent (zlib, lzo, zstd) - fix crash when starting writeback on a folio, after integration with recent MM changes this needs to be started conditionally - reject unknown flags in defrag ioctl - error handling, API fixes, minor warning fixes * tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: limit RST scrub to chunk boundary btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned btrfs: don't unconditionally call folio_start_writeback in subpage btrfs: use the original mount's mount options for the legacy reconfigure btrfs: don't warn if discard range is not aligned to sector btrfs: tree-checker: fix inline ref size in error messages btrfs: zstd: fix and simplify the inline extent decompression btrfs: lzo: fix and simplify the inline extent decompression btrfs: zlib: fix and simplify the inline extent decompression btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted btrfs: don't abort filesystem when attempting to snapshot deleted subvolume btrfs: zoned: fix lock ordering in btrfs_zone_activate() btrfs: fix unbalanced unlock of mapping_tree_lock btrfs: ref-verify: free ref cache before clearing mount opt btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send() btrfs: zoned: optimize hint byte for zoned allocator btrfs: zoned: factor out prepare_allocation_zoned()
2024-01-22exec: Fix error handling in begin_new_exec()Bernd Edlinger
If get_unused_fd_flags() fails, the error handling is incomplete because bprm->cred is already set to NULL, and therefore free_bprm will not unlock the cred_guard_mutex. Note there are two error conditions which end up here, one before and one after bprm->cred is cleared. Fixes: b8a61c9e7b4a ("exec: Generic execfd support") Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://lore.kernel.org/r/AS8P193MB128517ADB5EFF29E04389EDAE4752@AS8P193MB1285.EURP193.PROD.OUTLOOK.COM Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22exec: Add do_close_execat() helperKees Cook
Consolidate the calls to allow_write_access()/fput() into a single place, since we repeat this code pattern. Add comments around the callers for the details on it. Link: https://lore.kernel.org/r/202209161637.9EDAF6B18@keescook Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22exec: remove useless commentAskar Safin
Function name is wrong and the comment tells us nothing Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/r/20240109030801.31827-1-safinaskar@zohomail.com Signed-off-by: Kees Cook <keescook@chromium.org>
2024-01-22bcachefs: fix incorrect usage of REQ_OP_FLUSHChristoph Hellwig
REQ_OP_FLUSH is only for internal use in the blk-mq and request based drivers. File systems and other block layer consumers must use REQ_OP_WRITE | REQ_PREFLUSH as documented in Documentation/block/writeback_cache_control.rst. While REQ_OP_FLUSH appears to work for blk-mq drivers it does not get the proper flush state machine handling, and completely fails for any bio based drivers, including all the stacking drivers. The block layer will also get a check in 6.8 to reject this use case entirely. [Note: completely untested, but as this never got fixed since the original bug report in November: https://bugzilla.kernel.org/show_bug.cgi?id=218184 and the the discussion in December: https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/ this seems to be best way to force it] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-22bcachefs: Add gfp flags param to bch2_prt_task_backtrace()Kent Overstreet
Fixes: e6a2566f7a00 ("bcachefs: Better journal tracepoints") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: smatch
2024-01-22do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleakNikita Zhandarovich
syzbot identified a kernel information leak vulnerability in do_sys_name_to_handle() and issued the following report [1]. [1] "BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline] BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x100 lib/usercopy.c:40 instrument_copy_to_user include/linux/instrumented.h:114 [inline] _copy_to_user+0xbc/0x100 lib/usercopy.c:40 copy_to_user include/linux/uaccess.h:191 [inline] do_sys_name_to_handle fs/fhandle.c:73 [inline] __do_sys_name_to_handle_at fs/fhandle.c:112 [inline] __se_sys_name_to_handle_at+0x949/0xb10 fs/fhandle.c:94 __x64_sys_name_to_handle_at+0xe4/0x140 fs/fhandle.c:94 ... Uninit was created at: slab_post_alloc_hook+0x129/0xa70 mm/slab.h:768 slab_alloc_node mm/slub.c:3478 [inline] __kmem_cache_alloc_node+0x5c9/0x970 mm/slub.c:3517 __do_kmalloc_node mm/slab_common.c:1006 [inline] __kmalloc+0x121/0x3c0 mm/slab_common.c:1020 kmalloc include/linux/slab.h:604 [inline] do_sys_name_to_handle fs/fhandle.c:39 [inline] __do_sys_name_to_handle_at fs/fhandle.c:112 [inline] __se_sys_name_to_handle_at+0x441/0xb10 fs/fhandle.c:94 __x64_sys_name_to_handle_at+0xe4/0x140 fs/fhandle.c:94 ... Bytes 18-19 of 20 are uninitialized Memory access of size 20 starts at ffff888128a46380 Data copied to user address 0000000020000240" Per Chuck Lever's suggestion, use kzalloc() instead of kmalloc() to solve the problem. Fixes: 990d6c2d7aee ("vfs: Add name to file handle conversion support") Suggested-by: Chuck Lever III <chuck.lever@oracle.com> Reported-and-tested-by: <syzbot+09b349b3066c2e0b1e96@syzkaller.appspotmail.com> Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Link: https://lore.kernel.org/r/20240119153906.4367-1-n.zhandarovich@fintech.ru Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>