summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-06-17open: add close_range()Christian Brauner
This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. I was contacted by FreeBSD as they wanted to have the same close_range() syscall as we proposed here. We've coordinated this and in the meantime, Kyle was fast enough to merge close_range() into FreeBSD already in April: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2]) once its merged in Linux too. Python is in the process of switching to close_range() on FreeBSD and they are waiting on us to merge this to switch on Linux as well: https://bugs.python.org/issue38061 The syscall came up in a recent discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall (cf. [1]). Note, a syscall in this manner has been requested by various people over time. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive */ unshare(CLONE_FILES); /* we don't want anything past stderr here */ close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). (Please note, unshare(CLONE_FILES) should only be needed if the calling task is multi-threaded and shares the file descriptor table with another thread in which case two threads could race with one thread allocating file descriptors and the other one closing them via close_range(). For the general case close_range() before the execve() is sufficient.) Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc/<pid>/fd/* and calling close() on each file descriptor. From looking at various large(ish) userspace code bases this or similar patterns are very common in: - service managers (cf. [4]) - libcs (cf. [6]) - container runtimes (cf. [5]) - programming language runtimes/standard libraries - Python (cf. [2]) - Rust (cf. [7], [8]) As Dmitry pointed out there's even a long-standing glibc bug about missing kernel support for this task (cf. [3]). In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery (cf. comment [8] on Rust). The performance is striking. For good measure, comparing the following simple close_all_fds() userspace implementation that is essentially just glibc's version in [6]: static int close_all_fds(void) { int dir_fd; DIR *dir; struct dirent *direntp; dir = opendir("/proc/self/fd"); if (!dir) return -1; dir_fd = dirfd(dir); while ((direntp = readdir(dir))) { int fd; if (strcmp(direntp->d_name, ".") == 0) continue; if (strcmp(direntp->d_name, "..") == 0) continue; fd = atoi(direntp->d_name); if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2) continue; close(fd); } closedir(dir); return 0; } to close_range() yields: 1. closing 4 open files: - close_all_fds(): ~280 us - close_range(): ~24 us 2. closing 1000 open files: - close_all_fds(): ~5000 us - close_range(): ~800 us close_range() is designed to allow for some flexibility. Specifically, it does not simply always close all open file descriptors of a task. Instead, callers can specify an upper bound. This is e.g. useful for scenarios where specific file descriptors are created with well-known numbers that are supposed to be excluded from getting closed. For extra paranoia close_range() comes with a flags argument. This can e.g. be used to implement extension. Once can imagine userspace wanting to stop at the first error instead of ignoring errors under certain circumstances. There might be other valid ideas in the future. In any case, a flag argument doesn't hurt and keeps us on the safe side. From an implementation side this is kept rather dumb. It saw some input from David and Jann but all nonsense is obviously my own! - Errors to close file descriptors are currently ignored. (Could be changed by setting a flag in the future if needed.) - __close_range() is a rather simplistic wrapper around __close_fd(). My reasoning behind this is based on the nature of how __close_fd() needs to release an fd. But maybe I misunderstood specifics: We take the files_lock and rcu-dereference the fdtable of the calling task, we find the entry in the fdtable, get the file and need to release files_lock before calling filp_close(). In the meantime the fdtable might have been altered so we can't just retake the spinlock and keep the old rcu-reference of the fdtable around. Instead we need to grab a fresh reference to the fdtable. If my reasoning is correct then there's really no point in fancyfying __close_range(): We just need to rcu-dereference the fdtable of the calling task once to cap the max_fd value correctly and then go on calling __close_fd() in a loop. /* References */ [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/ [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220 [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7 [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217 [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236 [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17 Note that this is an internal implementation that is not exported. Currently, libc seems to not provide an exported version of this because of missing kernel support to do this. Note, in a recent patch series Florian made grantpt() a nop thereby removing the code referenced here. [7]: https://github.com/rust-lang/rust/issues/12148 [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308 Rust's solution is slightly different but is equally unperformant. Rust calls getdtablesize() which is a glibc library function that simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then goes on to call close() on each fd. That's obviously overkill for most tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or OPEN_MAX. Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set to 1024. Even in this case, there's a very high chance that in the common case Rust is calling the close() syscall 1021 times pointlessly if the task just has 0, 1, and 2 open. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kyle Evans <self@kyle-evans.net> Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org
2020-06-16afs: Fix silly renameDavid Howells
Fix AFS's silly rename by the following means: (1) Set the destination directory in afs_do_silly_rename() so as to avoid misbehaviour and indicate that the directory data version will increment by 1 so as to avoid warnings about unexpected changes in the DV. Also indicate that the ctime should be updated to avoid xfstest grumbling. (2) Note when the server indicates that a directory changed more than we expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a third party change, checking on successful completion of unlink and rename. The problem is that the FS.RemoveFile RPC op doesn't report the status of the unlinked file, though YFS.RemoveFile2 does. This can be mitigated by the assumption that if the directory DV cranked by exactly 1, we can be sure we removed one link from the file; further, ordinarily in AFS, files cannot be hardlinked across directories, so if we reduce nlink to 0, the file is deleted. However, if the directory DV jumps by more than 1, we cannot know if a third party intervened by adding or removing a link on the file we just removed a link from. The same also goes for any vnode that is at the destination of the FS.Rename RPC op. (3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock section along with the other attribute updates if ->op_unlinked is set on the descriptor for the appropriate vnode. (4) Issue a follow up status fetch to the unlinked file in the event of a third party conflict that makes it impossible for us to know if we actually deleted the file or not. (5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to the user about the nlink of a silly deleted file so that it appears as 0, not 1. Found with the generic/035 and generic/084 xfstests. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16btrfs: use kfree() in btrfs_ioctl_get_subvol_info()Waiman Long
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc() was incorrectly paired with kzfree(). According to David Sterba, there isn't any sensitive information in the subvol_info that needs to be cleared before freeing. So kzfree() isn't really needed, use kfree() instead. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IOFilipe Manana
A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be held for a long time or for ongoing IO to complete. However when calling check_can_nocow(), if the inode has prealloc extents or has the NOCOW flag set, we can block on extent (file range) locks through the call to btrfs_lock_and_flush_ordered_range(). Such lock can take a significant amount of time to be available. For example, a fiemap task may be running, and iterating through the entire file range checking all extents and doing backref walking to determine if they are shared, or a readpage operation may be in progress. Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(), after locking the file range we wait for any existing ordered extent that is in progress to complete. Another operation that can take a significant amount of time and defeat the purpose of RWF_NOWAIT. So fix this by trying to lock the file range and if it's currently locked return -EAGAIN to user space. If we are able to lock the file range without waiting and there is an ordered extent in the range, return -EAGAIN as well, instead of waiting for it to complete. Finally, don't bother trying to lock the snapshot lock of the root when attempting a RWF_NOWAIT write, as that is only important for buffered writes. Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix RWF_NOWAIT write not failling when we need to cowFilipe Manana
If we attempt to do a RWF_NOWAIT write against a file range for which we can only do NOCOW for a part of it, due to the existence of holes or shared extents for example, we proceed with the write as if it were possible to NOCOW the whole range. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/sdj/bar $ chattr +C /mnt/sdj/bar $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar wrote 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec) $ xfs_io -c "fpunch 64K 64K" /mnt/bar $ sync $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar wrote 131072/131072 bytes at offset 0 128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec) This last write should fail with -EAGAIN since the file range from 64K to 128K is a hole. On xfs it fails, as expected, but on ext4 it currently succeeds because apparently it is expensive to check if there are extents allocated for the whole range, but I'll check with the ext4 people. Fix the issue by checking if check_can_nocow() returns a number of NOCOW'able bytes smaller then the requested number of bytes, and if it does return -EAGAIN. Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eofFilipe Manana
If we attempt to write to prealloc extent located after eof using a RWF_NOWAIT write, we always fail with -EAGAIN. We do actually check if we have an allocated extent for the write at the start of btrfs_file_write_iter() through a call to check_can_nocow(), but later when we go into the actual direct IO write path we simply return -EAGAIN if the write starts at or beyond EOF. Trivial to reproduce: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foo $ chattr +C /mnt/foo $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec) $ xfs_io -c "falloc -k 64K 1M" /mnt/foo $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo pwrite: Resource temporarily unavailable On xfs and ext4 the write succeeds, as expected. Fix this by removing the wrong check at btrfs_direct_IO(). Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix hang on snapshot creation after RWF_NOWAIT writeFilipe Manana
If we do a successful RWF_NOWAIT write we end up locking the snapshot lock of the inode, through a call to check_can_nocow(), but we never unlock it. This means the next attempt to create a snapshot on the subvolume will hang forever. Trivial reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foobar $ chattr +C /mnt/foobar $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap --> hangs Fix this by unlocking the snapshot lock if check_can_nocow() returned success. Fixes: edf064e7c6fec3 ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: check if a log root exists before locking the log_mutex on unlinkFilipe Manana
This brings back an optimization that commit e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") removed, but in a different form. So it's almost equivalent to a revert. That commit removed an optimization where we avoid locking a root's log_mutex when there is no log tree created in the current transaction. The affected code path is triggered through unlink operations. That commit was based on the assumption that the optimization was not necessary because we used to have the following checks when the patch was authored: int btrfs_del_dir_entries_in_log(...) { (...) if (dir->logged_trans < trans->transid) return 0; ret = join_running_log_trans(root); (...) } int btrfs_del_inode_ref_in_log(...) { (...) if (inode->logged_trans < trans->transid) return 0; ret = join_running_log_trans(root); (...) } However before that patch was merged, another patch was merged first which replaced those checks because they were buggy. That other patch corresponds to commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions"). The assumption that if the logged_trans field of an inode had a smaller value then the current transaction's generation (transid) meant that the inode was not logged in the current transaction was only correct if the inode was not evicted and reloaded in the current transaction. So the corresponding bug fix changed those checks and replaced them with the following helper function: static bool inode_logged(struct btrfs_trans_handle *trans, struct btrfs_inode *inode) { if (inode->logged_trans == trans->transid) return true; if (inode->last_trans == trans->transid && test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) && !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags)) return true; return false; } So if we have a subvolume without a log tree in the current transaction (because we had no fsyncs), every time we unlink an inode we can end up trying to lock the log_mutex of the root through join_running_log_trans() twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log()) and once for the parent directory (with btrfs_del_dir_entries_in_log()). This means if we have several unlink operations happening in parallel for inodes in the same subvolume, and the those inodes and/or their parent inode were changed in the current transaction, we end up having a lot of contention on the log_mutex. The test robots from intel reported a -30.7% performance regression for a REAIM test after commit e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans"). So just bring back the optimization to join_running_log_trans() where we check first if a log root exists before trying to lock the log_mutex. This is done by checking for a bit that is set on the root when a log tree is created and removed when a log tree is freed (at transaction commit time). Commit e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") was merged in the 5.4 merge window while commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions") was merged in the 5.3 merge window. But the first commit was actually authored before the second commit (May 23 2019 vs June 19 2019). Reported-by: kernel test robot <rong.a.chen@intel.com> Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/ Fixes: e678934cbe5f02 ("btrfs: Remove unnecessary check from join_running_log_trans") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix bytes_may_use underflow when running balance and scrub in parallelFilipe Manana
When balance and scrub are running in parallel it is possible to end up with an underflow of the bytes_may_use counter of the data space_info object, which triggers a warning like the following: [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data [134243.806891] ------------[ cut here ]------------ [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs] [134243.808819] Modules linked in: btrfs blake2b_generic xor (...) [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483) [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs] [134243.819963] Code: 0b f2 85 (...) [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287 [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000 [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810 [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000 [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000 [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810 [134243.827432] FS: 0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000 [134243.828451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0 [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134243.831867] Call Trace: [134243.832211] find_free_extent+0x4a0/0x16c0 [btrfs] [134243.832846] btrfs_reserve_extent+0x91/0x180 [btrfs] [134243.833487] cow_file_range+0x12d/0x490 [btrfs] [134243.834080] fallback_to_cow+0x82/0x1b0 [btrfs] [134243.834689] ? release_extent_buffer+0x121/0x170 [btrfs] [134243.835370] run_delalloc_nocow+0x33f/0xa30 [btrfs] [134243.836032] btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs] [134243.836725] ? find_lock_delalloc_range+0x221/0x250 [btrfs] [134243.837450] writepage_delalloc+0xe8/0x150 [btrfs] [134243.838059] __extent_writepage+0xe8/0x4c0 [btrfs] [134243.838674] extent_write_cache_pages+0x237/0x530 [btrfs] [134243.839364] extent_writepages+0x44/0xa0 [btrfs] [134243.839946] do_writepages+0x23/0x80 [134243.840401] __writeback_single_inode+0x59/0x700 [134243.841006] writeback_sb_inodes+0x267/0x5f0 [134243.841548] __writeback_inodes_wb+0x87/0xe0 [134243.842091] wb_writeback+0x382/0x590 [134243.842574] ? wb_workfn+0x4a2/0x6c0 [134243.843030] wb_workfn+0x4a2/0x6c0 [134243.843468] process_one_work+0x26d/0x6a0 [134243.843978] worker_thread+0x4f/0x3e0 [134243.844452] ? process_one_work+0x6a0/0x6a0 [134243.844981] kthread+0x103/0x140 [134243.845400] ? kthread_create_worker_on_cpu+0x70/0x70 [134243.846030] ret_from_fork+0x3a/0x50 [134243.846494] irq event stamp: 0 [134243.846892] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134243.848687] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134243.850698] ---[ end trace bd7c03622e0b0a96 ]--- [134243.851335] ------------[ cut here ]------------ When relocating a data block group, for each extent allocated in the block group we preallocate another extent with the same size for the data relocation inode (we do it at prealloc_file_extent_cluster()). We reserve space by calling btrfs_check_data_free_space(), which ends up incrementing the data space_info's bytes_may_use counter, and then call btrfs_prealloc_file_range() to allocate the extent, which always decrements the bytes_may_use counter by the same amount. The expectation is that writeback of the data relocation inode always follows a NOCOW path, by writing into the preallocated extents. However, when starting writeback we might end up falling back into the COW path, because the block group that contains the preallocated extent was turned into RO mode by a scrub running in parallel. The COW path then calls the extent allocator which ends up calling btrfs_add_reserved_bytes(), and this function decrements the bytes_may_use counter of the data space_info object by an amount corresponding to the size of the allocated extent, despite we haven't previously incremented it. When the counter currently has a value smaller then the allocated extent we reset the counter to 0 and emit a warning, otherwise we just decrement it and slowly mess up with this counter which is crucial for space reservation, the end result can be granting reserved space to tasks when there isn't really enough free space, and having the tasks fail later in critical places where error handling consists of a transaction abort or hitting a BUG_ON(). Fix this by making sure that if we fallback to the COW path for a data relocation inode, we increment the bytes_may_use counter of the data space_info object. The COW path will then decrement it at btrfs_add_reserved_bytes() on success or through its error handling part by a call to extent_clear_unlock_delalloc() (which ends up calling btrfs_clear_delalloc_extent() that does the decrement operation) in case of an error. Test case btrfs/061 from fstests could sporadically trigger this. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix data block group relocation failure due to concurrent scrubFilipe Manana
When running relocation of a data block group while scrub is running in parallel, it is possible that the relocation will fail and abort the current transaction with an -EINVAL error: [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents [134243.999871] ------------[ cut here ]------------ [134244.000741] BTRFS: Transaction aborted (error -22) [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.017151] Code: 48 c7 c7 (...) [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286 [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000 [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001 [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001 [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08 [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000 [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000 [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0 [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134244.034484] Call Trace: [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs] [134244.035859] do_relocation+0x30b/0x790 [btrfs] [134244.036681] ? do_raw_spin_unlock+0x49/0xc0 [134244.037460] ? _raw_spin_unlock+0x29/0x40 [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs] [134244.039245] relocate_block_group+0x388/0x770 [btrfs] [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs] [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs] [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs] [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs] [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs] [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs] [134244.049043] ? do_raw_spin_unlock+0x49/0xc0 [134244.049838] ? _raw_spin_unlock+0x29/0x40 [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0 [134244.051417] ? ksys_ioctl+0x92/0xb0 [134244.052070] ksys_ioctl+0x92/0xb0 [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c [134244.053511] __x64_sys_ioctl+0x16/0x20 [134244.054206] do_syscall_64+0x5c/0x280 [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe [134244.055819] RIP: 0033:0x7f29b51c9dd7 [134244.056491] Code: 00 00 00 (...) [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7 [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003 [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000 [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0 [134244.067626] irq event stamp: 0 [134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134244.073432] ---[ end trace bd7c03622e0b0a99 ]--- The -EINVAL error comes from the following chain of function calls: __btrfs_cow_block() <-- aborts the transaction btrfs_reloc_cow_block() replace_file_extents() get_new_location() <-- returns -EINVAL When relocating a data block group, for each allocated extent of the block group, we preallocate another extent (at prealloc_file_extent_cluster()), associated with the data relocation inode, and then dirty all its pages. These preallocated extents have, and must have, the same size that extents from the data block group being relocated have. Later before we start the relocation stage that updates pointers (bytenr field of file extent items) to point to the the new extents, we trigger writeback for the data relocation inode. The expectation is that writeback will write the pages to the previously preallocated extents, that it follows the NOCOW path. That is generally the case, however, if a scrub is running it may have turned the block group that contains those extents into RO mode, in which case writeback falls back to the COW path. However in the COW path instead of allocating exactly one extent with the expected size, the allocator may end up allocating several smaller extents due to free space fragmentation - because we tell it at cow_file_range() that the minimum allocation size can match the filesystem's sector size. This later breaks the relocation's expectation that an extent associated to a file extent item in the data relocation inode has the same size as the respective extent pointed by a file extent item in another tree - in this case the extent to which the relocation inode poins to is smaller, causing relocation.c:get_new_location() to return -EINVAL. For example, if we are relocating a data block group X that has a logical address of X and the block group has an extent allocated at the logical address X + 128KiB with a size of 64KiB: 1) At prealloc_file_extent_cluster() we allocate an extent for the data relocation inode with a size of 64KiB and associate it to the file offset 128KiB (X + 128KiB - X) of the data relocation inode. This preallocated extent was allocated at block group Z; 2) A scrub running in parallel turns block group Z into RO mode and starts scrubing its extents; 3) Relocation triggers writeback for the data relocation inode; 4) When running delalloc (btrfs_run_delalloc_range()), we try first the NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC set in its flags. However, because block group Z is in RO mode, the NOCOW path (run_delalloc_nocow()) falls back into the COW path, by calling cow_file_range(); 5) At cow_file_range(), in the first iteration of the while loop we call btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum allocation size of 4KiB (fs_info->sectorsize). Due to free space fragmentation, btrfs_reserve_extent() ends up allocating two extents of 32KiB each, each one on a different iteration of that while loop; 6) Writeback of the data relocation inode completes; 7) Relocation proceeds and ends up at relocation.c:replace_file_extents(), with a leaf which has a file extent item that points to the data extent from block group X, that has a logical address (bytenr) of X + 128KiB and a size of 64KiB. Then it calls get_new_location(), which does a lookup in the data relocation tree for a file extent item starting at offset 128KiB (X + 128KiB - X) and belonging to the data relocation inode. It finds a corresponding file extent item, however that item points to an extent that has a size of 32KiB, which doesn't match the expected size of 64KiB, resuling in -EINVAL being returned from this function and propagated up to __btrfs_cow_block(), which aborts the current transaction. To fix this make sure that at cow_file_range() when we call the allocator we pass it a minimum allocation size corresponding the desired extent size if the inode belongs to the data relocation tree, otherwise pass it the filesystem's sector size as the minimum allocation size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix race between block group removal and block group creationFilipe Manana
There is a race between block group removal and block group creation when the removal is completed by a task running fitrim or scrub. When this happens we end up failing the block group creation with an error -EEXIST since we attempt to insert a duplicate block group item key in the extent tree. That results in a transaction abort. The race happens like this: 1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes block group X with btrfs_freeze_block_group() (until very recently that was named btrfs_get_block_group_trimming()); 2) Task B starts removing block group X, either because it's now unused or due to relocation for example. So at btrfs_remove_block_group(), while holding the chunk mutex and the block group's lock, it sets the 'removed' flag of the block group and it sets the local variable 'remove_em' to false, because the block group is currently frozen (its 'frozen' counter is > 0, until very recently this counter was named 'trimming'); 3) Task B unlocks the block group and the chunk mutex; 4) Task A is done trimming the block group and unfreezes the block group by calling btrfs_unfreeze_block_group() (until very recently this was named btrfs_put_block_group_trimming()). In this function we lock the block group and set the local variable 'cleanup' to true because we were able to decrement the block group's 'frozen' counter down to 0 and the flag 'removed' is set in the block group. Since 'cleanup' is set to true, it locks the chunk mutex and removes the extent mapping representing the block group from the mapping tree; 5) Task C allocates a new block group Y and it picks up the logical address that block group X had as the logical address for Y, because X was the block group with the highest logical address and now the second block group with the highest logical address, the last in the fs mapping tree, ends at an offset corresponding to block group X's logical address (this logical address selection is done at volumes.c:find_next_chunk()). At this point the new block group Y does not have yet its item added to the extent tree (nor the corresponding device extent items and chunk item in the device and chunk trees). The new group Y is added to the list of pending block groups in the transaction handle; 6) Before task B proceeds to removing the block group item for block group X from the extent tree, which has a key matching: (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length) task C while ending its transaction handle calls btrfs_create_pending_block_groups(), which finds block group Y and tries to insert the block group item for Y into the exten tree, which fails with -EEXIST since logical offset is the same that X had and task B hasn't yet deleted the key from the extent tree. This failure results in a transaction abort, producing a stack like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Code: ff ff ff 48 8b 55 50 f0 48 (...) RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001 RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001 R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10 R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000 FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs] btrfs_commit_transaction+0xd0/0xc50 [btrfs] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 iterate_supers+0xdb/0x180 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f2ce1d4d5b7 Code: 83 c4 08 48 3d 01 (...) RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7 RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00 R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032 R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a9c ]--- Fix this simply by making btrfs_remove_block_group() remove the block group's item from the extent tree before it flags the block group as removed. Also make the free space deletion from the free space tree before flagging the block group as removed, to avoid a similar race with adding and removing free space entries for the free space tree. Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix a block group ref counter leak after failure to remove block groupFilipe Manana
When removing a block group, if we fail to delete the block group's item from the extent tree, we jump to the 'out' label and end up decrementing the block group's reference count once only (by 1), resulting in a counter leak because the block group at that point was already removed from the block group cache rbtree - so we have to decrement the reference count twice, once for the rbtree and once for our lookup at the start of the function. There is a second bug where if removing the free space tree entries (the call to remove_block_group_free_space()) fails we end up jumping to the 'out_put_group' label but end up decrementing the reference count only once, when we should have done it twice, since we have already removed the block group from the block group cache rbtree. This happens because the reference count decrement for the rbtree reference happens after attempting to remove the free space tree entries, which is far away from the place where we remove the block group from the rbtree. To make things less error prone, decrement the reference count for the rbtree immediately after removing the block group from it. This also eleminates the need for two different exit labels on error, renaming 'out_put_label' to just 'out' and removing the old 'out'. Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16block: Fix use-after-free in blkdev_get()Jason Yan
In blkdev_get() we call __blkdev_get() to do some internal jobs and if there is some errors in __blkdev_get(), the bdput() is called which means we have released the refcount of the bdev (actually the refcount of the bdev inode). This means we cannot access bdev after that point. But acctually bdev is still accessed in blkdev_get() after calling __blkdev_get(). This results in use-after-free if the refcount is the last one we released in __blkdev_get(). Let's take a look at the following scenerio: CPU0 CPU1 CPU2 blkdev_open blkdev_open Remove disk bd_acquire blkdev_get __blkdev_get del_gendisk bdev_unhash_inode bd_acquire bdev_get_gendisk bd_forget failed because of unhashed bdput bdput (the last one) bdev_evict_inode access bdev => use after free [ 459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0 [ 459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132 [ 459.352347] [ 459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2 [ 459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 459.354947] Call Trace: [ 459.355337] dump_stack+0x111/0x19e [ 459.355879] ? __lock_acquire+0x24c1/0x31b0 [ 459.356523] print_address_description+0x60/0x223 [ 459.357248] ? __lock_acquire+0x24c1/0x31b0 [ 459.357887] kasan_report.cold+0xae/0x2d8 [ 459.358503] __lock_acquire+0x24c1/0x31b0 [ 459.359120] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.359784] ? lockdep_hardirqs_on+0x37b/0x580 [ 459.360465] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.361123] ? finish_task_switch+0x125/0x600 [ 459.361812] ? finish_task_switch+0xee/0x600 [ 459.362471] ? mark_held_locks+0xf0/0xf0 [ 459.363108] ? __schedule+0x96f/0x21d0 [ 459.363716] lock_acquire+0x111/0x320 [ 459.364285] ? blkdev_get+0xce/0xbe0 [ 459.364846] ? blkdev_get+0xce/0xbe0 [ 459.365390] __mutex_lock+0xf9/0x12a0 [ 459.365948] ? blkdev_get+0xce/0xbe0 [ 459.366493] ? bdev_evict_inode+0x1f0/0x1f0 [ 459.367130] ? blkdev_get+0xce/0xbe0 [ 459.367678] ? destroy_inode+0xbc/0x110 [ 459.368261] ? mutex_trylock+0x1a0/0x1a0 [ 459.368867] ? __blkdev_get+0x3e6/0x1280 [ 459.369463] ? bdev_disk_changed+0x1d0/0x1d0 [ 459.370114] ? blkdev_get+0xce/0xbe0 [ 459.370656] blkdev_get+0xce/0xbe0 [ 459.371178] ? find_held_lock+0x2c/0x110 [ 459.371774] ? __blkdev_get+0x1280/0x1280 [ 459.372383] ? lock_downgrade+0x680/0x680 [ 459.373002] ? lock_acquire+0x111/0x320 [ 459.373587] ? bd_acquire+0x21/0x2c0 [ 459.374134] ? do_raw_spin_unlock+0x4f/0x250 [ 459.374780] blkdev_open+0x202/0x290 [ 459.375325] do_dentry_open+0x49e/0x1050 [ 459.375924] ? blkdev_get_by_dev+0x70/0x70 [ 459.376543] ? __x64_sys_fchdir+0x1f0/0x1f0 [ 459.377192] ? inode_permission+0xbe/0x3a0 [ 459.377818] path_openat+0x148c/0x3f50 [ 459.378392] ? kmem_cache_alloc+0xd5/0x280 [ 459.379016] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.379802] ? path_lookupat.isra.0+0x900/0x900 [ 459.380489] ? __lock_is_held+0xad/0x140 [ 459.381093] do_filp_open+0x1a1/0x280 [ 459.381654] ? may_open_dev+0xf0/0xf0 [ 459.382214] ? find_held_lock+0x2c/0x110 [ 459.382816] ? lock_downgrade+0x680/0x680 [ 459.383425] ? __lock_is_held+0xad/0x140 [ 459.384024] ? do_raw_spin_unlock+0x4f/0x250 [ 459.384668] ? _raw_spin_unlock+0x1f/0x30 [ 459.385280] ? __alloc_fd+0x448/0x560 [ 459.385841] do_sys_open+0x3c3/0x500 [ 459.386386] ? filp_open+0x70/0x70 [ 459.386911] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 459.387610] ? trace_hardirqs_off_caller+0x55/0x1c0 [ 459.388342] ? do_syscall_64+0x1a/0x520 [ 459.388930] do_syscall_64+0xc3/0x520 [ 459.389490] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.390248] RIP: 0033:0x416211 [ 459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d 01 [ 459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002 [ 459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211 [ 459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00 [ 459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a [ 459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff [ 459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c [ 459.400168] [ 459.400430] Allocated by task 20132: [ 459.401038] kasan_kmalloc+0xbf/0xe0 [ 459.401652] kmem_cache_alloc+0xd5/0x280 [ 459.402330] bdev_alloc_inode+0x18/0x40 [ 459.402970] alloc_inode+0x5f/0x180 [ 459.403510] iget5_locked+0x57/0xd0 [ 459.404095] bdget+0x94/0x4e0 [ 459.404607] bd_acquire+0xfa/0x2c0 [ 459.405113] blkdev_open+0x110/0x290 [ 459.405702] do_dentry_open+0x49e/0x1050 [ 459.406340] path_openat+0x148c/0x3f50 [ 459.406926] do_filp_open+0x1a1/0x280 [ 459.407471] do_sys_open+0x3c3/0x500 [ 459.408010] do_syscall_64+0xc3/0x520 [ 459.408572] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.409415] [ 459.409679] Freed by task 1262: [ 459.410212] __kasan_slab_free+0x129/0x170 [ 459.410919] kmem_cache_free+0xb2/0x2a0 [ 459.411564] rcu_process_callbacks+0xbb2/0x2320 [ 459.412318] __do_softirq+0x225/0x8ac Fix this by delaying bdput() to the end of blkdev_get() which means we have finished accessing bdev. Fixes: 77ea887e433a ("implement in-kernel gendisk events handling") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-16afs: afs_vnode_commit_status() doesn't need to check the RPC errorDavid Howells
afs_vnode_commit_status() is only ever called if op->error is 0, so remove the op->error checks from the function. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Fix use of afs_check_for_remote_deletion()David Howells
afs_check_for_remote_deletion() checks to see if error ENOENT is returned by the server in response to an operation and, if so, marks the primary vnode as having been deleted as the FID is no longer valid. However, it's being called from the operation success functions, where no abort has happened - and if an inline abort is recorded, it's handled by afs_vnode_commit_status(). Fix this by actually calling the operation aborted method if provided and having that point to afs_check_for_remote_deletion(). Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Remove afs_operation::abort_codeDavid Howells
Remove afs_operation::abort_code as it's read but never set. Use ac.abort_code instead. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Fix yfs_fs_fetch_status() to honour vnode selectorDavid Howells
Fix yfs_fs_fetch_status() to honour the vnode selector in op->fetch_status.which as does afs_fs_fetch_status() that allows afs_do_lookup() to use this as an alternative to the InlineBulkStatus RPC call if not implemented by the server. This doesn't matter in the current code as YFS servers always implement InlineBulkStatus, but a subsequent will call it on YFS servers too in some circumstances. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Remove yfs_fs_fetch_file_status() as it's not usedDavid Howells
Remove yfs_fs_fetch_file_status() as it's no longer used. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16fs: Do not check if there is a fsnotify watcher on pseudo inodesMel Gorman
The kernel uses internal mounts created by kern_mount() and populated with files with no lookup path by alloc_file_pseudo() for a variety of reasons. An example of such a mount is for anonymous pipes. For pipes, every vfs_write() regardless of filesystem, calls fsnotify_modify() to notify of any changes which incurs a small amount of overhead in fsnotify even when there are no watchers. It can also trigger for reads and readv and writev, it was simply vfs_write() that was noticed first. A patch is pending that reduces, but does not eliminate, the overhead of fsnotify but for files that cannot be looked up via a path, even that small overhead is unnecessary. The user API for all notification subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd and proc entries appear to be the only visible representation of the files. Proc does not have the same pathname as the internal entry and the proc inode is not the same as the internal inode so even if fanotify is used on a file under /proc/XX/fd, no useful events are notified. This patch changes alloc_file_pseudo() to always opt out of fsnotify by setting FMODE_NONOTIFY flag so that no check is made for fsnotify watchers on pseudo files. This should be safe as the underlying helper for the dentry is d_alloc_pseudo() which explicitly states that no lookups are ever performed meaning that fanotify should have nothing useful to attach to. The test motivating this was "perf bench sched messaging --pipe". On a single-socket machine using threads the difference of the patch was as follows. 5.7.0 5.7.0 vanilla nofsnotify-v1r1 Amean 1 1.3837 ( 0.00%) 1.3547 ( 2.10%) Amean 3 3.7360 ( 0.00%) 3.6543 ( 2.19%) Amean 5 5.8130 ( 0.00%) 5.7233 * 1.54%* Amean 7 8.1490 ( 0.00%) 7.9730 * 2.16%* Amean 12 14.6843 ( 0.00%) 14.1820 ( 3.42%) Amean 18 21.8840 ( 0.00%) 21.7460 ( 0.63%) Amean 24 28.8697 ( 0.00%) 29.1680 ( -1.03%) Amean 30 36.0787 ( 0.00%) 35.2640 * 2.26%* Amean 32 38.0527 ( 0.00%) 38.1223 ( -0.18%) The difference is small but in some cases it's outside the noise so while marginal, there is still some small benefit to ignoring fsnotify for files allocated via alloc_file_pseudo() in some cases. Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-15Squashfs: Replace zero-length array with flexible-arrayGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15jffs2: Replace zero-length array with flexible-arrayGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15aio: Replace zero-length array with flexible-arrayGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15Merge tag 'ext4-for-linus-5.8-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull more ext4 updates from Ted Ts'o: "This is the second round of ext4 commits for 5.8 merge window [1]. It includes the per-inode DAX support, which was dependant on the DAX infrastructure which came in via the XFS tree, and a number of regression and bug fixes; most notably the "BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported by syzkaller" [1] The pull request actually came in 15 minutes after I had tagged the rc1 release. Tssk, tssk, late.. - Linus * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers ext4: support xattr gnu.* namespace for the Hurd ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr ext4: avoid utf8_strncasecmp() with unstable name ext4: stop overwrite the errcode in ext4_setup_super ext4: fix partial cluster initialization when splitting extent ext4: avoid race conditions when remounting with options that change dax Documentation/dax: Update DAX enablement for ext4 fs/ext4: Introduce DAX inode flag fs/ext4: Remove jflag variable fs/ext4: Make DAX mount option a tri-state fs/ext4: Only change S_DAX on inode load fs/ext4: Update ext4_should_use_dax() fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS fs/ext4: Disallow verity if inode is DAX fs/ext4: Narrow scope of DAX check in setflags
2020-06-15io_uring: cancel by ->task not pidPavel Begunkov
For an exiting process it tries to cancel all its inflight requests. Use req->task to match such instead of work.pid. We always have req->task set, and it will be valid because we're matching only current exiting task. Also, remove work.pid and everything related, it's useless now. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io_uring: lazy get taskPavel Begunkov
There will be multiple places where req->task is used, so refcount-pin it lazily with introduced *io_{get,put}_req_task(). We need to always have valid ->task for cancellation reasons, but don't care about pinning it in some cases. That's why it sets req->task in io_req_init() and implements get/put laziness with a flag. This also removes using @current from polling io_arm_poll_handler(), etc., but doesn't change observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io_uring: batch cancel in io_uring_cancel_files()Pavel Begunkov
Instead of waiting for each request one by one, first try to cancel all of them in a batched manner, and then go over inflight_list/etc to reap leftovers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io_uring: cancel all task's requests on exitPavel Begunkov
If a process is going away, io_uring_flush() will cancel only 1 request with a matching pid. Cancel all of them Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io-wq: add an option to cancel all matched reqsPavel Begunkov
This adds support for cancelling all io-wq works matching a predicate. It isn't used yet, so no change in observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io-wq: reorder cancellation pending -> runningPavel Begunkov
Go all over all pending lists and cancel works there, and only then try to match running requests. No functional changes here, just a preparation for bulk cancellation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15afs: Fix the mapping of the UAEOVERFLOW abort codeDavid Howells
Abort code UAEOVERFLOW is returned when we try and set a time that's out of range, but it's currently mapped to EREMOTEIO by the default case. Fix UAEOVERFLOW to map instead to EOVERFLOW. Found with the generic/258 xfstest. Note that the test is wrong as it assumes that the filesystem will support a pre-UNIX-epoch date. Fixes: 1eda8bab70ca ("afs: Add support for the UAE error table") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix truncation issues and mmap writeback sizeDavid Howells
Fix the following issues: (1) Fix writeback to reduce the size of a store operation to i_size, effectively discarding the extra data. The problem comes when afs_page_mkwrite() records that a page is about to be modified by mmap(). It doesn't know what bits of the page are going to be modified, so it records the whole page as being dirty (this is stored in page->private as start and end offsets). Without this, the marshalling for the store to the server extends the size of the file to the end of the page (in afs_fs_store_data() and yfs_fs_store_data()). (2) Fix setattr to actually truncate the pagecache, thereby clearing the discarded part of a file. (3) Fix setattr to check that the new size is okay and to disable ATTR_SIZE if i_size wouldn't change. (4) Force i_size to be updated as the result of a truncate. (5) Don't truncate if ATTR_SIZE is not set. (6) Call pagecache_isize_extended() if the file was enlarged. Note that truncate_set_size() isn't used because the setting of i_size is done inside afs_vnode_commit_status() under the vnode->cb_lock. Found with the generic/029 and generic/393 xfstests. Fixes: 31143d5d515e ("AFS: implement basic file write support") Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Concoct ctimesDavid Howells
The in-kernel afs filesystem ignores ctime because the AFS fileserver protocol doesn't support ctimes. This, however, causes various xfstests to fail. Work around this by: (1) Setting ctime to attr->ia_ctime in afs_setattr(). (2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings. (3) Setting the ctime from the server mtime when on the target file when creating a hard link to it. (4) Setting the ctime on directories from their revised mtimes when renaming/moving a file. Found by the generic/221 and generic/309 xfstests. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix EOF corruptionDavid Howells
When doing a partial writeback, afs_write_back_from_locked_page() may generate an FS.StoreData RPC request that writes out part of a file when a file has been constructed from pieces by doing seek, write, seek, write, ... as is done by ld. The FS.StoreData RPC is given the current i_size as the file length, but the server basically ignores it unless the data length is 0 (in which case it's just a truncate operation). The revised file length returned in the result of the RPC may then not reflect what we suggested - and this leads to i_size getting moved backwards - which causes issues later. Fix the client to take account of this by ignoring the returned file size unless the data version number jumped unexpectedly - in which case we're going to have to clear the pagecache and reload anyway. This can be observed when doing a kernel build on an AFS mount. The following pair of commands produce the issue: ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \ -T arch/x86/realmode/rm/realmode.lds \ arch/x86/realmode/rm/header.o \ arch/x86/realmode/rm/trampoline_64.o \ arch/x86/realmode/rm/stack.o \ arch/x86/realmode/rm/reboot.o \ -o arch/x86/realmode/rm/realmode.elf arch/x86/tools/relocs --realmode \ arch/x86/realmode/rm/realmode.elf \ >arch/x86/realmode/rm/realmode.relocs This results in the latter giving: Cannot read ELF section headers 0/18: Success as the realmode.elf file got corrupted. The sequence of events can also be driven with: xfs_io -t -f \ -c "pwrite -S 0x58 0 0x58" \ -c "pwrite -S 0x59 10000 1000" \ -c "close" \ /afs/example.com/scratch/a Fixes: 31143d5d515e ("AFS: implement basic file write support") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: afs_write_end() should change i_size under the right lockDavid Howells
Fix afs_write_end() to change i_size under vnode->cb_lock rather than ->wb_lock so that it doesn't race with afs_vnode_commit_status() and afs_getattr(). The ->wb_lock is only meant to guard access to ->wb_keys which isn't accessed by that piece of code. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix non-setting of mtime when writing into mmapDavid Howells
The mtime on an inode needs to be updated when a write is made into an mmap'ed section. There are three ways in which this could be done: update it when page_mkwrite is called, update it when a page is changed from dirty to writeback or leave it to the server and fix the mtime up from the reply to the StoreData RPC. Found with the generic/215 xfstest. Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15io_uring: fix lazy work initPavel Begunkov
Don't leave garbage in req.work before punting async on -EAGAIN in io_iopoll_queue(). [ 140.922099] general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI ... [ 140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480 ... [ 140.922114] Call Trace: [ 140.922118] ? __next_timer_interrupt+0xe0/0xe0 [ 140.922119] io_wqe_worker+0x2a9/0x360 [ 140.922121] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 140.922124] kthread+0x12c/0x170 [ 140.922125] ? io_worker_handle_work+0x480/0x480 [ 140.922126] ? kthread_park+0x90/0x90 [ 140.922127] ret_from_fork+0x22/0x30 Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15efivarfs: Don't return -EINTR when rate-limiting readsTony Luck
Applications that read EFI variables may see a return value of -EINTR if they exceed the rate limit and a signal delivery is attempted while the process is sleeping. This is quite surprising to the application, which probably doesn't have code to handle it. Change the interruptible sleep to a non-interruptible one. Reported-by: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200528194905.690-3-tony.luck@intel.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-06-15efivarfs: Update inode modification time for successful writesTony Luck
Some applications want to be able to see when EFI variables have been updated. Update the modification time for successful writes. Reported-by: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200528194905.690-2-tony.luck@intel.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-06-14dlmfs: clean up dlmfs_file_{read,write}() a bitAl Viro
The damn file is constant-sized - 64 bytes. IOW, * i_size_read() is pointless * so's dynamic allocation * so's the 'size' argument of user_dlm_read_lvb() * ... and so's open-coding simple_read_from_buffer(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-14Merge tag 'for-5.8-part2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This reverts the direct io port to iomap infrastructure of btrfs merged in the first pull request. We found problems in invalidate page that don't seem to be fixable as regressions or without changing iomap code that would not affect other filesystems. There are four reverts in total, but three of them are followup cleanups needed to revert a43a67a2d715 cleanly. The result is the buffer head based implementation of direct io. Reverts are not great, but under current circumstances I don't see better options" * tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: switch to iomap_dio_rw() for dio" Revert "fs: remove dio_end_io()" Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK" Revert "btrfs: split btrfs_direct_IO to read and write part"
2020-06-14Revert "btrfs: switch to iomap_dio_rw() for dio"David Sterba
This reverts commit a43a67a2d715540c1368b9501a22b0373b5874c0. This patch reverts the main part of switching direct io implementation to iomap infrastructure. There's a problem in invalidate page that couldn't be solved as regression in this development cycle. The problem occurs when buffered and direct io are mixed, and the ranges overlap. Although this is not recommended, filesystems implement measures or fallbacks to make it somehow work. In this case, fallback to buffered IO would be an option for btrfs (this already happens when direct io is done on compressed data), but the change would be needed in the iomap code, bringing new semantics to other filesystems. Another problem arises when again the buffered and direct ios are mixed, invalidation fails, then -EIO is set on the mapping and fsync will fail, though there's no real error. There have been discussions how to fix that, but revert seems to be the least intrusive option. Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/ Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-13Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull more cifs updates from Steve French: "12 cifs/smb3 fixes, 2 for stable. - add support for idsfromsid on create and chgrp/chown allowing ability to save owner information more naturally for some workloads - improve query info (getattr) when SMB3.1.1 posix extensions are negotiated by using new query info level" * tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: smb3: Add debug message for new file creation with idsfromsid mount option cifs: fix chown and chgrp when idsfromsid mount option enabled smb3: allow uid and gid owners to be set on create with idsfromsid mount option smb311: Add tracepoints for new compound posix query info smb311: add support for using info level for posix extensions query smb311: Add support for lookup with posix extensions query info smb311: Add support for SMB311 query info (non-compounded) SMB311: Add support for query info using posix extensions (level 100) smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl smb3: fix typo in mount options displayed in /proc/mounts cifs: Add get_security_type_str function to return sec type. smb3: extend fscache mount volume coherency check
2020-06-13Merge tag 'kbuild-v5.8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: treewide: replace '---help---' in Kconfig files with 'help' kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables samples: binderfs: really compile this sample and fix build issues
2020-06-13Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap fix from Darrick Wong: "A single iomap bug fix for a variable type mistake on 32-bit architectures, fixing an integer overflow problem in the unshare actor" * tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Fix unsharing of an extent >2GB on a 32-bit machine
2020-06-13Merge tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "We've settled down into the bugfix phase; this one fixes a resource leak on an error bailout path" * tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()
2020-06-14treewide: replace '---help---' in Kconfig files with 'help'Masahiro Yamada
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-13Merge tag 'notifications-20200601' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull notification queue from David Howells: "This adds a general notification queue concept and adds an event source for keys/keyrings, such as linking and unlinking keys and changing their attributes. Thanks to Debarshi Ray, we do have a pull request to use this to fix a problem with gnome-online-accounts - as mentioned last time: https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 Without this, g-o-a has to constantly poll a keyring-based kerberos cache to find out if kinit has changed anything. [ There are other notification pending: mount/sb fsinfo notifications for libmount that Karel Zak and Ian Kent have been working on, and Christian Brauner would like to use them in lxc, but let's see how this one works first ] LSM hooks are included: - A set of hooks are provided that allow an LSM to rule on whether or not a watch may be set. Each of these hooks takes a different "watched object" parameter, so they're not really shareable. The LSM should use current's credentials. [Wanted by SELinux & Smack] - A hook is provided to allow an LSM to rule on whether or not a particular message may be posted to a particular queue. This is given the credentials from the event generator (which may be the system) and the watch setter. [Wanted by Smack] I've provided SELinux and Smack with implementations of some of these hooks. WHY === Key/keyring notifications are desirable because if you have your kerberos tickets in a file/directory, your Gnome desktop will monitor that using something like fanotify and tell you if your credentials cache changes. However, we also have the ability to cache your kerberos tickets in the session, user or persistent keyring so that it isn't left around on disk across a reboot or logout. Keyrings, however, cannot currently be monitored asynchronously, so the desktop has to poll for it - not so good on a laptop. This facility will allow the desktop to avoid the need to poll. DESIGN DECISIONS ================ - The notification queue is built on top of a standard pipe. Messages are effectively spliced in. The pipe is opened with a special flag: pipe2(fds, O_NOTIFICATION_PIPE); The special flag has the same value as O_EXCL (which doesn't seem like it will ever be applicable in this context)[?]. It is given up front to make it a lot easier to prohibit splice&co from accessing the pipe. [?] Should this be done some other way? I'd rather not use up a new O_* flag if I can avoid it - should I add a pipe3() system call instead? The pipe is then configured:: ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); Messages are then read out of the pipe using read(). - It should be possible to allow write() to insert data into the notification pipes too, but this is currently disabled as the kernel has to be able to insert messages into the pipe *without* holding pipe->mutex and the code to make this work needs careful auditing. - sendfile(), splice() and vmsplice() are disabled on notification pipes because of the pipe->mutex issue and also because they sometimes want to revert what they just did - but one or more notification messages might've been interleaved in the ring. - The kernel inserts messages with the wait queue spinlock held. This means that pipe_read() and pipe_write() have to take the spinlock to update the queue pointers. - Records in the buffer are binary, typed and have a length so that they can be of varying size. This allows multiple heterogeneous sources to share a common buffer; there are 16 million types available, of which I've used just a few, so there is scope for others to be used. Tags may be specified when a watchpoint is created to help distinguish the sources. - Records are filterable as types have up to 256 subtypes that can be individually filtered. Other filtration is also available. - Notification pipes don't interfere with each other; each may be bound to a different set of watches. Any particular notification will be copied to all the queues that are currently watching for it - and only those that are watching for it. - When recording a notification, the kernel will not sleep, but will rather mark a queue as having lost a message if there's insufficient space. read() will fabricate a loss notification message at an appropriate point later. - The notification pipe is created and then watchpoints are attached to it, using one of: keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); watch_mount(AT_FDCWD, "/", 0, fd, 0x02); watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); where in both cases, fd indicates the queue and the number after is a tag between 0 and 255. - Watches are removed if either the notification pipe is destroyed or the watched object is destroyed. In the latter case, a message will be generated indicating the enforced watch removal. Things I want to avoid: - Introducing features that make the core VFS dependent on the network stack or networking namespaces (ie. usage of netlink). - Dumping all this stuff into dmesg and having a daemon that sits there parsing the output and distributing it as this then puts the responsibility for security into userspace and makes handling namespaces tricky. Further, dmesg might not exist or might be inaccessible inside a container. - Letting users see events they shouldn't be able to see. TESTING AND MANPAGES ==================== - The keyutils tree has a pipe-watch branch that has keyctl commands for making use of notifications. Proposed manual pages can also be found on this branch, though a couple of them really need to go to the main manpages repository instead. If the kernel supports the watching of keys, then running "make test" on that branch will cause the testing infrastructure to spawn a monitoring process on the side that monitors a notifications pipe for all the key/keyring changes induced by the tests and they'll all be checked off to make sure they happened. https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch - A test program is provided (samples/watch_queue/watch_test) that can be used to monitor for keyrings, mount and superblock events. Information on the notifications is simply logged to stdout" * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: smack: Implement the watch_key and post_notification hooks selinux: Implement the watch_key security hook keys: Make the KEY_NEED_* perms an enum rather than a mask pipe: Add notification lossage handling pipe: Allow buffers to be marked read-whole-or-error for notifications Add sample notification program watch_queue: Add a key/keyring notification facility security: Add hooks to rule on setting a watch pipe: Add general notification queue support pipe: Add O_NOTIFICATION_PIPE security: Add a hook for the point of notification insertion uapi: General notification queue definitions
2020-06-12smb3: Add debug message for new file creation with idsfromsid mount optionSteve French
Pavel noticed that a debug message (disabled by default) in creating the security descriptor context could be useful for new file creation owner fields (as we already have for the mode) when using mount parm idsfromsid. [38120.392272] CIFS: FYI: owner S-1-5-88-1-0, group S-1-5-88-2-0 [38125.792637] CIFS: FYI: owner S-1-5-88-1-1000, group S-1-5-88-2-1000 Also cleans up a typo in a comment Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12Merge branch 'proc-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc fix from Eric Biederman: "Much to my surprise syzbot found a very old bug in proc that the recent changes made easier to reproce. This bug is subtle enough it looks like it fooled everyone who should know better" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Use new_inode not new_inode_pseudo
2020-06-12proc: Use new_inode not new_inode_pseudoEric W. Biederman
Recently syzbot reported that unmounting proc when there is an ongoing inotify watch on the root directory of proc could result in a use after free when the watch is removed after the unmount of proc when the watcher exits. Commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc") made it easier to unmount proc and allowed syzbot to see the problem, but looking at the code it has been around for a long time. Looking at the code the fsnotify watch should have been removed by fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode was allocated with new_inode_pseudo instead of new_inode so the inode was not on the sb->s_inodes list. Which prevented fsnotify_unmount_inodes from finding the inode and removing the watch as well as made it so the "VFS: Busy inodes after unmount" warning could not find the inodes to warn about them. Make all of the inodes in proc visible to generic_shutdown_super, and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo. The only functional difference is that new_inode places the inodes on the sb->s_inodes list. I wrote a small test program and I can verify that without changes it can trigger this issue, and by replacing new_inode_pseudo with new_inode the issues goes away. Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com Fixes: 0097875bd415 ("proc: Implement /proc/thread-self to point at the directory of the current thread") Fixes: 021ada7dff22 ("procfs: switch /proc/self away from proc_dir_entry") Fixes: 51f0885e5415 ("vfs,proc: guarantee unique inodes in /proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>