path: root/fs

2022-05-16  btrfs: remove unnecessary btrfs_i_size_write(0) calls  (Omar Sandoval)

btrfs_new_inode() always returns an inode with i_size and disk_i_size set to 0 (via inode_init_always() and btrfs_alloc_inode(), respectively). Remove the unnecessary calls to btrfs_i_size_write() in btrfs_mkdir() and btrfs_create_subvol_root().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-05-16  btrfs: get rid of btrfs_add_nondir()  (Omar Sandoval)

This is a trivial wrapper around btrfs_add_link(). The only thing it does other than moving arguments around is translating a > 0 return value to -EEXIST. As far as I can tell, btrfs_add_link() won't return > 0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would be broken). The check itself dates back to commit 2c90e5d65842 ("Btrfs: still corruption hunting"), so it's probably left over from debugging. Let's just get rid of btrfs_add_nondir().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-05-16  btrfs: fix anon_dev leak in create_subvol()  (Omar Sandoval)

When btrfs_qgroup_inherit(), btrfs_alloc_tree_block(), or btrfs_insert_root() fail in create_subvol(), we return without freeing anon_dev. Reorganize the error handling in create_subvol() to fix this.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-05-16  btrfs: reserve correct number of items for rename  (Omar Sandoval)

btrfs_rename() and btrfs_rename_exchange() don't account for enough items. Replace the incorrect explanations with a specific breakdown of the number of items and account them accurately. Note that this glosses over RENAME_WHITEOUT because the next commit is going to rework that, too.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-05-16  btrfs: reserve correct number of items for unlink and rmdir  (Omar Sandoval)

__btrfs_unlink_inode() calls btrfs_update_inode() on the parent directory in order to update its size and sequence number. Make sure we account for it.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2022-05-16  m68knommu: allow elf_fdpic loader to be selected  (Greg Ungerer)

The m68k architecture code is capable of supporting the binfmt_elf_fdpic loader, so allow it to be configured. It is restricted to nommu configurations at this time due to the MMU context structures/code not supporting everything elf_fdpic needs when MMU is enabled.

Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>

2022-05-14  Unify the primitives for file descriptor closing  (Al Viro)

Currently we have 3 primitives for removing an opened file from the descriptor table - pick_file(), __close_fd_get_file() and close_fd_get_file(). Their calling conventions are rather odd and there's code duplication for no good reason. They can be unified:

1) have __range_close() cap max_fd at the very beginning; that way we don't need a separate way for pick_file() to report being past the end of the descriptor table.

2) make {__,}close_fd_get_file() return the file (or NULL) directly, rather than returning it via a struct file ** argument. Don't bother with the (bogus) return value - nobody wants that -ENOENT.

3) make pick_file() return NULL on an unopened descriptor - the only caller that used to care about the distinction between a descriptor past the end of the descriptor table and finding NULL in the descriptor table doesn't give a damn after (1).

4) lift ->files_lock out of pick_file().

That actually simplifies the callers, as well as the primitives themselves. The code duplication is also gone.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
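A minimal sketch of the unified calling convention, per point (2) above (hedged: approximated from this description, not the exact tree; the replacement error code is the caller's choice):

    struct file *file;

    file = close_fd_get_file(fd);   /* returns the file directly; NULL if fd wasn't open */
    if (!file)
        return -EBADF;
    /* ... use file, then fput(file) ... */
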

2022-05-14  fs: remove fget_many and fput_many interface  (Gou Hao)

These two interfaces were added in commit 091141a42, but now there is no place that calls them. The only user of fput/fget_many() was removed in commit 62906e89e63b ("io_uring: remove file batch-get optimisation"), and the user of get_file_rcu_many() was removed in commit f073531070d2 ("init: add an init_dup helper"). Replacing atomic_long_sub/add with atomic_long_dec/inc also improves performance.

Here are the test results of unixbench:

Cmd: ./Run -c 64 context1

Without patch:
System Benchmarks Partial Index    BASELINE     RESULT     INDEX
Pipe-based Context Switching         4000.0  2798407.0    6996.0
                                                        ========
System Benchmarks Index Score (Partial Only)              6996.0

With patch:
System Benchmarks Partial Index    BASELINE     RESULT     INDEX
Pipe-based Context Switching         4000.0  3486268.8    8715.7
                                                        ========
System Benchmarks Index Score (Partial Only)              8715.7

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
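An illustrative sketch of the refcount change mentioned above (hedged: fget_one()/fput_one() are hypothetical names; the point is only the add/sub-of-a-constant-1 becoming inc/dec, which maps to a cheaper instruction on most architectures):

    /* was: atomic_long_add(1, &file->f_count); */
    static inline void fget_one(struct file *file)
    {
        atomic_long_inc(&file->f_count);
    }

    /* was: atomic_long_sub_and_test(1, &file->f_count); */
    static inline bool fput_one(struct file *file)
    {
        return atomic_long_dec_and_test(&file->f_count);
    }
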

2022-05-14  io_uring: implement multishot mode for accept  (Hao Xu)

Refactor io_accept() to support multishot mode.

Theoretical analysis:

1) When connections come in fast:

   - singleshot:

         add accept sqe (userspace) --> accept inline
                       ^                      |
                       |----------------------|

   - multishot:

         add accept sqe (userspace) --> accept inline
                                          ^       |
                                          |---*---|

     We accept repeatedly at the * place until we get EAGAIN.

2) When connections come in at low pressure: similar to 1), we save a lot of userspace-kernel context switches and useless vfs_poll() calls.

Tests were structured like this:

    server                  client (multiple)
    accept                  connect
    read                    write
    write                   read
    close                   close

Basically, spin up a number of clients (on the same machine as the server) to connect to the server and write some data to it; the server writes the data back to the client after receiving it, then closes the connection once the write returns. The client then reads the data and closes the connection. Here I tested 10000 clients connecting to one server, with a data size of 128 bytes. Each client has its own goroutine, so they reach the server in a short time.

Time spent over 20 runs before/after this patchset (unit: cycles, the return value of clock()):

before: (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984+1934226+1914385)/20.0 = 1927633.75

after: (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506+1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
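A hedged userspace sketch of how multishot accept is consumed (assumptions: a liburing new enough to provide the io_uring_prep_multishot_accept() helper; IORING_CQE_F_MORE in the completion flags signals that the accept request is still armed):

    #include <liburing.h>

    void accept_loop(struct io_uring *ring, int listen_fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;

        /* one SQE; completions keep arriving until the request terminates */
        io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
        io_uring_submit(ring);

        for (;;) {
            if (io_uring_wait_cqe(ring, &cqe))
                break;
            if (cqe->res >= 0) {
                /* cqe->res is an accepted socket fd */
            }
            if (!(cqe->flags & IORING_CQE_F_MORE)) {
                /* request terminated (e.g. error); re-arm if desired */
            }
            io_uring_cqe_seen(ring, cqe);
        }
    }
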

2022-05-14  io_uring: let fast poll support multishot  (Hao Xu)

For operations like accept, multishot is a useful feature, since we can reduce the number of accept SQEs. Let's integrate it into fast poll; it may be good for other operations in the future.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-14  io_uring: add REQ_F_APOLL_MULTISHOT for requests  (Hao Xu)

Add a flag to indicate multishot mode for fast poll. Currently only accept uses it, but there may be more operations leveraging it in the future. Also add a mask, IO_APOLL_MULTI_POLLED, which stands for REQ_F_APOLL_MULTISHOT | REQ_F_POLLED, to make the code shorter and cleaner.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-14  gfs2: replace 'found' with dedicated list iterator variable  (Jakob Koschel)

To move the list iterator variable into the list_for_each_entry_*() macros in the future, the list iterator variable should not be used after the loop body. To *never* use the list iterator variable after the loop, it was concluded to use a separate iterator variable instead of a found boolean [1]. This removes the need for a found variable: simply checking whether the dedicated variable was set determines whether the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

2022-05-13  proc/sysctl: make protected_* world readable  (Julius Hemanth Pitti)

protected_* files have 600 permissions, which prevents non-superusers from reading them. Containers like "AWS greengrass" refuse to launch unless protected_hardlinks and protected_symlinks are set. When such containers run with "userns-remap" or "--user" mapping the container's root to a non-superuser on the host, they fail to run due to denied read access to these files. As these protections are hardly a secret and pose no security risk, make them world readable. Though the greengrass use case above needs read access only to protected_hardlinks and protected_symlinks, set all other protected_* files to 644 as well for consistency.

Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com
Fixes: 800179c9b8a1 ("fs: add link restrictions")
Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
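An illustrative ctl_table fragment for one of these knobs (hedged: field values approximated from the commit message; the real table lives in kernel/sysctl.c):

    {
        .procname     = "protected_symlinks",
        .data         = &sysctl_protected_symlinks,
        .maxlen       = sizeof(int),
        .mode         = 0644,                /* was 0600 */
        .proc_handler = proc_dointvec_minmax,
        .extra1       = SYSCTL_ZERO,
        .extra2       = SYSCTL_ONE,
    },
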

2022-05-13  Merge tag 'gfs2-v5.18-rc4-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2  (Linus Torvalds)

Pull gfs2 fixes from Andreas Gruenbacher:
"We've finally identified commit dc732906c245 ("gfs2: Introduce flag for glock holder auto-demotion") to be the other cause of the filesystem corruption we've been seeing. This feature isn't strictly necessary anymore, so we've decided to stop using it for now. With this and the gfs_iomap_end rounding fix you've already seen ("gfs2: Fix filesystem block deallocation for short writes" in this pull request), we're corruption free again now.

 - Fix filesystem block deallocation for short writes.
 - Stop using glock holder auto-demotion for now.
 - Get rid of buffered writes inefficiencies due to page faults being disabled.
 - Minor other cleanups"

* tag 'gfs2-v5.18-rc4-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Stop using glock holder auto-demotion for now
  gfs2: buffered write prefaulting
  gfs2: Align read and write chunks to the page cache
  gfs2: Pull return value test out of should_fault_in_pages
  gfs2: Clean up use of fault_in_iov_iter_{read,write}able
  gfs2: Variable rename
  gfs2: Fix filesystem block deallocation for short writes

2022-05-13  ext4: add unmount filesystem message  (Zhang Yi)

Now that we have a kernel message at mount time, a system administrator can easily find the mount time, device, and options. But we don't have a corresponding message at umount time, so we cannot easily tell when someone unmounts a filesystem. Some modern filesystems (e.g. xfs) print a kernel message at unmount, so add one to ext4 for convenience:

  EXT4-fs (sdb): mounted filesystem with ordered data mode. Quota mode: none.
  EXT4-fs (sdb): unmounting filesystem.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220412145320.2669897-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
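A plausible shape of the change (hedged: the message text comes from the log lines quoted above; the exact placement inside ext4_put_super() is an assumption):

    static void ext4_put_super(struct super_block *sb)
    {
        /* ... existing teardown ... */
        ext4_msg(sb, KERN_INFO, "unmounting filesystem.");
        /* ... */
    }
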

2022-05-13  io_uring: only wake when the correct events are set  (Dylan Yudaken)

The check for waking up a request compares the poll_t bits; however, these always contain some common flags, so the wakeup always fires. For files with single wait queues, such as sockets, this can cause the request to be sent to the async worker unnecessarily. Further, if the file is non-blocking, the request will complete with EAGAIN, which is not desired. Here we exclude these common events, making sure not to exclude POLLERR, which might be important.

Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-13  gfs2: Stop using glock holder auto-demotion for now  (Andreas Gruenbacher)

We're having unresolved issues with the glock holder auto-demotion mechanism introduced in commit dc732906c245. This mechanism was assumed to be essential for avoiding frequent short reads and writes until commit 296abc0d91d8 ("gfs2: No short reads or writes upon glock contention"). Since then, when the inode glock is lost, it is simply re-acquired and the operation is resumed. This means that apart from the performance penalty, we might as well drop the inode glock before faulting in pages, and re-acquire it afterwards.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

2022-05-13  gfs2: buffered write prefaulting  (Andreas Gruenbacher)

In gfs2_file_buffered_write, to increase the likelihood that all the user memory we're trying to write will be resident in memory, carry out the write in chunks and fault in each chunk of user memory before trying to write it. Otherwise, some workloads will trigger frequent short "internal" writes, causing filesystem blocks to be allocated and then partially deallocated again when writing into holes, which is wasteful and breaks reservations. Neither the chunked writes nor any of the short "internal" writes are user visible.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
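A rough sketch of the chunked prefaulting idea (hedged: CHUNK_SIZE and gfs2_write_chunk() are hypothetical names; fault_in_iov_iter_readable() returns the number of bytes it could NOT fault in):

    ssize_t written = 0;

    while (iov_iter_count(from)) {
        size_t chunk = min_t(size_t, iov_iter_count(from), CHUNK_SIZE);

        /* prefault the next chunk of user memory */
        if (fault_in_iov_iter_readable(from, chunk) == chunk)
            return written ? written : -EFAULT;   /* nothing resident */

        /* write up to @chunk bytes; short "internal" writes are now
         * limited to pages reclaimed between fault-in and copy */
        written += gfs2_write_chunk(file, from, chunk);
    }
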

2022-05-13  ext4: remove unnecessary conditionals  (Lv Ruyi)

iput() already handles both NULL and non-NULL parameters, so there is no need to guard it with if().

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Link: https://lore.kernel.org/r/20220411032337.2517465-1-lv.ruyi@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

2022-05-13  gfs2: Align read and write chunks to the page cache  (Andreas Gruenbacher)

Align the chunks that reads and writes are carried out in to the page cache rather than the user buffers. This will be more efficient in general, especially for allocating writes. Optimizing the case that the user buffer is gfs2 backed isn't very useful; we only need to make sure we won't deadlock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

2022-05-13  gfs2: Pull return value test out of should_fault_in_pages  (Andreas Gruenbacher)

Pull the return value test of the previous read or write operation out of should_fault_in_pages(). In a following patch, we'll fault in pages before the I/O and there will be no return value to check.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

2022-05-13  gfs2: Clean up use of fault_in_iov_iter_{read,write}able  (Andreas Gruenbacher)

No need to store the return value of the fault_in functions in separate variables.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

2022-05-13  gfs2: Variable rename  (Andreas Gruenbacher)

Instead of counting the number of bytes read from the filesystem, functions gfs2_file_direct_read and gfs2_file_read_iter count the number of bytes written into the user buffer. Conversely, functions gfs2_file_direct_write and gfs2_file_buffered_write count the number of bytes read from the user buffer. This is nothing but confusing, so change the read functions to count how many bytes they have read, and the write functions to count how many bytes they have written.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

2022-05-13  gfs2: Fix filesystem block deallocation for short writes  (Andreas Gruenbacher)

When a write cannot be carried out in full, gfs2_iomap_end() releases blocks that have been allocated for this write but haven't been used. To compute the end of the allocation, gfs2_iomap_end() incorrectly rounded the end of the attempted write down to the next block boundary to arrive at the end of the allocation. It would have to round up, but the end of the allocation is also available as iomap->offset + iomap->length, so just use that instead. In addition, use round_up() for computing the start of the unused range.

Fixes: 64bc06bb32ee ("gfs2: iomap buffered write support")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
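The shape of the fix as described (hedged sketch, not the verbatim diff; gfs2_free_unused_blocks() is a hypothetical stand-in for the deallocation path):

    /* the end of the allocation comes straight from the iomap ... */
    loff_t alloc_end = iomap->offset + iomap->length;
    /* ... and the start of the unused range must round *up* */
    loff_t unused_start = round_up(pos + written, i_blocksize(inode));

    if (unused_start < alloc_end)
        gfs2_free_unused_blocks(inode, unused_start, alloc_end);
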

2022-05-13  Merge tag 'ceph-for-5.18-rc7' of https://github.com/ceph/ceph-client  (Linus Torvalds)

Pull ceph fix from Ilya Dryomov:
"Two fixes to properly maintain xattrs on async creates and thus preserve SELinux context on newly created files and to avoid improper usage of folio->private field which triggered BUG_ONs. Both marked for stable."

* tag 'ceph-for-5.18-rc7' of https://github.com/ceph/ceph-client:
  ceph: check folio PG_private bit instead of folio->private
  ceph: fix setting of xattrs on async created inodes

2022-05-13  Merge tag 'nfs-for-5.18-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs  (Linus Torvalds)

Pull NFS client bugfixes from Trond Myklebust:
"One more pull request. There was a bug in the fix to ensure that gss-proxy continues to work correctly after we fixed the AF_LOCAL socket leak in the RPC code. This therefore reverts that broken patch, and replaces it with one that works correctly.

 Stable fixes:
 - SUNRPC: Ensure that the gssproxy client can start in a connected state

 Bugfixes:
 - Revert "SUNRPC: Ensure gss-proxy connects on setup"
 - nfs: fix broken handling of the softreval mount option"

* tag 'nfs-for-5.18-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: fix broken handling of the softreval mount option
  SUNRPC: Ensure that the gssproxy client can start in a connected state
  Revert "SUNRPC: Ensure gss-proxy connects on setup"

2022-05-13  Merge tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)

Pull misc fixes from Andrew Morton:
"Seven MM fixes, three of which address issues added in the most recent merge window, four of which are cc:stable. Three non-MM fixes, none very serious."

[ And yes, that's a real pull request from Andrew, not me creating a branch from emailed patches. Woo-hoo! ]

* tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add a mailing list for DAMON development
  selftests: vm: Makefile: rename TARGETS to VMTARGETS
  mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
  mailmap: add entry for martyna.szapar-mudlaw@intel.com
  arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
  procfs: prevent unprivileged processes accessing fdinfo dir
  mm: mremap: fix sign for EFAULT error return value
  mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
  mm/huge_memory: do not overkill when splitting huge_zero_page
  Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"

2022-05-13  mm/uffd: enable write protection for shmem & hugetlbfs  (Peter Xu)

We've had all the necessary changes ready for both shmem and hugetlbfs. Turn on all the shmem/hugetlbfs switches for userfaultfd-wp. We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too, because all existing types now support write protection mode. Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.

Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-05-13  mm/pagemap: recognize uffd-wp bit for shmem/hugetlbfs  (Peter Xu)

This requires the pagemap code to be able to recognize the newly introduced swap special pte for uffd-wp, as well as the general hugetlb case that we recently started to support. It should make pagemap uffd-wp support complete.

Link: https://lkml.kernel.org/r/20220405014923.15047-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-05-13  mm/hugetlb: only drop uffd-wp special pte if required  (Peter Xu)

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if unmapping an entire vma, or when synchronized such that faults cannot race with the unmap operation. This requires passing zap_flags all the way down to the lowest-level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the ZAP_FLAG_DROP_MARKER flag, as synchronization is in place to prevent faults. The exception is hole punch, which will first unmap without any synchronization. Later, when hole punch actually removes the page from the file, it will check whether there was a subsequent fault, and if so take the hugetlb fault mutex while unmapping again. This second unmap will pass in ZAP_FLAG_DROP_MARKER.

The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when unmapping a hugetlb range" is (IMHO): we should never reach a state where a page fault could erroneously fault in a writable page-cache page that had been wr-protected, even for an extremely short period. That could happen if e.g. we passed ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(): if a page faults after that call and before remove_inode_hugepages() is executed, the page cache can be mapped writable again in the small racy window, which can cause unexpected data to be overwritten.

[peterx@redhat.com: fix sparse warning]
Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
[akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-05-13  mm: teach core mm about pte markers  (Peter Xu)

This patch still does not use pte markers in any way, but it teaches the core mm about the pte marker idea. For example, handle_pte_marker() is introduced to parse and handle all pte marker faults. Many of the changes just add comments, so that we know where a pte marker can show up and why we don't need special code for those cases.

[peterx@redhat.com: userfaultfd.c needs swapops.h]
Link: https://lkml.kernel.org/r/YmRlVj3cdizYJsr0@xz-m1.local
Link: https://lkml.kernel.org/r/20220405014833.14015-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-05-13  hugetlbfs: fix hugetlbfs_statfs() locking  (Mina Almasry)

After commit db71ef79b59b ("hugetlb: make free_huge_page irq safe"), the subpool lock should be locked with spin_lock_irq(), and all call sites were modified as such, except for the ones in hugetlbfs_statfs().

Link: https://lkml.kernel.org/r/20220429202207.3045-1-almasrymina@google.com
Fixes: db71ef79b59b ("hugetlb: make free_huge_page irq safe")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
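The shape of the fix (hedged sketch; subpool_of() is a hypothetical lookup and the body is elided except for the locking change the commit describes):

    static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
    {
        struct hugepage_subpool *spool = subpool_of(dentry);

        spin_lock_irq(&spool->lock);     /* was: spin_lock(&spool->lock) */
        /* ... read free/used page counts into *buf ... */
        spin_unlock_irq(&spool->lock);   /* was: spin_unlock(&spool->lock) */
        return 0;
    }
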

2022-05-13  mm/mprotect: use mmu_gather  (Nadav Amit)

Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.

This patchset is intended to remove unnecessary TLB flushes during mprotect() syscalls. Once this patch-set makes it through, similar and further optimizations for MADV_COLD and userfaultfd would be possible.

Basically, there are 3 optimizations in this patch-set:

1. Use the TLB batching infrastructure to batch flushes across VMAs and do better/fewer flushes. This would also be handy for later userfaultfd enhancements.

2. Avoid unnecessary TLB flushes. This optimization provides most of the performance benefits. Unlike previous versions, we now only avoid flushes that would not result in spurious page faults.

3. Avoid TLB flushes on change_huge_pmd() that are only needed to prevent the A/D bits from changing.

Andrew asked for some benchmark numbers. I do not have an easy deterministic macrobenchmark in which it is easy to show a benefit. I therefore ran a microbenchmark: a loop that does the following on anonymous memory, just as a sanity check to see that time is saved by avoiding TLB flushes. The loop goes:

	mprotect(p, PAGE_SIZE, PROT_READ)
	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
	*p = 0; // make the page writable

The test was run in a KVM guest with 1 or 2 threads (the second thread was busy-looping). I measured the time (cycles) of each operation:

                      1 thread              2 threads
                   mmots    +patch       mmots    +patch
  PROT_READ         3494    2725 (-22%)   8630    7788 (-10%)
  PROT_READ|WRITE   3952    2724 (-31%)   9075    2865 (-68%)

  [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear. There are 2 interesting results though:

(1) PROT_READ is cheaper, while one might expect it not to be affected. This is presumably due to a TLB miss that is saved.

(2) Without the memory access (*p = 0), the speedup of the patch is even greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush. As a result both operations on the patched kernel take roughly ~1500 cycles (with either 1 or 2 threads), whereas on mmotm their cost is as high as presented in the table.

This patch (of 3):

change_pXX_range() currently does not use mmu_gather, but instead implements its own deferred TLB flush scheme. This both complicates the code, as developers need to be aware of different invalidation schemes, and prevents opportunities to avoid TLB flushes or perform them at finer granularity.

The use of mmu_gather for modified PTEs has benefits in various scenarios even if pages are not released. For instance, if only a single page needs to be flushed out of a range of many pages, only that page would be flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction can be used instead of 512 instructions (or a full TLB flush, which Linux would actually use by default). mprotect() over multiple VMAs requires a single flush.

Use mmu_gather in change_pXX_range(). As the pages are not released, only record the flushed range using tlb_flush_pXX_range(). Handle THP similarly and get rid of flush_cache_range(), which becomes redundant since tlb_start_vma() calls it when needed.

Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
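A self-contained userspace version of the microbenchmark loop quoted above (hedged: this uses clock() for a rough per-iteration cost; the numbers in the table were measured in cycles inside a KVM guest, with a different harness):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    int main(void)
    {
        long i, n = 1000000;
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        clock_t start;

        if (p == MAP_FAILED)
            return 1;
        start = clock();
        for (i = 0; i < n; i++) {
            mprotect(p, 4096, PROT_READ);
            mprotect(p, 4096, PROT_READ | PROT_WRITE);
            *p = 0;               /* make the page writable */
        }
        printf("avg clock ticks per iteration: %f\n",
               (double)(clock() - start) / n);
        return 0;
    }
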

2022-05-13  io_uring: avoid io-wq -EAGAIN looping for !IOPOLL  (Pavel Begunkov)

If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work() might continue busily hammering the same handler over and over again, which is not ideal. The -EAGAIN handling in question was put there only for IOPOLL, so restrict it to IOPOLL mode only, where there is no other recourse than to retry, as we cannot wait.

Fixes: def596e9557c9 ("io_uring: support for IO polling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-13  io_uring: add flag for allocating a fully sparse direct descriptor space  (Jens Axboe)

Currently, to set up a fully sparse descriptor space upfront, the app needs to allocate an array of the full size, memset it to -1, and then pass it in. Make this a bit easier by adding a flag that simply does this internally rather than needing to copy each slot separately. This works with IORING_REGISTER_FILES2, as the flag is set in struct io_uring_rsrc_register, and is only allowed when the type is IORING_RSRC_FILE, as this doesn't make sense for registered buffers.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
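A hedged userspace sketch (assumption: a liburing new enough to provide io_uring_register_files_sparse(), which wraps IORING_REGISTER_FILES2 with the new sparse flag):

    #include <liburing.h>

    int setup_sparse_table(struct io_uring *ring)
    {
        /* before: allocate int fds[N], memset(fds, -1, sizeof(fds)),
         * then io_uring_register_files(ring, fds, N) */
        return io_uring_register_files_sparse(ring, 65536);
    }
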

2022-05-13  io_uring: bump max direct descriptor count to 1M  (Jens Axboe)

We currently limit these to 32K, but since we're now backing the table space with vmalloc when needed, there's no reason why we can't make it bigger. The total space is limited by RLIMIT_NOFILE as well.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-13  io_uring: allow allocated fixed files for accept  (Jens Axboe)

If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space, and it also allows multishot support to work with fixed files. Normal accept direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-13  io_uring: allow allocated fixed files for openat/openat2  (Jens Axboe)

If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space. Normal open direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
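A hedged userspace sketch of the allocated-descriptor flow for openat (assumptions: liburing's io_uring_prep_openat_direct() helper and the IORING_FILE_INDEX_ALLOC constant, both part of this feature's uapi):

    #include <fcntl.h>
    #include <liburing.h>

    /* ask the kernel to pick a fixed-file slot for us */
    void open_alloc(struct io_uring *ring, const char *path)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0,
                                    IORING_FILE_INDEX_ALLOC);
        io_uring_submit(ring);
        /* on completion: cqe->res >= 0 is the allocated slot, < 0 an error */
    }
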

2022-05-13  io_uring: add basic fixed file allocator  (Jens Axboe)

Applications currently always pick where they want fixed files to go. In preparation for allowing these types of commands with multishot support, add a basic allocator in the fixed file table.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
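A generic sketch of the kind of bitmap-backed slot allocator the commit describes (hedged: alloc_file_slot()/free_file_slot() are hypothetical names; the real allocator lives in the fixed-file table code):

    static int alloc_file_slot(unsigned long *bitmap, unsigned int nr_slots)
    {
        unsigned int slot = find_first_zero_bit(bitmap, nr_slots);

        if (slot >= nr_slots)
            return -ENFILE;        /* table full */
        __set_bit(slot, bitmap);
        return slot;
    }

    static void free_file_slot(unsigned long *bitmap, unsigned int slot)
    {
        __clear_bit(slot, bitmap);
    }
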

2022-05-13  io_uring: track fixed files with a bitmap  (Jens Axboe)

In preparation for adding a basic allocator for direct descriptors, add helpers that set/clear whether a file slot is used.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2022-05-12  fs/ntfs3: validate BOOT sectors_per_clusters  (Randy Dunlap)

When the NTFS BOOT sectors_per_clusters field is > 0x80, it represents a shift value. Make sure that the shift value is not too large before using it (the NTFS max cluster size is 2MB). Return -EINVAL if it is too large. This prevents negative shift values and shift values that are larger than the field size.

Prevents this UBSAN error:

  UBSAN: shift-out-of-bounds in ../fs/ntfs3/super.c:673:16
  shift exponent -192 is negative

Link: https://lkml.kernel.org/r/20220502175342.20296-1-rdunlap@infradead.org
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: syzbot+1631f09646bc214d2e76@syzkaller.appspotmail.com
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kari Argillander <kari.argillander@stargateuniverse.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
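A hedged sketch of the validation described above (check_sectors_per_clusters() is a hypothetical helper; a 2MB max cluster with 512-byte sectors allows a shift of at most 12):

    static int check_sectors_per_clusters(u8 spc)
    {
        if (spc <= 0x80)
            return spc;                /* plain sector count */
        /* values > 0x80 encode a shift as (256 - value) */
        if ((u8)(0 - spc) > 12)
            return -EINVAL;            /* shift too large */
        return 1 << (u8)(0 - spc);
    }
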

2022-05-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

No conflicts.

Build issue in drivers/net/ethernet/sfc/ptp.c:
  54fccfdd7c66 ("sfc: efx_default_channel_type APIs can be static")
  49e6123c65da ("net: sfc: fix memory leak due to ptp channel")
  https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2022-05-12  io_uring_enter(): don't leave f.flags uninitialized  (Al Viro)

This also simplifies the cleanup logic.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2022-05-12  f2fs: keep wait_ms if EAGAIN happens  (Jaegeuk Kim)

In the f2fs_gc thread, let's keep wait_ms when sec_freed is zero.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2022-05-12  f2fs: introduce f2fs_gc_control to consolidate f2fs_gc parameters  (Jaegeuk Kim)

No functional change.
- remove checkpoint=disable check for f2fs_write_checkpoint
- get sec_freed all the time

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2022-05-12  Merge tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs  (Linus Torvalds)

Pull fs fixes from Jan Kara:
"Three fixes that I'd still like to get to 5.18:

 - add a missing sanity check in the fanotify FAN_RENAME feature (added in 5.17, let's fix it before it gets wider usage in userspace)
 - udf fix for recently introduced filesystem corruption issue
 - writeback fix for a race in inode list handling that can lead to delayed writeback and possible dirty throttling stalls"

* tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Avoid using stale lengthOfImpUse
  writeback: Avoid skipping inode writeback
  fanotify: do not allow setting dirent events in mask of non-dir

2022-05-12  f2fs: reject test_dummy_encryption when !CONFIG_FS_ENCRYPTION  (Eric Biggers)

There is no good reason to allow this mount option when the kernel isn't configured with encryption support. Since this option is only for testing, we can just fix this; we don't really need to worry about breaking anyone who might be counting on this option being ignored.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2022-05-12  f2fs: kill volatile write support  (Jaegeuk Kim)

There are no users left, since everyone can simply use atomic writes. Let's kill it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2022-05-12  f2fs: change the current atomic write way  (Daeho Jeong)

The current atomic write support has three major issues:

- It keeps the updates in non-reclaimable memory space, and they are even hard to migrate, which is not good for contiguous memory allocation.
- Disk space used for atomic files cannot be garbage collected, which makes it difficult for the filesystem to be defragmented.
- If atomic write operations hit the threshold of either memory usage or garbage collection failure count, all atomic write operations fail immediately.

To resolve these issues, keep a COW inode internally for all the updates to be flushed from memory when we need to flush them out, for example under high memory pressure. These COW inodes are tagged as orphan inodes, so that they are reclaimed in case of sudden power-cut or system failure during atomic writes.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

2022-05-12  f2fs: don't need inode lock for system hidden quota  (Jaegeuk Kim)

Let's avoid a false-alarm lockdep warning:

[ 58.914674] [T1501146] -> #2 (&sb->s_type->i_mutex_key#20){+.+.}-{3:3}:
[ 58.915975] [T1501146] system_server: down_write+0x7c/0xe0
[ 58.916738] [T1501146] system_server: f2fs_quota_sync+0x60/0x1a8
[ 58.917563] [T1501146] system_server: block_operations+0x16c/0x43c
[ 58.918410] [T1501146] system_server: f2fs_write_checkpoint+0x114/0x318
[ 58.919312] [T1501146] system_server: f2fs_issue_checkpoint+0x178/0x21c
[ 58.920214] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c
[ 58.920999] [T1501146] system_server: f2fs_do_sync_file+0x334/0x738
[ 58.921862] [T1501146] system_server: f2fs_sync_file+0x30/0x48
[ 58.922667] [T1501146] system_server: __arm64_sys_fsync+0x84/0xf8
[ 58.923506] [T1501146] system_server: el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[ 58.924604] [T1501146] system_server: do_el0_svc+0x28/0xa0
[ 58.925366] [T1501146] system_server: el0_svc+0x24/0x38
[ 58.926094] [T1501146] system_server: el0_sync_handler+0x88/0xec
[ 58.926920] [T1501146] system_server: el0_sync+0x1b4/0x1c0
[ 58.927681] [T1501146] -> #1 (&sbi->cp_global_sem){+.+.}-{3:3}:
[ 58.928889] [T1501146] system_server: down_write+0x7c/0xe0
[ 58.929650] [T1501146] system_server: f2fs_write_checkpoint+0xbc/0x318
[ 58.930541] [T1501146] system_server: f2fs_issue_checkpoint+0x178/0x21c
[ 58.931443] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c
[ 58.932226] [T1501146] system_server: sync_filesystem+0xac/0x130
[ 58.933053] [T1501146] system_server: generic_shutdown_super+0x38/0x150
[ 58.933958] [T1501146] system_server: kill_block_super+0x24/0x58
[ 58.934791] [T1501146] system_server: kill_f2fs_super+0xcc/0x124
[ 58.935618] [T1501146] system_server: deactivate_locked_super+0x90/0x120
[ 58.936529] [T1501146] system_server: deactivate_super+0x74/0xac
[ 58.937356] [T1501146] system_server: cleanup_mnt+0x128/0x168
[ 58.938150] [T1501146] system_server: __cleanup_mnt+0x18/0x28
[ 58.938944] [T1501146] system_server: task_work_run+0xb8/0x14c
[ 58.939749] [T1501146] system_server: do_notify_resume+0x114/0x1e8
[ 58.940595] [T1501146] system_server: work_pending+0xc/0x5f0
[ 58.941375] [T1501146] -> #0 (&sbi->gc_lock){+.+.}-{3:3}:
[ 58.942519] [T1501146] system_server: __lock_acquire+0x1270/0x2868
[ 58.943366] [T1501146] system_server: lock_acquire+0x114/0x294
[ 58.944169] [T1501146] system_server: down_write+0x7c/0xe0
[ 58.944930] [T1501146] system_server: f2fs_issue_checkpoint+0x13c/0x21c
[ 58.945831] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c
[ 58.946614] [T1501146] system_server: f2fs_do_sync_file+0x334/0x738
[ 58.947472] [T1501146] system_server: f2fs_ioc_commit_atomic_write+0xc8/0x14c
[ 58.948439] [T1501146] system_server: __f2fs_ioctl+0x674/0x154c
[ 58.949253] [T1501146] system_server: f2fs_ioctl+0x54/0x88
[ 58.950018] [T1501146] system_server: __arm64_sys_ioctl+0xa8/0x110
[ 58.950865] [T1501146] system_server: el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[ 58.951965] [T1501146] system_server: do_el0_svc+0x28/0xa0
[ 58.952727] [T1501146] system_server: el0_svc+0x24/0x38
[ 58.953454] [T1501146] system_server: el0_sync_handler+0x88/0xec
[ 58.954279] [T1501146] system_server: el0_sync+0x1b4/0x1c0

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>