summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-01-11Merge tag 'for-linus-5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - set_fs removal - Devicetree support - Many cleanups from Al - Various virtio and build related fixes * tag 'for-linus-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (31 commits) um: virtio_uml: Allow probing from devicetree um: Add devicetree support um: Extract load file helper from initrd.c um: remove set_fs hostfs: Fix writeback of dirty pages um: Use swap() to make code cleaner um: header debriding - sigio.h um: header debriding - os.h um: header debriding - net_*.h um: header debriding - mem_user.h um: header debriding - activate_ipi() um: common-offsets.h debriding... um, x86: bury crypto_tfm_ctx_offset um: unexport handle_page_fault() um: remove a dangling extern of syscall_trace() um: kill unused cpu() uml/i386: missing include in barrier.h um: stop polluting the namespace with registers.h contents logic_io instance of iounmap() needs volatile on argument um: move amd64 variant of mmap(2) to arch/x86/um/syscalls_64.c ...
2022-01-11Merge tag 'for-linus-5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull JFFS2, UBI and UBIFS updates from Richard Weinberger: "JFFS2: - Fix for a deadlock in jffs2_write_begin() UBI: - Fixes in comments UBIFS: - Expose error counters in sysfs - Many bugfixes found by Hulk Robot and others" * tag 'for-linus-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: jffs2: GC deadlock reading a page that is used in jffs2_write_begin() ubifs: read-only if LEB may always be taken in ubifs_garbage_collect ubifs: fix double return leb in ubifs_garbage_collect ubifs: fix slab-out-of-bounds in ubifs_change_lp ubifs: fix snprintf() length check ubifs: Document sysfs nodes ubifs: Export filesystem error counters ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffers ubifs: Make use of the helper macro kthread_run() ubi: Fix a mistake in comment ubifs: Fix spelling mistakes
2022-01-11Merge tag 'dlm-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes the normal collection of minor fixes and cleanups, new kmem caches for network messaging structs, a start on some basic tracepoints, and some new debugfs files for inserting test messages" * tag 'dlm-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (32 commits) fs: dlm: print cluster addr if non-cluster node connects fs: dlm: memory cache for lowcomms hotpath fs: dlm: memory cache for writequeue_entry fs: dlm: memory cache for midcomms hotpath fs: dlm: remove wq_alloc mutex fs: dlm: use event based wait for pending remove fs: dlm: check for pending users filling buffers fs: dlm: use list_empty() to check last iteration fs: dlm: fix build with CONFIG_IPV6 disabled fs: dlm: replace use of socket sk_callback_lock with sock_lock fs: dlm: don't call kernel_getpeername() in error_report() fs: dlm: fix potential buffer overflow fs: dlm:Remove unneeded semicolon fs: dlm: remove double list_first_entry call fs: dlm: filter user dlm messages for kernel locks fs: dlm: add lkb waiters debugfs functionality fs: dlm: add lkb debugfs functionality fs: dlm: allow create lkb with specific id range fs: dlm: add debugfs rawmsg send functionality fs: dlm: let handle callback data as void ...
2022-01-11Merge tag 'gfs2-v5.16-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: "Various minor gfs2 cleanups and fixes" * tag 'gfs2-v5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: dump inode object for iopen glocks gfs2: Fix gfs2_instantiate description gfs2: Remove redundant check for GLF_INSTANTIATE_NEEDED gfs2: remove redundant set of INSTANTIATE_NEEDED gfs2: Fix __gfs2_holder_init function name in kernel-doc comment
2022-01-11xfs: take the ILOCK when readdir inspects directory mapping dataDarrick J. Wong
I was poking around in the directory code while diagnosing online fsck bugs, and noticed that xfs_readdir doesn't actually take the directory ILOCK when it calls xfs_dir2_isblock. xfs_dir_open most probably loaded the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so we're protected against writer threads, but we really need to follow the locking model like we do in other places. To avoid unnecessarily cycling the ILOCK for fairly small directories, change the block/leaf _getdents functions to consume the ILOCK hold that the parent readdir function took to decide on a _getdents implementation. It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK in the appropriate mode during lookups and writes, and we don't want to be holding the ILOCK when we copy directory entries to userspace in case there's a page fault. We really only need it to protect against data fork lookups, like we do for other files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-11Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Convert ext4 to use the new mount API, and add support for the FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls. In addition the usual large number of clean ups and bug fixes, in particular for the fast_commit feature" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (48 commits) ext4: don't use the orphan list when migrating an inode ext4: use BUG_ON instead of if condition followed by BUG ext4: fix a copy and paste typo ext4: set csum seed in tmp inode while migrating to extents ext4: remove unnecessary 'offset' assignment ext4: remove redundant o_start statement ext4: drop an always true check ext4: remove unused assignments ext4: remove redundant statement ext4: remove useless resetting io_end_size in mpage_process_page() ext4: allow to change s_last_trim_minblks via sysfs ext4: change s_last_trim_minblks type to unsigned long ext4: implement support for get/set fs label ext4: only set EXT4_MOUNT_QUOTA when journalled quota file is specified ext4: don't use kfree() on rcu protected pointer sbi->s_qf_names ext4: avoid trim error on fs with small groups ext4: fix an use-after-free issue about data=journal writeback mode ext4: fix null-ptr-deref in '__ext4_journal_ensure_credits' ext4: initialize err_blk before calling __ext4_get_inode_loc ext4: fix a possible ABBA deadlock due to busy PA ...
2022-01-11Merge tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "The big new feature here is that the mount code now only bothers to try to free stale COW staging extents if the fs unmounted uncleanly. This should reduce mount times, particularly on filesystems supporting reflink and containing a large number of allocation groups. Everything else this cycle are bugfixes, as the iomap folios conversion should be plenty enough excitement for anyone. That and I ran out of brain bandwidth after Thanksgiving last year. Summary: - Fix log recovery with da btree buffers when metauuid is in use. - Fix type coercion problems in xattr buffer size validation. - Fix a bug in online scrub dir leaf bestcount checking. - Only run COW recovery when recovering the log. - Fix symlink target buffer UAF problems and symlink locking problems by not exposing xfs innards to the VFS. - Fix incorrect quotaoff lock usage. - Don't let transactions cancel cleanly if they have deferred work items attached. - Fix a UAF when we're deciding if we need to relog an intent item. - Reduce kvmalloc overhead for log shadow buffers. - Clean up sysfs attr group usage. - Fix a bug where scrub's bmap/rmap checking could race with a quota file block allocation due to insufficient locking. - Teach scrub to complain about invalid project ids" * tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: warn about inodes with project id of -1 xfs: hold quota inode ILOCK_EXCL until the end of dqalloc xfs: Remove redundant assignment of mp xfs: reduce kvmalloc overhead for CIL shadow buffers xfs: sysfs: use default_groups in kobj_type xfs: prevent UAF in xfs_log_item_in_current_chkpt xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list() xfs: Fix comments mentioning xfs_ialloc xfs: check sb_meta_uuid for dabuf buffer recovery xfs: fix a bug in the online fsck directory leaf1 bestcount check xfs: only run COW extent recovery when there are no live extents xfs: don't expose internal symlink metadata buffers to the vfs xfs: fix quotaoff mutex usage now that we don't support disabling it xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
2022-01-11Merge tag 'for-5.17-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This end of the year branch is intentionally not that exciting. Most of the changes are under the hood, but there are some minor user visible improvements and several performance improvements too. Features: - make send work with concurrent block group relocation. We're not allowed to prevent send failing or silently producing some bad stream but with more fine grained locking and checks it's possible. The send vs deduplication exclusion could reuse the same logic in the future. - new exclusive operation 'balance paused' to allow adding a device to filesystem with paused balance - new sysfs file for fsid stored in the per-device directory to help distinguish devices when seeding is enabled, the fsid may differ from the one reported by the filesystem Performance improvements: - less metadata needed for directory logging, directory deletion is 20-40% faster - in zoned mode, cache zone information during mount to speed up repeated queries (about 50% speedup) - free space tree entries get indexed and searched by size (latency -30%, search run time -30%) - less contention in tree node locking when inserting a key and no splits are needed (files/sec in fsmark improves by 1-20%) Fixes: - fix ENOSPC failure when attempting direct IO write into NOCOW range - fix deadlock between quota enable and other quota operations - global reserve minimum calculations fixed to account for free space tree - in zoned mode, fix condition for chunk allocation that may not find the right zone for reuse and could lead to early ENOSPC Core: - global reserve stealing got simplified and cleaned up in evict - remove async transaction commit based on manual transaction refs, reuse existing kthread and mechanisms to let it commit transaction before timeout - preparatory work for extent tree v2, add wrappers for global tree roots, truncation path cleanups - remove readahead framework, it's a bit overengineered and used only for scrub, and yet it does not cover all its needs, there is another readahead built in the b-tree search that is now used, performance drop on HDD is about 5% which is acceptable and scrub is often throttled anyway, on SSDs there's no reported drop but slight improvement - self tests report extent tree state when error occurs - replace assert with debugging information when an uncommitted transaction is found at unmount time Other: - error handling improvements - other cleanups and refactoring" * tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: output more debug messages for uncommitted transaction btrfs: respect the max size in the header when activating swap file btrfs: fix argument list that the kdoc format and script verified btrfs: remove unnecessary parameter type from compression_decompress_bio btrfs: selftests: dump extent io tree if extent-io-tree test failed btrfs: scrub: cleanup the argument list of scrub_stripe() btrfs: scrub: cleanup the argument list of scrub_chunk() btrfs: remove reada infrastructure btrfs: scrub: use btrfs_path::reada for extent tree readahead btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity() btrfs: refactor unlock_up btrfs: skip transaction commit after failure to create subvolume btrfs: zoned: fix chunk allocation condition for zoned allocator btrfs: add extent allocator hook to decide to allocate chunk or not btrfs: zoned: unset dedicated block group on allocation failure btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned btrfs: zoned: sink zone check into btrfs_repair_one_zone btrfs: zoned: simplify btrfs_check_meta_write_pointer btrfs: zoned: encapsulate inode locking for zoned relocation btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device ...
2022-01-11Merge tag 'erofs-for-5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, tail-packing data inline for compressed files is now supported so that tail pcluster can be stored and read together with inode metadata in order to save data I/O and storage space. In addition to that, to prepare for the upcoming subpage, folio and fscache features, we also introduce meta buffers to get rid of erofs_get_meta_page() since it was too close to the page itself. In addition, in order to show supported kernel features and control sync decompression strategy, new sysfs nodes are introduced in this cycle as well. Summary: - add sysfs interface and a sysfs node to control sync decompression - add tail-packing inline support for compressed files - get rid of erofs_get_meta_page()" * tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: use meta buffers for zmap operations erofs: use meta buffers for xattr operations erofs: use meta buffers for super operations erofs: use meta buffers for inode operations erofs: introduce meta buffer operations erofs: add on-disk compressed tail-packing inline support erofs: support inline data decompression erofs: support unaligned data decompression erofs: introduce z_erofs_fixup_insize erofs: tidy up z_erofs_lz4_decompress erofs: clean up erofs_map_blocks tracepoints erofs: Replace zero-length array with flexible-array member erofs: add sysfs node to control sync decompression strategy erofs: add sysfs interface erofs: rename lz4_0pading to zero_padding
2022-01-119p, afs, ceph, nfs: Use current_is_kswapd() rather than ↵David Howells
gfpflags_allow_blocking() In 9p, afs ceph, and nfs, gfpflags_allow_blocking() (which wraps a test for __GFP_DIRECT_RECLAIM being set) is used to determine if ->releasepage() should wait for the completion of a DIO write to fscache with something like: if (folio_test_fscache(folio)) { if (!gfpflags_allow_blocking(gfp) || !(gfp & __GFP_FS)) return false; folio_wait_fscache(folio); } Instead, current_is_kswapd() should be used instead. Note that this is based on a patch originally by Zhaoyang Huang[1]. In addition to extending it to the other network filesystems and putting it on top of my fscache rewrite, it also needs to include linux/swap.h in a bunch of places. Can current_is_kswapd() be moved to linux/mm.h? Changes ======= ver #5: - Dropping the changes for cifs. Originally-signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <smfrench@gmail.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: linux-cachefs@redhat.com cc: v9fs-developer@lists.sourceforge.net cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/1638952658-20285-1-git-send-email-huangzhaoyang@gmail.com/ [1] Link: https://lore.kernel.org/r/164021590773.640689.16777975200823659231.stgit@warthog.procyon.org.uk/ # v4
2022-01-11Merge tag 'fs.idmapped.v5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull fs idmapping updates from Christian Brauner: "This contains the work to enable the idmapping infrastructure to support idmapped mounts of filesystems mounted with an idmapping. In addition this contains various cleanups that avoid repeated open-coding of the same functionality and simplify the code in quite a few places. We also finish the renaming of the mapping helpers we started a few kernel releases back and move them to a dedicated header to not continue polluting the fs header needlessly with low-level idmapping helpers. With this series the fs header only contains idmapping helpers that interact with fs objects. Currently we only support idmapped mounts for filesystems mounted without an idmapping themselves. This was a conscious decision mentioned in multiple places (cf. [1]). As explained at length in [3] it is perfectly fine to extend support for idmapped mounts to filesystem's mounted with an idmapping should the need arise. The need has been there for some time now (cf. [2]). Before we can port any filesystem that is mountable with an idmapping to support idmapped mounts in the coming cycles, we need to first extend the mapping helpers to account for the filesystem's idmapping. This again, is explained at length in our documentation at [3] and also in the individual commit messages so here's an overview. Currently, the low-level mapping helpers implement the remapping algorithms described in [3] in a simplified manner as we could rely on the fact that all filesystems supporting idmapped mounts are mounted without an idmapping. In contrast, filesystems mounted with an idmapping are very likely to not use an identity mapping and will instead use a non-identity mapping. So the translation step from or into the filesystem's idmapping in the remapping algorithm cannot be skipped for such filesystems. Non-idmapped filesystems and filesystems not supporting idmapped mounts are unaffected by this change as the remapping algorithms can take the same shortcut as before. If the low-level helpers detect that they are dealing with an idmapped mount but the underlying filesystem is mounted without an idmapping we can rely on the previous shortcut and can continue to skip the translation step from or into the filesystem's idmapping. And of course, if the low-level helpers detect that they are not dealing with an idmapped mount they can simply return the relevant id unchanged; no remapping needs to be performed at all. These checks guarantee that only the minimal amount of work is performed. As before, if idmapped mounts aren't used the low-level helpers are idempotent and no work is performed at all" Link: 2ca4dcc4909d ("fs/mount_setattr: tighten permission checks") [1] Link: https://github.com/containers/podman/issues/10374 [2] Link: Documentations/filesystems/idmappings.rst [3] Link: a65e58e791a1 ("fs: document and rename fsid helpers") [4] * tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: support mapped mounts of mapped filesystems fs: add i_user_ns() helper fs: port higher-level mapping helpers fs: remove unused low-level mapping helpers fs: use low-level mapping helpers docs: update mapping documentation fs: account for filesystem mappings fs: tweak fsuidgid_has_mapping() fs: move mapping helpers fs: add is_idmapped_mnt() helper
2022-01-11fscache: Add a tracepoint for cookie use/unuseDavid Howells
Add a tracepoint to track fscache_use/unuse_cookie(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/164021588628.640689.12942919367404043608.stgit@warthog.procyon.org.uk/ # v4
2022-01-11ceph: add fscache writeback supportJeff Layton
When updating the backing store from the pagecache (a'la writepage or writepages), write to the cache first. This allows us to keep caching files even when they are being written, as long as we have appropriate caps. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20211129162907.149445-3-jlayton@kernel.org/ # v1 Link: https://lore.kernel.org/r/20211207134451.66296-3-jlayton@kernel.org/ # v2 Link: https://lore.kernel.org/r/163906985808.143852.1383891557313186623.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967190257.1823006.16713609520911954804.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021585020.640689.6765214932458435472.stgit@warthog.procyon.org.uk/ # v4
2022-01-11ceph: conversion to new fscache APIJeff Layton
Now that the fscache API has been reworked and simplified, change ceph over to use it. With the old API, we would only instantiate a cookie when the file was open for reads. Change it to instantiate the cookie when the inode is instantiated and call use/unuse when the file is opened/closed. Also, ensure we resize the cached data on truncates, and invalidate the cache in response to the appropriate events. This will allow us to plumb in write support later. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1 Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2 Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4
2022-01-11select: Fix indefinitely sleeping task in poll_schedule_timeout()Jan Kara
A task can end up indefinitely sleeping in do_select() -> poll_schedule_timeout() when the following race happens: TASK1 (thread1) TASK2 TASK1 (thread2) do_select() setup poll_wqueues table with 'fd' write data to 'fd' pollwake() table->triggered = 1 closes 'fd' thread1 is waiting for poll_schedule_timeout() - sees table->triggered table->triggered = 0 return -EINTR loop back in do_select() But at this point when TASK1 loops back, the fdget() in the setup of poll_wqueues fails. So now so we never find 'fd' is ready for reading and sleep in poll_schedule_timeout() indefinitely. Treat an fd that got closed as a fd on which some event happened. This makes sure cannot block indefinitely in do_select(). Another option would be to return -EBADF in this case but that has a potential of subtly breaking applications that excercise this behavior and it happens to work for them. So returning fd as active seems like a safer choice. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-11gfs2: dump inode object for iopen glocksBob Peterson
Before this patch, glock dumps would not dump the gl_object for iopen glocks. This information can help us debug problems related to eviction: when AN iopen glock is blocked we can see the status of its underlying inode and its flags, etc. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-01-119p: fix enodata when reading growing fileDominique Martinet
Reading from a file that was just extended by a write, but the write had not yet reached the server would return ENODATA as illustrated by this command: $ xfs_io -c 'open -ft test' -c 'w 4096 1000' -c 'r 0 1000' wrote 1000/1000 bytes at offset 4096 1000.000000 bytes, 1 ops; 0.0001 sec (5.610 MiB/sec and 5882.3529 ops/sec) pread: No data available Fix this case by having netfs assume zeroes when reads from server come short like AFS and CEPH do Link: https://lkml.kernel.org/r/20220110111444.926753-1-asmadeus@codewreck.org Cc: stable@vger.kernel.org Fixes: eb497943fa21 ("9p: Convert to using the netfs helper lib to do reads and caching") Co-authored-by: David Howells <dhowells@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-10Merge tag '5.17-net-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Defer freeing TCP skbs to the BH handler, whenever possible, or at least perform the freeing outside of the socket lock section to decrease cross-CPU allocator work and improve latency. - Add netdevice refcount tracking to locate sources of netdevice and net namespace refcount leaks. - Make Tx watchdog less intrusive - avoid pausing Tx and restarting all queues from a single CPU removing latency spikes. - Various small optimizations throughout the stack from Eric Dumazet. - Make netdev->dev_addr[] constant, force modifications to go via appropriate helpers to allow us to keep addresses in ordered data structures. - Replace unix_table_lock with per-hash locks, improving performance of bind() calls. - Extend skb drop tracepoint with a drop reason. - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW. BPF --- - New helpers: - bpf_find_vma(), find and inspect VMAs for profiling use cases - bpf_loop(), runtime-bounded loop helper trading some execution time for much faster (if at all converging) verification - bpf_strncmp(), improve performance, avoid compiler flakiness - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt() for tracing programs, all inlined by the verifier - Support BPF relocations (CO-RE) in the kernel loader. - Further the support for BTF_TYPE_TAG annotations. - Allow access to local storage in sleepable helpers. - Convert verifier argument types to a composable form with different attributes which can be shared across types (ro, maybe-null). - Prepare libbpf for upcoming v1.0 release by cleaning up APIs, creating new, extensible ones where missing and deprecating those to be removed. Protocols --------- - WiFi (mac80211/cfg80211): - notify user space about long "come back in N" AP responses, allow it to react to such temporary rejections - allow non-standard VHT MCS 10/11 rates - use coarse time in airtime fairness code to save CPU cycles - Bluetooth: - rework of HCI command execution serialization to use a common queue and work struct, and improve handling errors reported in the middle of a batch of commands - rework HCI event handling to use skb_pull_data, avoiding packet parsing pitfalls - support AOSP Bluetooth Quality Report - SMC: - support net namespaces, following the RDMA model - improve connection establishment latency by pre-clearing buffers - introduce TCP ULP for automatic redirection to SMC - Multi-Path TCP: - support ioctls: SIOCINQ, OUTQ, and OUTQNSD - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT, IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY - support cmsgs: TCP_INQ - improvements in the data scheduler (assigning data to subflows) - support fastclose option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP) - MCTP (Management Component Transport) over serial, as defined by DMTF spec DSP0253 - "MCTP Serial Transport Binding". Driver API ---------- - Support timestamping on bond interfaces in active/passive mode. - Introduce generic phylink link mode validation for drivers which don't have any quirks and where MAC capability bits fully express what's supported. Allow PCS layer to participate in the validation. Convert a number of drivers. - Add support to set/get size of buffers on the Rx rings and size of the tx copybreak buffer via ethtool. - Support offloading TC actions as first-class citizens rather than only as attributes of filters, improve sharing and device resource utilization. - WiFi (mac80211/cfg80211): - support forwarding offload (ndo_fill_forward_path) - support for background radar detection hardware - SA Query Procedures offload on the AP side New hardware / drivers ---------------------- - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with real-time requirements for isochronous communication with protocols like OPC UA Pub/Sub. - Qualcomm BAM-DMUX WWAN - driver for data channels of modems integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974 (qcom_bam_dmux). - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver with support for bridging, VLANs and multicast forwarding (lan966x). - iwlmei driver for co-operating between Intel's WiFi driver and Intel's Active Management Technology (AMT) devices. - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips - Bluetooth: - MediaTek MT7921 SDIO devices - Foxconn MT7922A - Realtek RTL8852AE Drivers ------- - Significantly improve performance in the datapaths of: lan78xx, ax88179_178a, lantiq_xrx200, bnxt. - Intel Ethernet NICs: - igb: support PTP/time PEROUT and EXTTS SDP functions on 82580/i354/i350 adapters - ixgbevf: new PF -> VF mailbox API which avoids the risk of mailbox corruption with ESXi - iavf: support configuration of VLAN features of finer granularity, stacked tags and filtering - ice: PTP support for new E822 devices with sub-ns precision - ice: support firmware activation without reboot - Mellanox Ethernet NICs (mlx5): - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - support TC forwarding when tunnel encap and decap happen between two ports of the same NIC - dynamically size and allow disabling various features to save resources for running in embedded / SmartNIC scenarios - Broadcom Ethernet NICs (bnxt): - use page frag allocator to improve Rx performance - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - Other Ethernet NICs: - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support - Microsoft cloud/virtual NIC (mana): - add XDP support (PASS, DROP, TX) - Mellanox Ethernet switches (mlxsw): - initial support for Spectrum-4 ASICs - VxLAN with IPv6 underlay - Marvell Ethernet switches (prestera): - support flower flow templates - add basic IP forwarding support - NXP embedded Ethernet switches (ocelot & felix): - support Per-Stream Filtering and Policing (PSFP) - enable cut-through forwarding between ports by default - support FDMA to improve packet Rx/Tx to CPU - Other embedded switches: - hellcreek: improve trapping management (STP and PTP) packets - qca8k: support link aggregation and port mirroring - Qualcomm 802.11ax WiFi (ath11k): - qca6390, wcn6855: enable 802.11 power save mode in station mode - BSS color change support - WCN6855 hw2.1 support - 11d scan offload support - scan MAC address randomization support - full monitor mode, only supported on QCN9074 - qca6390/wcn6855: report signal and tx bitrate - qca6390: rfkill support - qca6390/wcn6855: regdb.bin support - Intel WiFi (iwlwifi): - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS) in cooperation with the BIOS - support for Optimized Connectivity Experience (OCE) scan - support firmware API version 68 - lots of preparatory work for the upcoming Bz device family - MediaTek WiFi (mt76): - Specific Absorption Rate (SAR) support - mt7921: 160 MHz channel support - RealTek WiFi (rtw88): - Specific Absorption Rate (SAR) support - scan offload - Other WiFi NICs - ath10k: support fetching (pre-)calibration data from nvmem - brcmfmac: configure keep-alive packet on suspend - wcn36xx: beacon filter support" * tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits) tcp: tcp_send_challenge_ack delete useless param `skb` net/qla3xxx: Remove useless DMA-32 fallback configuration rocker: Remove useless DMA-32 fallback configuration hinic: Remove useless DMA-32 fallback configuration lan743x: Remove useless DMA-32 fallback configuration net: enetc: Remove useless DMA-32 fallback configuration cxgb4vf: Remove useless DMA-32 fallback configuration cxgb4: Remove useless DMA-32 fallback configuration cxgb3: Remove useless DMA-32 fallback configuration bnx2x: Remove useless DMA-32 fallback configuration et131x: Remove useless DMA-32 fallback configuration be2net: Remove useless DMA-32 fallback configuration vmxnet3: Remove useless DMA-32 fallback configuration bna: Simplify DMA setting net: alteon: Simplify DMA setting myri10ge: Simplify DMA setting qlcnic: Simplify DMA setting net: allwinner: Fix print format page_pool: remove spinlock in page_pool_refill_alloc_cache() amt: fix wrong return type of amt_send_membership_update() ...
2022-01-10ubifs: Fix to add refcount once page is set privateZhihao Cheng
MM defined the rule [1] very clearly that once page was set with PG_private flag, we should increment the refcount in that page, also main flows like pageout(), migrate_page() will assume there is one additional page reference count if page_has_private() returns true. Otherwise, we may get a BUG in page migration: page:0000000080d05b9d refcount:-1 mapcount:0 mapping:000000005f4d82a8 index:0xe2 pfn:0x14c12 aops:ubifs_file_address_operations [ubifs] ino:8f1 dentry name:"f30e" flags: 0x1fffff80002405(locked|uptodate|owner_priv_1|private|node=0| zone=1|lastcpupid=0x1fffff) page dumped because: VM_BUG_ON_PAGE(page_count(page) != 0) ------------[ cut here ]------------ kernel BUG at include/linux/page_ref.h:184! invalid opcode: 0000 [#1] SMP CPU: 3 PID: 38 Comm: kcompactd0 Not tainted 5.15.0-rc5 RIP: 0010:migrate_page_move_mapping+0xac3/0xe70 Call Trace: ubifs_migrate_page+0x22/0xc0 [ubifs] move_to_new_page+0xb4/0x600 migrate_pages+0x1523/0x1cc0 compact_zone+0x8c5/0x14b0 kcompactd+0x2bc/0x560 kthread+0x18c/0x1e0 ret_from_fork+0x1f/0x30 Before the time, we should make clean a concept, what does refcount means in page gotten from grab_cache_page_write_begin(). There are 2 situations: Situation 1: refcount is 3, page is created by __page_cache_alloc. TYPE_A - the write process is using this page TYPE_B - page is assigned to one certain mapping by calling __add_to_page_cache_locked() TYPE_C - page is added into pagevec list corresponding current cpu by calling lru_cache_add() Situation 2: refcount is 2, page is gotten from the mapping's tree TYPE_B - page has been assigned to one certain mapping TYPE_A - the write process is using this page (by calling page_cache_get_speculative()) Filesystem releases one refcount by calling put_page() in xxx_write_end(), the released refcount corresponds to TYPE_A (write task is using it). If there are any processes using a page, page migration process will skip the page by judging whether expected_page_refs() equals to page refcount. The BUG is caused by following process: PA(cpu 0) kcompactd(cpu 1) compact_zone ubifs_write_begin page_a = grab_cache_page_write_begin add_to_page_cache_lru lru_cache_add pagevec_add // put page into cpu 0's pagevec (refcnf = 3, for page creation process) ubifs_write_end SetPagePrivate(page_a) // doesn't increase page count ! unlock_page(page_a) put_page(page_a) // refcnt = 2 [...] PB(cpu 0) filemap_read filemap_get_pages add_to_page_cache_lru lru_cache_add __pagevec_lru_add // traverse all pages in cpu 0's pagevec __pagevec_lru_add_fn SetPageLRU(page_a) isolate_migratepages isolate_migratepages_block get_page_unless_zero(page_a) // refcnt = 3 list_add(page_a, from_list) migrate_pages(from_list) __unmap_and_move move_to_new_page ubifs_migrate_page(page_a) migrate_page_move_mapping expected_page_refs get 3 (migration[1] + mapping[1] + private[1]) release_pages put_page_testzero(page_a) // refcnt = 3 page_ref_freeze // refcnt = 0 page_ref_dec_and_test(0 - 1 = -1) page_ref_unfreeze VM_BUG_ON_PAGE(-1 != 0, page) UBIFS doesn't increase the page refcount after setting private flag, which leads to page migration task believes the page is not used by any other processes, so the page is migrated. This causes concurrent accessing on page refcount between put_page() called by other process(eg. read process calls lru_cache_add) and page_ref_unfreeze() called by migration task. Actually zhangjun has tried to fix this problem [2] by recalculating page refcnt in ubifs_migrate_page(). It's better to follow MM rules [1], because just like Kirill suggested in [2], we need to check all users of page_has_private() helper. Like f2fs does in [3], fix it by adding/deleting refcount when setting/clearing private for a page. BTW, according to [4], we set 'page->private' as 1 because ubifs just simply SetPagePrivate(). And, [5] provided a common helper to set/clear page private, ubifs can use this helper following the example of iomap, afs, btrfs, etc. Jump [6] to find a reproducer. [1] https://lore.kernel.org/lkml/2b19b3c4-2bc4-15fa-15cc-27a13e5c7af1@aol.com [2] https://www.spinics.net/lists/linux-mtd/msg04018.html [3] http://lkml.iu.edu/hypermail/linux/kernel/1903.0/03313.html [4] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org [5] https://lore.kernel.org/all/20200517214718.468-1-guoqing.jiang@cloud.ionos.com [6] https://bugzilla.kernel.org/show_bug.cgi?id=214961 Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Fix read out-of-bounds in ubifs_wbuf_write_nolock()Zhihao Cheng
Function ubifs_wbuf_write_nolock() may access buf out of bounds in following process: ubifs_wbuf_write_nolock(): aligned_len = ALIGN(len, 8); // Assume len = 4089, aligned_len = 4096 if (aligned_len <= wbuf->avail) ... // Not satisfy if (wbuf->used) { ubifs_leb_write() // Fill some data in avail wbuf len -= wbuf->avail; // len is still not 8-bytes aligned aligned_len -= wbuf->avail; } n = aligned_len >> c->max_write_shift; if (n) { n <<= c->max_write_shift; err = ubifs_leb_write(c, wbuf->lnum, buf + written, wbuf->offs, n); // n > len, read out of bounds less than 8(n-len) bytes } , which can be catched by KASAN: ========================================================= BUG: KASAN: slab-out-of-bounds in ecc_sw_hamming_calculate+0x1dc/0x7d0 Read of size 4 at addr ffff888105594ff8 by task kworker/u8:4/128 Workqueue: writeback wb_workfn (flush-ubifs_0_0) Call Trace: kasan_report.cold+0x81/0x165 nand_write_page_swecc+0xa9/0x160 ubifs_leb_write+0xf2/0x1b0 [ubifs] ubifs_wbuf_write_nolock+0x421/0x12c0 [ubifs] write_head+0xdc/0x1c0 [ubifs] ubifs_jnl_write_inode+0x627/0x960 [ubifs] wb_workfn+0x8af/0xb80 Function ubifs_wbuf_write_nolock() accepts that parameter 'len' is not 8 bytes aligned, the 'len' represents the true length of buf (which is allocated in 'ubifs_jnl_xxx', eg. ubifs_jnl_write_inode), so ubifs_wbuf_write_nolock() must handle the length read from 'buf' carefully to write leb safely. Fetch a reproducer in [Link]. Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214785 Reported-by: Chengsong Ke <kechengsong@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: setflags: Make dirtied_ino_d 8 bytes alignedZhihao Cheng
Make 'ui->data_len' aligned with 8 bytes before it is assigned to dirtied_ino_d. Since 8871d84c8f8b0c6b("ubifs: convert to fileattr") applied, 'setflags()' only affects regular files and directories, only xattr inode, symlink inode and special inode(pipe/char_dev/block_dev) have none- zero 'ui->data_len' field, so assertion '!(req->dirtied_ino_d & 7)' cannot fail in ubifs_budget_space(). To avoid assertion fails in future evolution(eg. setflags can operate special inodes), it's better to make dirtied_ino_d 8 bytes aligned, after all aligned size is still zero for regular files. Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Rectify space amount budget for mkdir/tmpfile operationsZhihao Cheng
UBIFS should make sure the flash has enough space to store dirty (Data that is newer than disk) data (in memory), space budget is exactly designed to do that. If space budget calculates less data than we need, 'make_reservation()' will do more work(return -ENOSPC if no free space lelf, sometimes we can see "cannot reserve xxx bytes in jhead xxx, error -28" in ubifs error messages) with ubifs inodes locked, which may effect other syscalls. A simple way to decide how much space do we need when make a budget: See how much space is needed by 'make_reservation()' in ubifs_jnl_xxx() function according to corresponding operation. It's better to report ENOSPC in ubifs_budget_space(), as early as we can. Fixes: 474b93704f32163 ("ubifs: Implement O_TMPFILE") Fixes: 1e51764a3c2ac05 ("UBIFS: add new flash file system") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Fix 'ui->dirty' race between do_tmpfile() and writeback workZhihao Cheng
'ui->dirty' is not protected by 'ui_mutex' in function do_tmpfile() which may race with ubifs_write_inode[wb_workfn] to access/update 'ui->dirty', finally dirty space is released twice. open(O_TMPFILE) wb_workfn do_tmpfile ubifs_budget_space(ino_req = { .dirtied_ino = 1}) d_tmpfile // mark inode(tmpfile) dirty ubifs_jnl_update // without holding tmpfile's ui_mutex mark_inode_clean(ui) if (ui->dirty) ubifs_release_dirty_inode_budget(ui) // release first time ubifs_write_inode mutex_lock(&ui->ui_mutex) ubifs_release_dirty_inode_budget(ui) // release second time mutex_unlock(&ui->ui_mutex) ui->dirty = 0 Run generic/476 can reproduce following message easily (See reproducer in [Link]): UBIFS error (ubi0:0 pid 2578): ubifs_assert_failed [ubifs]: UBIFS assert failed: c->bi.dd_growth >= 0, in fs/ubifs/budget.c:554 UBIFS warning (ubi0:0 pid 2578): ubifs_ro_mode [ubifs]: switched to read-only mode, error -22 Workqueue: writeback wb_workfn (flush-ubifs_0_0) Call Trace: ubifs_ro_mode+0x54/0x60 [ubifs] ubifs_assert_failed+0x4b/0x80 [ubifs] ubifs_release_budget+0x468/0x5a0 [ubifs] ubifs_release_dirty_inode_budget+0x53/0x80 [ubifs] ubifs_write_inode+0x121/0x1f0 [ubifs] ... wb_workfn+0x283/0x7b0 Fix it by holding tmpfile ubifs inode lock during ubifs_jnl_update(). Similar problem exists in whiteout renaming, but previous fix("ubifs: Rename whiteout atomically") has solved the problem. Fixes: 474b93704f32163 ("ubifs: Implement O_TMPFILE") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214765 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Rename whiteout atomicallyZhihao Cheng
Currently, rename whiteout has 3 steps: 1. create tmpfile(which associates old dentry to tmpfile inode) for whiteout, and store tmpfile to disk 2. link whiteout, associate whiteout inode to old dentry agagin and store old dentry, old inode, new dentry on disk 3. writeback dirty whiteout inode to disk Suddenly power-cut or error occurring(eg. ENOSPC returned by budget, memory allocation failure) during above steps may cause kinds of problems: Problem 1: ENOSPC returned by whiteout space budget (before step 2), old dentry will disappear after rename syscall, whiteout file cannot be found either. ls dir // we get file, whiteout rename(dir/file, dir/whiteout, REANME_WHITEOUT) ENOSPC = ubifs_budget_space(&wht_req) // return ls dir // empty (no file, no whiteout) Problem 2: Power-cut happens before step 3, whiteout inode with 'nlink=1' is not stored on disk, whiteout dentry(old dentry) is written on disk, whiteout file is lost on next mount (We get "dead directory entry" after executing 'ls -l' on whiteout file). Now, we use following 3 steps to finish rename whiteout: 1. create an in-mem inode with 'nlink = 1' as whiteout 2. ubifs_jnl_rename (Write on disk to finish associating old dentry to whiteout inode, associating new dentry with old inode) 3. iput(whiteout) Rely writing in-mem inode on disk by ubifs_jnl_rename() to finish rename whiteout, which avoids middle disk state caused by suddenly power-cut and error occurring. Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10Merge tag 'pstore-v5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore update from Kees Cook: - Add boot param for early ftrace recording in pstore (Uwe Kleine-König) * tag 'pstore-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ftrace: Allow immediate recording
2022-01-10ksmbd: add smb-direct shutdownYufan Chen
When killing ksmbd server after connecting rdma, ksmbd threads does not terminate properly because the rdma connection is still alive. This patch add shutdown operation to disconnect rdma connection while ksmbd threads terminate. Signed-off-by: Yufan Chen <wiz.chen@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: smbd: change the default maximum read/write, receive sizeHyunchul Lee
Due to restriction that cannot handle multiple buffer descriptor structures, decrease the maximum read/write size for Windows clients. And set the maximum fragmented receive size in consideration of the receive queue size. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: smbd: create MR poolHyunchul Lee
Create a memory region pool because rdma_rw_ctx_init() uses memory registration if memory registration yields better performance than using multiple SGE entries. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: add reserved room in ipc request/responseNamjae Jeon
Whenever new parameter is added to smb configuration, It is possible to break the execution of the IPC daemon by mismatch size of request/response. This patch tries to reserve space in ipc request/response in advance to prevent that. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: smbd: call rdma_accept() under CM handlerHyunchul Lee
if CONFIG_LOCKDEP is enabled, the following kernel warning message is generated because rdma_accept() checks whehter the handler_mutex is held by lockdep_assert_held. CM(Connection Manager) holds the mutex before CM handler callback is called. [ 63.211405 ] WARNING: CPU: 1 PID: 345 at drivers/infiniband/core/cma.c:4405 rdma_accept+0x17a/0x350 [ 63.212080 ] RIP: 0010:rdma_accept+0x17a/0x350 ... [ 63.214036 ] Call Trace: [ 63.214098 ] <TASK> [ 63.214185 ] smb_direct_accept_client+0xb4/0x170 [ksmbd] [ 63.214412 ] smb_direct_prepare+0x322/0x8c0 [ksmbd] [ 63.214555 ] ? rcu_read_lock_sched_held+0x3a/0x70 [ 63.214700 ] ksmbd_conn_handler_loop+0x63/0x270 [ksmbd] [ 63.214826 ] ? ksmbd_conn_alive+0x80/0x80 [ksmbd] [ 63.214952 ] kthread+0x171/0x1a0 [ 63.215039 ] ? set_kthread_struct+0x40/0x40 [ 63.215128 ] ret_from_fork+0x22/0x30 To avoid this, move creating a queue pair and accepting a client from transport_ops->prepare() to smb_direct_handle_connect_request(). Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: limits exceeding the maximum allowable outstanding requestsNamjae Jeon
If the client ignores the CreditResponse received from the server and continues to send the request, ksmbd limits the requests if it exceeds smb2 max credits. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: move credit charge deduction under processing requestNamjae Jeon
Moves the credit charge deduction from total_credits under the processing a request. When repeating smb2 lock request and other command request, there will be a problem that ->total_credits does not decrease. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: add support for smb2 max credit parameterNamjae Jeon
Add smb2 max credits parameter to adjust maximum credits value to limit number of outstanding requests. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: set 445 port to smbdirect port by defaultNamjae Jeon
When SMB Direct is used with iWARP, Windows use 5445 port for smb direct port, 445 port for SMB. This patch check ib_device using ib_client to know if NICs type is iWARP or Infiniband. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ksmbd: register ksmbd ib client with ib_register_client()Hyunchul Lee
Register ksmbd ib client with ib_register_client() to find the rdma capable network adapter. If ops.get_netdev(Chelsio NICs) is NULL, ksmbd will find it using ib_device_get_by_netdev in old way. Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-01-10ext4: don't use the orphan list when migrating an inodeTheodore Ts'o
We probably want to remove the indirect block to extents migration feature after a deprecation window, but until then, let's fix a potential data loss problem caused by the fact that we put the tmp_inode on the orphan list. In the unlikely case where we crash and do a journal recovery, the data blocks belonging to the inode being migrated are also represented in the tmp_inode on the orphan list --- and so its data blocks will get marked unallocated, and available for reuse. Instead, stop putting the tmp_inode on the oprhan list. So in the case where we crash while migrating the inode, we'll leak an inode, which is not a disaster. It will be easily fixed the next time we run fsck, and it's better than potentially having blocks getting claimed by two different files, and losing data as a result. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@kernel.org
2022-01-10ext4: use BUG_ON instead of if condition followed by BUGxu xin
BUG_ON would be better. This issue was detected with the help of Coccinelle. Reported-by: Zeal robot <zealci@zte.com.cn> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Link: https://lore.kernel.org/r/20211228073252.580296-1-xu.xin16@zte.com.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: fix a copy and paste typoDan Carpenter
This was obviously supposed to be an ext4 struct, not xfs. GCC doesn't care either way so it doesn't affect the build or runtime. Fixes: cebe85d570cf ("ext4: switch to the new mount api") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20211215114309.GB14552@kili Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: set csum seed in tmp inode while migrating to extentsLuís Henriques
When migrating to extents, the temporary inode will have it's own checksum seed. This means that, when swapping the inodes data, the inode checksums will be incorrect. This can be fixed by recalculating the extents checksums again. Or simply by copying the seed into the temporary inode. Link: https://bugzilla.kernel.org/show_bug.cgi?id=213357 Reported-by: Jeroen van Wolffelaar <jeroen@wolffelaar.nl> Signed-off-by: Luís Henriques <lhenriques@suse.de> Link: https://lore.kernel.org/r/20211214175058.19511-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2022-01-10ext4: remove unnecessary 'offset' assignmentluo penghao
Although it is in the loop, offset is reassigned at the beginning of the while loop. And after the loop, the value will not be used The clang_analyzer complains as follows: fs/ext4/dir.c:306:3 warning: Value stored to 'offset' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Link: https://lore.kernel.org/r/20211208075307.404703-1-luo.penghao@zte.com.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: remove redundant o_start statementluo penghao
The if will goto out of the loop, and until the end of the function execution, o_start will not be used again. The clang_analyzer complains as follows: fs/ext4/move_extent.c:635:5 warning: Value stored to 'o_start' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Link: https://lore.kernel.org/r/20211208075157.404535-1-luo.penghao@zte.com.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: drop an always true checkAdam Borowski
EXT_FIRST_INDEX(ptr) is ptr+12, which can't possibly be null; gcc-12 warns about this. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20211115172020.57853-1-kilobyte@angband.pl Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: remove unused assignmentsluo penghao
The eh assignment in these two places is meaningless, because the function will goto to merge, which will not use eh. The clang_analyzer complains as follows: fs/ext4/extents.c:1988:4 warning: fs/ext4/extents.c:2016:4 warning: Value stored to 'eh' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Link: https://lore.kernel.org/r/20211104064007.2919-1-luo.penghao@zte.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: remove redundant statementluo penghao
The local variable assignment at the end of the function is meaningless. The clang_analyzer complains as follows: fs/ext4/fast_commit.c:779:2 warning: Value stored to 'dst' is never read Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Link: https://lore.kernel.org/r/20211104063406.2747-1-luo.penghao@zte.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: remove useless resetting io_end_size in mpage_process_page()Nghia Le
The command "make clang-analyzer" detects dead stores in mpage_process_page() function. Do not reset io_end_size to 0 in the current paths, as the function exits on those paths without further using io_end_size. Signed-off-by: Nghia Le <nghialm78@gmail.com> Link: https://lore.kernel.org/r/20211025221803.3326-1-nghialm78@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: allow to change s_last_trim_minblks via sysfsLukas Czerner
Ext4 has an optimization mechanism for batched disacrd (FITRIM) that should help speed up subsequent calls of FITRIM ioctl by skipping the groups that were previously trimmed. However because the FITRIM allows to set the minimum size of an extent to trim, ext4 stores the last minimum extent size and only avoids trimming the group if it was previously trimmed with minimum extent size equal to, or smaller than the current call. There is currently no way to bypass the optimization without umount/mount cycle. This becomes a problem when the file system is live migrated to a different storage, because the optimization will prevent possibly useful discard calls to the storage. Fix it by exporting the s_last_trim_minblks via sysfs interface which will allow us to set the minimum size to the number of blocks larger than subsequent FITRIM call, effectively bypassing the optimization. By setting the s_last_trim_minblks to ULONG_MAX the optimization will be effectively cleared regardless of the previous state, or file system configuration. For example: getconf ULONG_MAX > /sys/fs/ext4/dm-1/last_trim_minblks Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Laurent GUERBY <laurent@guerby.net> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20211103145122.17338-2-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: change s_last_trim_minblks type to unsigned longLukas Czerner
There is no good reason for the s_last_trim_minblks to be atomic. There is no data integrity needed and there is no real danger in setting and reading it in a racy manner. Change it to be unsigned long, the same type as s_clusters_per_group which is the maximum that's allowed. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: implement support for get/set fs labelLukas Czerner
Implement support for FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls for online reading and setting of file system label. ext4_ioctl_getlabel() is simple, just get the label from the primary superblock. This might not be the first sb on the file system if 'sb=' mount option is used. In ext4_ioctl_setlabel() we update what ext4 currently views as a primary superblock and then proceed to update backup superblocks. There are two caveats: - the primary superblock might not be the first superblock and so it might not be the one used by userspace tools if read directly off the disk. - because the primary superblock might not be the first superblock we potentialy have to update it as part of backup superblock update. However the first sb location is a bit more complicated than the rest so we have to account for that. The superblock modification is created generic enough so the infrastructure can be used for other potential superblock modification operations, such as chaning UUID. Tested with generic/492 with various configurations. I also checked the behavior with 'sb=' mount options, including very large file systems with and without sparse_super/sparse_super2. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20211213135618.43303-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: only set EXT4_MOUNT_QUOTA when journalled quota file is specifiedLukas Czerner
Only set EXT4_MOUNT_QUOTA when journalled quota file is specified, otherwise simply disabling specific quota type (usrjquota=) will also set the EXT4_MOUNT_QUOTA super block option. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Fixes: e6e268cb6822 ("ext4: move quota configuration out of handle_mount_opt()") Link: https://lore.kernel.org/r/20220104143518.134465-2-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-01-10ext4: don't use kfree() on rcu protected pointer sbi->s_qf_namesLukas Czerner
During ext4 mount api rework the commit e6e268cb6822 ("ext4: move quota configuration out of handle_mount_opt()") introduced a bug where we would kfree(sbi->s_qf_names[i]) before assigning the new quota name in ext4_apply_quota_options(). This is wrong because we're using kfree() on rcu prointer that could be simultaneously accessed from ext4_show_quota_options() during remount. Fix it by using rcu_replace_pointer() to replace the old qname with the new one and then kfree_rcu() the old quota name. Also use get_qf_name() instead of sbi->s_qf_names in strcmp() to silence the sparse warning. Fixes: e6e268cb6822 ("ext4: move quota configuration out of handle_mount_opt()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220104143518.134465-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>