summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-05-06bcachefs: Add a better limit for maximum number of bucketsKent Overstreet
The bucket_gens array is a single array allocation (one byte per bucket), and kernel allocations are still limited to INT_MAX. Check this limit to avoid failing the bucket_gens array allocation. Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Fix lifetime issue in device iterator helpersKent Overstreet
bch2_get_next_dev() and bch2_get_next_online_dev() iterate over devices, dropping and taking refs as they go; we can't access the previous device (for ca->dev_idx) after we've dropped our ref to it, unless we take rcu_read_lock() first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Fix bch2_dev_lookup() refcountingKent Overstreet
bch2_dev_lookup() is supposed to take a ref on the device it returns, but for_each_member_device() takes refs as it iterates, for_each_member_device_rcu() does not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Initialize bch_write_op->failed in inline data pathKent Overstreet
Normally this is initialized in __bch2_write(), which is executed in a loop, but the inline data path skips this. Reported-by: syzbot+fd3ccb331eb21f05d13b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Fix refcount put in sb_field_resize error pathKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Inodes need extra padding for varint_decode_fast()Kent Overstreet
Reported-by: syzbot+66b9b74f6520068596a9@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Fix early error path in bch2_fs_btree_key_cache_exit()Kent Overstreet
Reported-by: syzbot+a35cdb62ec34d44fb062@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: bucket_pos_to_bp_noerror()Kent Overstreet
We don't want the assert when we're checking if the backpointer is valid. Reported-by: syzbot+bf7215c0525098e7747a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: don't free error pointersKent Overstreet
Reported-by: syzbot+3333603f569fc2ef258c@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()Kent Overstreet
We're using mutex_lock() inside a wait_event() conditional - prepare_to_wait() has already flipped task state, so potentially blocking ops need annotation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-06orangefs: fix out-of-bounds fsid accessMike Marshall
Arnd Bergmann sent a patch to fsdevel, he says: "orangefs_statfs() copies two consecutive fields of the superblock into the statfs structure, which triggers a warning from the string fortification helpers" Jan Kara suggested an alternate way to do the patch to make it more readable. I ran both ideas through xfstests and both seem fine. This patch is based on Jan Kara's suggestion. Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2024-05-06nfsd: set security label during create operationsStephen Smalley
When security labeling is enabled, the client can pass a file security label as part of a create operation for the new file, similar to mode and other attributes. At present, the security label is received by nfsd and passed down to nfsd_create_setattr(), but nfsd_setattr() is never called and therefore the label is never set on the new file. This bug may have been introduced on or around commit d6a97d3f589a ("NFSD: add security label to struct nfsd_attrs"). Looking at nfsd_setattr() I am uncertain as to whether the same issue presents for file ACLs and therefore requires a similar fix for those. An alternative approach would be to introduce a new LSM hook to set the "create SID" of the current task prior to the actual file creation, which would atomically label the new inode at creation time. This would be better for SELinux and a similar approach has been used previously (see security_dentry_create_files_as) but perhaps not usable by other LSMs. Reproducer: 1. Install a Linux distro with SELinux - Fedora is easiest 2. git clone https://github.com/SELinuxProject/selinux-testsuite 3. Install the requisite dependencies per selinux-testsuite/README.md 4. Run something like the following script: MOUNT=$HOME/selinux-testsuite sudo systemctl start nfs-server sudo exportfs -o rw,no_root_squash,security_label localhost:$MOUNT sudo mkdir -p /mnt/selinux-testsuite sudo mount -t nfs -o vers=4.2 localhost:$MOUNT /mnt/selinux-testsuite pushd /mnt/selinux-testsuite/ sudo make -C policy load pushd tests/filesystem sudo runcon -t test_filesystem_t ./create_file -f trans_test_file \ -e test_filesystem_filetranscon_t -v sudo rm -f trans_test_file popd sudo make -C policy unload popd sudo umount /mnt/selinux-testsuite sudo exportfs -u localhost:$MOUNT sudo rmdir /mnt/selinux-testsuite sudo systemctl stop nfs-server Expected output: <eliding noise from commands run prior to or after the test itself> Process context: unconfined_u:unconfined_r:test_filesystem_t:s0-s0:c0.c1023 Created file: trans_test_file File context: unconfined_u:object_r:test_filesystem_filetranscon_t:s0 File context is correct Actual output: <eliding noise from commands run prior to or after the test itself> Process context: unconfined_u:unconfined_r:test_filesystem_t:s0-s0:c0.c1023 Created file: trans_test_file File context: system_u:object_r:test_file_t:s0 File context error, expected: test_filesystem_filetranscon_t got: test_file_t Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: Add COPY status code to OFFLOAD_STATUS responseChuck Lever
Clients that send an OFFLOAD_STATUS might want to distinguish between an async COPY operation that is still running, has completed successfully, or that has failed. The intention of this patch is to make NFSD behave like this: * Copy still running: OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied so far, and an empty osr_status array * Copy completed successfully: OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied, and an osr_status of NFS4_OK * Copy failed: OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied, and an osr_status other than NFS4_OK * Copy operation lost, canceled, or otherwise unrecognized: OFFLOAD_STATUS returns NFS4ERR_BAD_STATEID NB: Though RFC 7862 Section 11.2 lists a small set of NFS status codes that are valid for OFFLOAD_STATUS, there do not seem to be any explicit spec limits on the status codes that may be returned in the osr_status field. At this time we have no unit tests for COPY and its brethren, as pynfs does not yet implement support for NFSv4.2. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: Record status of async copy operation in struct nfsd4_copyChuck Lever
After a client has started an asynchronous COPY operation, a subsequent OFFLOAD_STATUS operation will need to report the status code once that COPY operation has completed. The recorded status record will be used by a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: add listener-{set,get} netlink commandLorenzo Bianconi
Introduce write_ports netlink command. For listener-set, userspace is expected to provide a NFS listeners list it wants enabled. All other sockets will be closed. Reviewed-by: Jeff Layton <jlayton@kernel.org> Co-developed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: add write_version to netlink commandLorenzo Bianconi
Introduce write_version netlink command through a "declarative" interface. This patch introduces a change in behavior since for version-set userspace is expected to provide a NFS major/minor version list it wants to enable while all the other ones will be disabled. (procfs write_version command implements imperative interface where the admin writes +3/-3 to enable/disable a single version. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: convert write_threads to netlink commandLorenzo Bianconi
Introduce write_threads netlink command similar to the one available through the procfs. Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Co-developed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: allow callers to pass in scope string to nfsd_svcJeff Layton
Currently admins set this by using unshare to create a new uts namespace, and then resetting the hostname. With the new netlink interface we can just pass this in directly. Prepare nfsd_svc for this change. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: move nfsd_mutex handling into nfsd_svc callersJeff Layton
Currently nfsd_svc holds the nfsd_mutex over the whole function. For some of the later netlink patches though, we want to do some other things to the server before starting it. Move the mutex handling into the callers. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06lockd: host: Remove unnecessary statements'host = NULL;'Li kunyu
In 'nlm_alloc_host', the host has already been assigned a value of NULL when defined, so 'host=NULL;' Can be deleted. Signed-off-by: Li kunyu <kunyu@nfschina.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: don't create nfsv4recoverydir in nfsdfs when not used.NeilBrown
When CONFIG_NFSD_LEGACY_CLIENT_TRACKING is not set, the virtual file /proc/fs/nfsd/nfsv4recoverydir is created but responds EINVAL to any access. This is not useful, is somewhat surprising, and it causes ltp to complain. The only known user of this file is in nfs-utils, which handles non-existence and read-failure equally well. So there is nothing to gain from leaving the file present but inaccessible. So this patch removes the file when its content is not available - i.e. when that config option is not selected. Also remove the #ifdef which hides some of the enum values when CONFIG_NFSD_V$ not selection. simple_fill_super() quietly ignores array entries that are not present, so having slots in the array that don't get used is perfectly acceptable. So there is no value in this #ifdef. Reported-by: Petr Vorel <pvorel@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: 74fd48739d04 ("nfsd: new Kconfig option for legacy client tracking") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: optimise recalculate_deny_mode() for a common caseNeilBrown
recalculate_deny_mode() takes time that is linear in the number of stateids active on the file. When called from release_openowner -> free_ol_stateid_reaplist ->nfs4_free_ol_stateid -> release_all_access the number of times it is called is linear in the number of stateids. The net result is that time taken by release_openowner is quadratic in the number of stateids. When the nfsd server is shut down while there are many active stateids this can result in a soft lockup. ("CPU stuck for 302s" seen in one case). In many cases all the states have the same deny modes and there is no need to examine the entire list in recalculate_deny_mode(). In particular, recalculate_deny_mode() will only reduce the deny mode, never increase it. So if some prefix of the list causes the original deny mode to be required, there is no need to examine the remainder of the list. So we can improve recalculate_deny_mode() to usually run in constant time, so release_openowner will typically be only linear in the number of states. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: add tracepoint in mark_client_expired_lockedJeff Layton
Show client info alongside the number of cl_rpc_users. If that's elevated, then we can infer that this function returned nfserr_jukebox. [ cel: For additional debugging of RPC user refcounting ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: new tracepoint for check_slot_seqidChuck Lever
Replace a dprintk in check_slot_seqid with tracepoints. These new tracepoints track slot sequence numbers during operation. Suggested-by: Jeffrey Layton <jlayton@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: drop extraneous newline from nfsd tracepointsJeff Layton
We never want a newline in tracepoint output. Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06fs: nfsd: use group allocation/free of per-cpu counters APIKefeng Wang
Use group allocation/free of per-cpu counters api to accelerate nfsd percpu_counters init/destroy(), and also squash the nfsd_percpu_counters_init/reset/destroy() and nfsd_counters_init/destroy() into callers to simplify code. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: trivial GET_DIR_DELEGATION supportJeff Layton
This adds basic infrastructure for handing GET_DIR_DELEGATION calls from clients, including the decoders and encoders. For now, it always just returns NFS4_OK + GDD4_UNAVAIL. Eventually clients may start sending this operation, and it's better if we can return GDD4_UNAVAIL instead of having to abort the whole compound. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06NFSD: Move callback_wq into struct nfs4_clientChuck Lever
Commit 883820366747 ("nfsd: update workqueue creation") made the callback_wq single-threaded, presumably to protect modifications of cl_cb_client. See documenting comment for nfsd4_process_cb_update(). However, cl_cb_client is per-lease. There's no other reason that all callback operations need to be dispatched via a single thread. The single threading here means all client callbacks can be blocked by a problem with one client. Change the NFSv4 callback client so it serializes per-lease instead of serializing all NFSv4 callback operations on the server. Reported-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: drop st_mutex before calling move_to_close_lru()NeilBrown
move_to_close_lru() is currently called with ->st_mutex held. This can lead to a deadlock as move_to_close_lru() waits for sc_count to drop to 2, and some threads holding a reference might be waiting for the mutex. These references will never be dropped so sc_count will never reach 2. There can be no harm in dropping ->st_mutex before move_to_close_lru() because the only place that takes the mutex is nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called. See also https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/ where this problem was raised but not successfully resolved. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru()NeilBrown
move_to_close_lru() waits for sc_count to become zero while holding rp_mutex. This can deadlock if another thread holds a reference and is waiting for rp_mutex. By the time we get to move_to_close_lru() the openowner is unhashed and cannot be found any more. So code waiting for the mutex can safely retry the lookup if move_to_close_lru() has started. So change rp_mutex to an atomic_t with three states: RP_UNLOCK - state is still hashed, not locked for reply RP_LOCKED - state is still hashed, is locked for reply RP_UNHASHED - state is not hashed, no code can get a lock. Use wait_var_event() to wait for either a lock, or for the owner to be unhashed. In the latter case, retry the lookup. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: move nfsd4_cstate_assign_replay() earlier in open handling.NeilBrown
Rather than taking the rp_mutex (via nfsd4_cstate_assign_replay) in nfsd4_cleanup_open_state() (which seems counter-intuitive), take it and assign rp_owner as soon as possible - in nfsd4_process_open1(). This will support a future change when nfsd4_cstate_assign_replay() might fail. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-06nfsd: perform all find_openstateowner_str calls in the one place.NeilBrown
Currently find_openstateowner_str look ups are done both in nfsd4_process_open1() and alloc_init_open_stateowner() - the latter possibly being a surprise based on its name. It would be easier to follow, and more conformant to common patterns, if the lookup was all in the one place. So replace alloc_init_open_stateowner() with find_or_alloc_open_stateowner() and use the latter in nfsd4_process_open1() without any calls to find_openstateowner_str(). This means all finds are find_openstateowner_str_locked() and find_openstateowner_str() is no longer needed. So discard find_openstateowner_str() and rename find_openstateowner_str_locked() to find_openstateowner_str(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-05-05mm: simplify thp_vma_allowable_orderMatthew Wilcox
Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05f2fs: convert f2fs_clear_page_cache_dirty_tag to use a folioMatthew Wilcox (Oracle)
Removes uses of page_mapping() and page_index(). Link: https://lkml.kernel.org/r/20240423225552.4113447-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05fscrypt: convert bh_get_inode_and_lblk_num to use a folioMatthew Wilcox (Oracle)
Patch series "Remove page_mapping()". There are only a few users left. Convert them all to either call folio_mapping() or just use folio->mapping directly. This patch (of 6): Remove uses of page->index, page_mapping() and b_page. Saves a call to compound_head(). Link: https://lkml.kernel.org/r/20240423225552.4113447-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240423225552.4113447-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05fs/proc/task_mmu: convert smaps_hugetlb_range() to work on foliosDavid Hildenbrand
Let's get rid of another page_mapcount() check and simply use folio_likely_mapped_shared(), which is precise for hugetlb folios. While at it, use huge_ptep_get() + pte_page() instead of ptep_get() + vm_normal_page(), just like we do in pagemap_hugetlb_range(). No functional change intended. Link: https://lkml.kernel.org/r/20240417092313.753919-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05fs/proc/task_mmu: convert pagemap_hugetlb_range() to work on foliosDavid Hildenbrand
Patch series "fs/proc/task_mmu: convert hugetlb functions to work on folis". Let's convert two more functions, getting rid of two more page_mapcount() calls. This patch (of 2): Let's get rid of another page_mapcount() check and simply use folio_likely_mapped_shared(), which is precise for hugetlb folios. While at it, also check for PMD table sharing, like we do in smaps_hugetlb_range(). No functional change intended, except that we would now detect hugetlb folios shared via PMD table sharing correctly. Link: https://lkml.kernel.org/r/20240417092313.753919-1-david@redhat.com Link: https://lkml.kernel.org/r/20240417092313.753919-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: improve bdev_getblk documentationMatthew Wilcox (Oracle)
Add some more information about the state of the buffer_head returned. Link: https://lkml.kernel.org/r/20240416031754.4076917-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for bforget() and __bforget()Matthew Wilcox (Oracle)
Distinguish these functions from brelse() and __brelse(). Link: https://lkml.kernel.org/r/20240416031754.4076917-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for brelse() and __brelse()Matthew Wilcox (Oracle)
Move the documentation for __brelse() to brelse(), format it as kernel-doc and update it from talking about pages to folios. Link: https://lkml.kernel.org/r/20240416031754.4076917-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: fix __bread and __bread_gfp kernel-docMatthew Wilcox (Oracle)
The extra indentation confused the kernel-doc parser, so remove it. Fix some other wording while I'm here, and advise the user they need to call brelse() on this buffer. __bread_gfp() isn't used directly by filesystems, but the other wrappers for it don't have documentation, so document it accordingly. Link: https://lkml.kernel.org/r/20240416031754.4076917-5-willy@infradead.org Co-developed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for try_to_free_buffers()Matthew Wilcox (Oracle)
The documentation for this function has become separated from it over time; move it to the right place and turn it into kernel-doc. Mild editing of the content to make it more about what the function does, and less about how it does it. Link: https://lkml.kernel.org/r/20240416031754.4076917-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05buffer: add kernel-doc for block_dirty_folio()Matthew Wilcox (Oracle)
Turn the excellent documentation for this function into kernel-doc. Replace 'page' with 'folio' and make a few other minor updates. Link: https://lkml.kernel.org/r/20240416031754.4076917-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05fs/proc/task_mmu: fix uffd-wp confusion in pagemap_scan_pmd_entry()Ryan Roberts
pagemap_scan_pmd_entry() checks if uffd-wp is set on each pte to avoid unnecessary if set. However it was previously checking with `pte_uffd_wp(ptep_get(pte))` without first confirming that the pte was present. It is only valid to call pte_uffd_wp() for present ptes. For swap ptes, pte_swp_uffd_wp() must be called because the uffd-wp bit may be kept in a different position, depending on the arch. This was leading to test failures in the pagemap_ioctl mm selftest, when bringing up uffd-wp support on arm64 due to incorrectly interpretting the uffd-wp status of migration entries. Let's fix this by using the correct check based on pte_present(). While we are at it, let's pass the pte to make_uffd_wp_pte() to avoid the pointless extra ptep_get() which can't be optimized out due to READ_ONCE() on many arches. Link: https://lkml.kernel.org/r/20240429114104.182890-1-ryan.roberts@arm.com Fixes: 12f6b01a0bcb ("fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag") Closes: https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05fs/proc/task_mmu: fix loss of young/dirty bits during pagemap scanRyan Roberts
make_uffd_wp_pte() was previously doing: pte = ptep_get(ptep); ptep_modify_prot_start(ptep); pte = pte_mkuffd_wp(pte); ptep_modify_prot_commit(ptep, pte); But if another thread accessed or dirtied the pte between the first 2 calls, this could lead to loss of that information. Since ptep_modify_prot_start() gets and clears atomically, the following is the correct pattern and prevents any possible race. Any access after the first call would see an invalid pte and cause a fault: pte = ptep_modify_prot_start(ptep); pte = pte_mkuffd_wp(pte); ptep_modify_prot_commit(ptep, pte); Link: https://lkml.kernel.org/r/20240429114017.182570-1-ryan.roberts@arm.com Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/userfaultfd: reset ptes when close() for wr-protected onesPeter Xu
Userfaultfd unregister includes a step to remove wr-protect bits from all the relevant pgtable entries, but that only covered an explicit UFFDIO_UNREGISTER ioctl, not a close() on the userfaultfd itself. Cover that too. This fixes a WARN trace. The only user visible side effect is the user can observe leftover wr-protect bits even if the user close()ed on an userfaultfd when releasing the last reference of it. However hopefully that should be harmless, and nothing bad should happen even if so. This change is now more important after the recent page-table-check patch we merged in mm-unstable (446dd9ad37d0 ("mm/page_table_check: support userfault wr-protect entries")), as we'll do sanity check on uffd-wp bits without vma context. So it's better if we can 100% guarantee no uffd-wp bit leftovers, to make sure each report will be valid. Link: https://lore.kernel.org/all/000000000000ca4df20616a0fe16@google.com/ Fixes: f369b07c8614 ("mm/uffd: reset write protection when unregister with wp-mode") Analyzed-by: David Hildenbrand <david@redhat.com> Link: https://lkml.kernel.org/r/20240422133311.2987675-1-peterx@redhat.com Reported-by: syzbot+d8426b591c36b21c750e@syzkaller.appspotmail.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05epoll: be better about file lifetimesLinus Torvalds
epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-05Merge tag 'trace-v6.9-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and tracefs fixes from Steven Rostedt: - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization. - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file. - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement. * tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Have "events" directory get permissions from its parent eventfs: Do not treat events directory different than other directories eventfs: Do not differentiate the toplevel events directory tracefs: Still use mount point as default permissions for instances tracefs: Reset permissions on remount if permissions are options eventfs: Free all of the eventfs_inode after RCU eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-04ksmbd: do not grant v2 lease if parent lease key and epoch are not setNamjae Jeon
This patch fix xfstests generic/070 test with smb2 leases = yes. cifs.ko doesn't set parent lease key and epoch in create context v2 lease. ksmbd suppose that parent lease and epoch are vaild if data length is v2 lease context size and handle directory lease using this values. ksmbd should hanle it as v1 lease not v2 lease if parent lease key and epoch are not set in create context v2 lease. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-05-04ksmbd: use rwsem instead of rwlock for lease breakNamjae Jeon
lease break wait for lease break acknowledgment. rwsem is more suitable than unlock while traversing the list for parent lease break in ->m_op_list. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>