summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-01-13btrfs: open-code btrfs_copy_from_user()Qu Wenruo
The function btrfs_copy_from_user() handles the folio dirtying for buffered write. The original design is to allow that function to handle multiple folios, but since commit c87c299776e4 ("btrfs: make buffered write to copy one page a time") there is no need to support multiple folios. So here open-code btrfs_copy_from_user() to copy_folio_from_iter_atomic() and flush_dcache_folio() calls. The short-copy check and revert are still kept as-is. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: improve the warning and error message for btrfs_remove_qgroup()Qu Wenruo
[WARNING] There are several warnings about the recently introduced qgroup auto-removal that it triggers WARN_ON() for the non-zero rfer/excl numbers, e.g: ------------[ cut here ]------------ WARNING: CPU: 67 PID: 2882 at fs/btrfs/qgroup.c:1854 btrfs_remove_qgroup+0x3df/0x450 CPU: 67 UID: 0 PID: 2882 Comm: btrfs-cleaner Kdump: loaded Not tainted 6.11.6-300.fc41.x86_64 #1 RIP: 0010:btrfs_remove_qgroup+0x3df/0x450 Call Trace: <TASK> btrfs_qgroup_cleanup_dropped_subvolume+0x97/0xc0 btrfs_drop_snapshot+0x44e/0xa80 btrfs_clean_one_deleted_snapshot+0xc3/0x110 cleaner_kthread+0xd8/0x130 kthread+0xd2/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- BTRFS warning (device sda): to be deleted qgroup 0/319 has non-zero numbers, rfer 258478080 rfer_cmpr 258478080 excl 0 excl_cmpr 0 [CAUSE] Although the root cause is still unclear, as if qgroup is consistent a fully dropped subvolume (with extra transaction committed) should lead to all zero numbers for the qgroup. My current guess is the subvolume drop triggered the new subtree drop threshold thus marked qgroup inconsistent, then rescan cleared it but some corner case is not properly handled during subvolume dropping. But at least for this particular case, since it's only the rfer/excl not properly reset to 0, and qgroup is already marked inconsistent, there is nothing to be worried for the end users. The user space tool utilizing qgroup would queue a rescan to handle everything, so the kernel wanring is a little overkilled. [ENHANCEMENT] Enhance the warning inside btrfs_remove_qgroup() by: - Only do WARN() if CONFIG_BTRFS_DEBUG is enabled As explained the kernel can handle inconsistent qgroups by simply do a rescan, there is nothing to bother the end users. - Treat the reserved space leak the same as non-zero numbers By outputting the values and trigger a WARN() if it's a debug build. So far I haven't experienced any case related to reserved space so I hope we will never need to bother them. Fixes: 839d6ea4f86d ("btrfs: automatically remove the subvolume qgroup") Link: https://github.com/kdave/btrfs-progs/issues/922 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove detached list from struct btrfs_backref_cacheJosef Bacik
We don't ever look at this list, remove it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_nodeJosef Bacik
Before we were keeping all of our nodes on various lists in order to make sure everything got cleaned up correctly. We used node->lowest to indicate that node->lower was linked into the cache->leaves list. Now that we do cleanup based on the rb-tree both the list and the flag are useless, so delete them both. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify btrfs_backref_release_cache()Josef Bacik
We rely on finding all our nodes on the various lists in the backref cache, when they are all also in the rbtree. Instead just search through the rbtree and free everything. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: do not handle non-shareable roots in backref cacheJosef Bacik
Now that we handle relocation for non-shareable roots without using the backref cache, remove the ->cowonly field from the backref nodes and update the handling to throw an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't build backref tree for COW-only blocksJosef Bacik
We already determine the owner for any blocks we find when we're relocating, and for COW-only blocks (and the data reloc tree) we COW down to the block and call it good enough. However we still build a whole backref tree for them, even though we're not going to use it, and then just don't put these blocks in the cache. Rework the code to check if the block belongs to a COW-only root or the data reloc root, and then just cow down to the block, skipping the backref cache generation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove clone_backref_node() from relocationJosef Bacik
Since we no longer maintain backref cache across transactions, and this is only called when we're creating the reloc root for a newly created snapshot in the transaction critical section, we will end up doing a bunch of work that will just get thrown away when we start the transaction in the relocation loop. Delete this code as it no longer does anything for us. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify loop in select_reloc_root()Josef Bacik
We have this setup as a loop, but in reality we will never walk back up the backref tree, if we do then it's a bug. Get rid of the loop and handle the case where we have node->new_bytenr set at all. Previous check was only if node->new_bytenr != root->node->start, but if it did then we would hit the WARN_ON() and walk back up the tree. Instead we want to just return error if ->new_bytenr is set, and then do the normal updating of the node for the reloc root and carry on. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add a comment for new_bytenr in backref_cache_nodeJosef Bacik
Add a comment for this field so we know what it is used for. Previously we used it to update the backref cache, so people may mistakenly think it is useless, but in fact exists to make sure the backref cache makes sense. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the changed list for backref cacheJosef Bacik
Now that we're not updating the backref cache when we switch transids we can remove the changed list. We're going to keep the new_bytenr field because it serves as a good sanity check for the backref cache and relocation, and can prevent us from making extent tree corruption worse. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handlingJosef Bacik
This BUG_ON is meant to catch backref cache problems, but these can arise from either bugs in the backref cache or corruption in the extent tree. Fix it to be a proper error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix data race when accessing the inode's disk_i_size at ↵Hao-ran Zheng
btrfs_drop_extents() A data race occurs when the function `insert_ordered_extent_file_extent()` and the function `btrfs_inode_safe_disk_i_size_write()` are executed concurrently. The function `insert_ordered_extent_file_extent()` is not locked when reading inode->disk_i_size, causing `btrfs_inode_safe_disk_i_size_write()` to cause data competition when writing inode->disk_i_size, thus affecting the value of `modify_tree`. The specific call stack that appears during testing is as follows: ============DATA_RACE============ btrfs_drop_extents+0x89a/0xa060 [btrfs] insert_reserved_file_extent+0xb54/0x2960 [btrfs] insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs] btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs] btrfs_finish_ordered_io+0x37/0x60 [btrfs] finish_ordered_fn+0x3e/0x50 [btrfs] btrfs_work_helper+0x9c9/0x27a0 [btrfs] process_scheduled_works+0x716/0xf10 worker_thread+0xb6a/0x1190 kthread+0x292/0x330 ret_from_fork+0x4d/0x80 ret_from_fork_asm+0x1a/0x30 ============OTHER_INFO============ btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs] btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs] btrfs_finish_ordered_io+0x37/0x60 [btrfs] finish_ordered_fn+0x3e/0x50 [btrfs] btrfs_work_helper+0x9c9/0x27a0 [btrfs] process_scheduled_works+0x716/0xf10 worker_thread+0xb6a/0x1190 kthread+0x292/0x330 ret_from_fork+0x4d/0x80 ret_from_fork_asm+0x1a/0x30 ================================= The main purpose of the check of the inode's disk_i_size is to avoid taking write locks on a btree path when we have a write at or beyond EOF, since in these cases we don't expect to find extent items in the root to drop. However if we end up taking write locks due to a data race on disk_i_size, everything is still correct, we only add extra lock contention on the tree in case there's concurrency from other tasks. If the race causes us to not take write locks when we actually need them, then everything is functionally correct as well, since if we find out we have extent items to drop and we took read locks (modify_tree set to 0), we release the path and retry again with write locks. Since this data race does not affect the correctness of the function, it is a harmless data race, use data_race() to check inode->disk_i_size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't BUG_ON() in btrfs_drop_extents()Johannes Thumshirn
btrfs_drop_extents() calls BUG_ON() in case the counter of to be deleted extents is greater than 0. But all of these code paths can handle errors, so there's no need to crash the kernel. Instead WARN() that the condition has been met and gracefully bail out. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: zoned: reclaim unused zone by zone resettingNaohiro Aota
On the zoned mode, once used and freed region is still not reusable after the freeing. The underlying zone needs to be reset before reusing. Btrfs resets a zone when it removes a block group, and then new block group is allocated on the zones to reuse the zones. But, it is sometime too late to catch up with a write side. This commit introduces a new space-info reclaim method ZONE_RESET. That will pick a block group from the unused list and reset its zone to reuse the zone_unusable space. It is faster than removing the block group and re-creating a new block group on the same zones. For the first implementation, the ZONE_RESET is only applied to a block group whose region is fully zone_unusable. Reclaiming partial zone_unusable block group could be implemented later. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: drop fs_info argument from btrfs_update_space_info_*()Naohiro Aota
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have the fs_info in a space_info. So, we can drop fs_info argument from btrfs_update_space_info_*. There is no behavior change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: factor out btrfs_return_free_space()Naohiro Aota
Factor out a part of unpin_extent_range() that returns space back to the space info, prioritizing global block reserve. Also, move the "len" variable into the loop to clarify we don't need to carry it beyond an iteration. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: handle FS_IOC_READ_VERITY_METADATA ioctlAllison Karlitskaya
Commit 146054090b08 ("btrfs: initial fsverity support") introduced fs-verity support for btrfs, but didn't add support for FS_IOC_READ_VERITY_METADATA to directly query the Merkle tree, descriptor and signature blocks for fs-verity enabled files. Add the (trival) implementation: we just need to wire it through to the fs-verity code, the same way as is done in the other two filesystems which support this ioctl (ext4, f2fs). The fs-verity code already has access to the required data. This is also safe to backport to older stable trees (5.15+) if needed. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: send: remove redundant assignments to variable retColin Ian King
The variable ret is being initialized to zero and also later re-assigned to zero. In both cases the assignment is redundant since the value is never read after the assignment and hence they can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: use PTR_ERR() instead of PTR_ERR_OR_ZERO() for btrfs_get_extent()Qu Wenruo
The function btrfs_get_extent() will only return an PTR_ERR() or a valid extent map pointer. It will not return NULL. Thus the usage of PTR_ERR_OR_ZERO() inside submit_one_sector() is not needed, use plain PTR_ERR() instead, and that is the only usage of PTR_ERR_OR_ZERO() after btrfs_get_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: selftests: add delayed ref self test casesJosef Bacik
The recent fix for a stupid mistake I made uncovered the fact that we don't have adequate testing in the delayed refs code, as it took a pretty extensive and long running stress test to uncover something that a unit test would have uncovered right away. Fix this by adding a delayed refs self test suite. This will validate that the btrfs_ref transformation does the correct thing, that we do the correct thing when merging delayed refs, and that we get the delayed refs in the order that we expect. These are all crucial to how the delayed refs operate. I introduced various bugs (including the original bug) into the delayed refs code to validate that these tests caught all of the shenanigans that I could think of. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move select_delayed_ref() and export itJosef Bacik
This helper is how we select the delayed ref to run once we've selected the delayed ref head. I need this exported to add a unit test for delayed refs, and it's more natural home is in delayed-ref.c. Rename it to btrfs_select_delayed_ref and move it into delayed-ref.c. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUEPeter Zijlstra
Normally dequeue_entities() will continue to dequeue an empty group entity; except DELAY_DEQUEUE changes things -- it retains empty entities such that they might continue to compete and burn off some lag. However, doing this results in update_cfs_group() re-computing the cgroup weight 'slice' for an empty group, which it (rightly) figures isn't much at all. This in turn means that the delayed entity is not competing at the expected weight. Worse, the very low weight causes its lag to be inflated, which combined with avg_vruntime() using scale_load_down(), leads to artifacts. As such, don't adjust the weight for empty group entities and let them compete at their original weight. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
2025-01-13x86: Disable EXECMEM_ROX supportPeter Zijlstra
The whole module_writable_address() nonsense made a giant mess of alternative.c, not to mention it still contains bugs -- notable some of the CFI variants crash and burn. Mike has been working on patches to clean all this up again, but given the current state of things, this stuff just isn't ready. Disable for now, lets try again next cycle. Fixes: 5185e7f9f3bd ("x86/module: enable ROX caches for module text on 64 bit") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250113112934.GA8385@noisy.programming.kicks-ass.net
2025-01-13drm/tests: connector: Add ycbcr_420_allowed testsCristian Ciocaltea
Extend HDMI connector output format tests to verify its registration succeeds only when the presence of YUV420 in the supported formats matches the state of ycbcr_420_allowed flag. Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-4-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13drm/connector: hdmi: Validate supported_formats matches ycbcr_420_allowedCristian Ciocaltea
Ensure HDMI connector initialization fails when the presence of HDMI_COLORSPACE_YUV420 in the given supported_formats bitmask doesn't match the value of drm_connector->ycbcr_420_allowed. Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-3-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13drm/bridge-connector: Sync supported_formats with computed ycbcr_420_allowedCristian Ciocaltea
The case of having an HDMI bridge in the pipeline which advertises YUV420 capability via its ->supported_formats and a non-HDMI one that didn't enable ->ycbcr_420_allowed, is incorrectly handled because supported_formats is passed as is to the helper initializing the HDMI connector. Ensure HDMI_COLORSPACE_YUV420 is removed from the bitmask passed to drmm_connector_hdmi_init() when connector's ->ycbcr_420_allowed flag ends up not being set. Fixes: 3ced1c687512 ("drm/display: bridge_connector: handle ycbcr_420_allowed") Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-2-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13drm/bridge: Prioritize supported_formats over ycbcr_420_allowedCristian Ciocaltea
Bridges having DRM_BRIDGE_OP_HDMI set in their ->ops are supposed to rely on the ->supported_formats bitmask to advertise the permitted colorspaces, including HDMI_COLORSPACE_YUV420. However, a new flag ->ycbcr_420_allowed has been recently introduced, which brings the necessity to require redundant and potentially inconsistent information to be provided on HDMI bridges initialization. Adjust ->ycbcr_420_allowed for HDMI bridges according to ->supported_formats, right before adding them to the global bridge list. This keeps the initialization process straightforward and unambiguous, thereby preventing any further confusion. Fixes: 3ced1c687512 ("drm/display: bridge_connector: handle ycbcr_420_allowed") Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-1-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13pktgen: Avoid out-of-bounds access in get_imix_entriesArtem Chernyshev
Passing a sufficient amount of imix entries leads to invalid access to the pkt_dev->imix_entries array because of the incorrect boundary check. UBSAN: array-index-out-of-bounds in net/core/pktgen.c:874:24 index 20 is out of range for type 'imix_pkt [20]' CPU: 2 PID: 1210 Comm: bash Not tainted 6.10.0-rc1 #121 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl lib/dump_stack.c:117 __ubsan_handle_out_of_bounds lib/ubsan.c:429 get_imix_entries net/core/pktgen.c:874 pktgen_if_write net/core/pktgen.c:1063 pde_write fs/proc/inode.c:334 proc_reg_write fs/proc/inode.c:346 vfs_write fs/read_write.c:593 ksys_write fs/read_write.c:644 do_syscall_64 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe arch/x86/entry/entry_64.S:130 Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 52a62f8603f9 ("pktgen: Parse internet mix (imix) input") Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> [ fp: allow to fill the array completely; minor changelog cleanup ] Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-13ALSA: hda/realtek: Fix volume adjustment issue on Lenovo ThinkBook 16P Gen5Yage Geng
This patch fixes the volume adjustment issue on the Lenovo ThinkBook 16P Gen5 by applying the necessary quirk configuration for the Realtek ALC287 codec. The issue was caused by incorrect configuration in the driver, which prevented proper volume control on certain systems. Signed-off-by: Yage Geng <icoderdev@gmail.com> Link: https://patch.msgid.link/20250113085208.15351-1-icoderdev@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2025-01-13s390/bitops: Provide optimized arch_test_bit()Heiko Carstens
Provide an optimized arch_test_bit() implementation which makes use of flag output constraint. This generates slightly better code: bloat-o-meter: add/remove: 51/19 grow/shrink: 450/2444 up/down: 25198/-49136 (-23938) Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/bitops: Switch to generic bitopsHeiko Carstens
The generic bitops implementation is nearly identical to the s390 implementation therefore switch to the generic variant. This results in a small kernel image size decrease. This is because for the generic variant the nr parameter for most bitops functions is of type unsigned int while the s390 variant uses unsigned long. bloat-o-meter: add/remove: 670/670 grow/shrink: 167/209 up/down: 21440/-21792 (-352) Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/ebcdic: Fix length decrement in codepage_convert()Sven Schnelle
The inline assembly uses the ahi instruction to decrement and test whether more than 256 bytes are left for conversion. But the nr variable passed is of type unsigned long. Therefore use aghi. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reported-by: Jens Remus <jremus@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/ebcdic: Fix length check in codepage_convert()Sven Schnelle
The current code compares whether the nr argument is less or equal to zero. As nr is of type unsigned long, this isn't correct. Fix this by just testing for zero. This is also reported by checkpatch: unsignedLessThanZero: Checking if unsigned expression 'nr--' is less than zero. Reported-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/ebcdic: Use exrl instead of exSven Schnelle
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/amode31: Use exrl instead of exSven Schnelle
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/stackleak: Use exrl instead of ex in __stackleak_poison()Sven Schnelle
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/lib: Use exrl instead of ex in xor functionsSven Schnelle
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13s390/topology: Improve topology detectionMete Durlu
Add early polarization detection instead of assuming horizontal polarization. Signed-off-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-12cifs: support reconnect with alternate password for SMB1Meetakshi Setiya
SMB1 shares the mount and remount code paths with SMB2/3 and already supports password rotation in some scenarios. This patch extends the password rotation support to SMB1 reconnects as well. Cc: stable@vger.kernel.org Signed-off-by: Meetakshi Setiya <msetiya@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-12fs/proc: fix softlockup in __read_vmcore (part 2)Rik van Riel
Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the number of softlockups in __read_vmcore at kdump time have gone down, but they still happen sometimes. In a memory constrained environment like the kdump image, a softlockup is not just a harmless message, but it can interfere with things like RCU freeing memory, causing the crashdump to get stuck. The second loop in __read_vmcore has a lot more opportunities for natural sleep points, like scheduling out while waiting for a data write to happen, but apparently that is not always enough. Add a cond_resched() to the second loop in __read_vmcore to (hopefully) get rid of the softlockups. Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: fix assertion in folio_end_read()Matthew Wilcox (Oracle)
We only need to assert that the uptodate flag is clear if we're going to set it. This hasn't been a problem before now because we have only used folio_end_read() when completing with an error, but it's convenient to use it in squashfs if we discover the folio is already uptodate. Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.Donet Tom
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and pgdemote_khugepaged stats in vmstat are not being updated. Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") moved the pgdemote vmstat update from demote_folio_list() to shrink_inactive_list(), which is in the normal LRU path. As a result, the pgdemote stats are updated correctly for the normal LRU but not for MGLRU. To address this, we have added the pgdemote stat update in the evict_folios() function, which is in the MGLRU path. With this patch, the pgdemote stats will now be updated correctly when MGLRU is enabled. Without this patch vmstat output when MGLRU is enabled ====================================================== pgdemote_kswapd 0 pgdemote_direct 0 pgdemote_khugepaged 0 With this patch vmstat output when MGLRU is enabled =================================================== pgdemote_kswapd 43234 pgdemote_direct 4691 pgdemote_khugepaged 0 Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12vmstat: disable vmstat_work on vmstat_cpu_down_prep()Koichiro Den
The upstream commit adcfb264c3ed ("vmstat: disable vmstat_work on vmstat_cpu_down_prep()") introduced another warning during the boot phase so was soon reverted on upstream by commit cd6313beaeae ("Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()""). This commit resolves it and reattempts the original fix. Even after mm/vmstat:online teardown, shepherd may still queue work for the dying cpu until the cpu is removed from online mask. While it's quite rare, this means that after unbind_workers() unbinds a per-cpu kworker, it potentially runs vmstat_update for the dying CPU on an irrelevant cpu before entering atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the backtrace. BUG: using smp_processor_id() in preemptible [00000000] code: \ kworker/7:3/1702 caller is refresh_cpu_vm_stats+0x235/0x5f0 CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G Tainted: [N]=TEST Workqueue: mm_percpu_wq vmstat_update Call Trace: <TASK> dump_stack_lvl+0x8d/0xb0 check_preemption_disabled+0xce/0xe0 refresh_cpu_vm_stats+0x235/0x5f0 vmstat_update+0x17/0xa0 process_one_work+0x869/0x1aa0 worker_thread+0x5e5/0x1100 kthread+0x29e/0x380 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> So, for mm/vmstat:online, disable vmstat_work reliably on teardown and symmetrically enable it on startup. For secondary CPUs during CPU hotplug scenarios, ensure the delayed work is disabled immediately after the initialization. These CPUs are not yet online when start_shepherd_timer() runs on boot CPU. vmstat_cpu_online() will enable the work for them. Link: https://lkml.kernel.org/r/20250108042807.3429745-1-koichiro.den@canonical.com Signed-off-by: Huacai Chen <chenhuacai@kernel.org> Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Suggested-by: Huacai Chen <chenhuacai@kernel.org> Tested-by: Charalampos Mitrodimas <charmitro@posteo.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12zram: fix potential UAF of zram tableKairui Song
If zram_meta_alloc failed early, it frees allocated zram->table without setting it NULL. Which will potentially cause zram_meta_free to access the table if user reset an failed and uninitialized device. Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12selftests/mm: set allocated memory to non-zero content in cow testRyan Roberts
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp"), cow test cases involving swapping out THPs via madvise(MADV_PAGEOUT) started to be skipped due to the subsequent check via pagemap determining that the memory was not actually swapped out. Logs similar to this were emitted: ... # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB) ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB) ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB) ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled? ... The commit in question introduces the behaviour of scanning THPs and if their content is predominantly zero, it splits them and replaces the pages which are wholly zero with the zero page. These cow test cases were getting caught up in this. So let's avoid that by filling the contents of all allocated memory with a non-zero value. With this in place, the tests are passing again. Link: https://lkml.kernel.org/r/20250107142555.1870101-1-ryan.roberts@arm.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: clear uffd-wp PTE/PMD state on mremap()Ryan Roberts
When mremap()ing a memory region previously registered with userfaultfd as write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in flag clearing leads to a mismatch between the vma flags (which have uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to trigger a warning in page_table_check_pte_flags() due to setting the pte to writable while uffd-wp is still set. Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any such mremap() so that the values are consistent with the existing clearing of VM_UFFD_WP. Be careful to clear the logical flag regardless of its physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE, huge PMD and hugetlb paths. Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/ Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12module: fix writing of livepatch relocations in ROX textPetr Pavlu
A livepatch module can contain a special relocation section .klp.rela.<objname>.<secname> to apply its relocations at the appropriate time and to additionally access local and unexported symbols. When <objname> points to another module, such relocations are processed separately from the regular module relocation process. For instance, only when the target <objname> actually becomes loaded. With CONFIG_STRICT_MODULE_RWX, when the livepatch core decides to apply these relocations, their processing results in the following bug: [ 25.827238] BUG: unable to handle page fault for address: 00000000000012ba [ 25.827819] #PF: supervisor read access in kernel mode [ 25.828153] #PF: error_code(0x0000) - not-present page [ 25.828588] PGD 0 P4D 0 [ 25.829063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 25.829742] CPU: 2 UID: 0 PID: 452 Comm: insmod Tainted: G O K 6.13.0-rc4-00078-g059dd502b263 #7820 [ 25.830417] Tainted: [O]=OOT_MODULE, [K]=LIVEPATCH [ 25.830768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 25.831651] RIP: 0010:memcmp+0x24/0x60 [ 25.832190] Code: [...] [ 25.833378] RSP: 0018:ffffa40b403a3ae8 EFLAGS: 00000246 [ 25.833637] RAX: 0000000000000000 RBX: ffff93bc81d8e700 RCX: ffffffffc0202000 [ 25.834072] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000012ba [ 25.834548] RBP: ffffa40b403a3b68 R08: ffffa40b403a3b30 R09: 0000004a00000002 [ 25.835088] R10: ffffffffffffd222 R11: f000000000000000 R12: 0000000000000000 [ 25.835666] R13: ffffffffc02032ba R14: ffffffffc007d1e0 R15: 0000000000000004 [ 25.836139] FS: 00007fecef8c3080(0000) GS:ffff93bc8f900000(0000) knlGS:0000000000000000 [ 25.836519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 25.836977] CR2: 00000000000012ba CR3: 0000000002f24000 CR4: 00000000000006f0 [ 25.837442] Call Trace: [ 25.838297] <TASK> [ 25.841083] __write_relocate_add.constprop.0+0xc7/0x2b0 [ 25.841701] apply_relocate_add+0x75/0xa0 [ 25.841973] klp_write_section_relocs+0x10e/0x140 [ 25.842304] klp_write_object_relocs+0x70/0xa0 [ 25.842682] klp_init_object_loaded+0x21/0xf0 [ 25.842972] klp_enable_patch+0x43d/0x900 [ 25.843572] do_one_initcall+0x4c/0x220 [ 25.844186] do_init_module+0x6a/0x260 [ 25.844423] init_module_from_file+0x9c/0xe0 [ 25.844702] idempotent_init_module+0x172/0x270 [ 25.845008] __x64_sys_finit_module+0x69/0xc0 [ 25.845253] do_syscall_64+0x9e/0x1a0 [ 25.845498] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 25.846056] RIP: 0033:0x7fecef9eb25d [ 25.846444] Code: [...] [ 25.847563] RSP: 002b:00007ffd0c5d6de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 25.848082] RAX: ffffffffffffffda RBX: 000055b03f05e470 RCX: 00007fecef9eb25d [ 25.848456] RDX: 0000000000000000 RSI: 000055b001e74e52 RDI: 0000000000000003 [ 25.848969] RBP: 00007ffd0c5d6ea0 R08: 0000000000000040 R09: 0000000000004100 [ 25.849411] R10: 00007fecefac7b20 R11: 0000000000000246 R12: 000055b001e74e52 [ 25.849905] R13: 0000000000000000 R14: 000055b03f05e440 R15: 0000000000000000 [ 25.850336] </TASK> [ 25.850553] Modules linked in: deku(OK+) uinput [ 25.851408] CR2: 00000000000012ba [ 25.852085] ---[ end trace 0000000000000000 ]--- The problem is that the .klp.rela.<objname>.<secname> relocations are processed after the module was already formed and mod->rw_copy was reset. However, the code in __write_relocate_add() calls module_writable_address() which translates the target address 'loc' still to 'loc + (mem->rw_copy - mem->base)', with mem->rw_copy now being 0. Fix the problem by returning directly 'loc' in module_writable_address() when the module is already formed. Function __write_relocate_add() knows to use text_poke() in such a case. Link: https://lkml.kernel.org/r/20250107153507.14733-1-petr.pavlu@suse.com Fixes: 0c133b1e78cd ("module: prepare to handle ROX allocations for text") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reported-by: Marek Maslanka <mmaslanka@google.com> Closes: https://lore.kernel.org/linux-modules/CAGcaFA2hdThQV6mjD_1_U+GNHThv84+MQvMWLgEuX+LVbAyDxg@mail.gmail.com/ Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: zswap: properly synchronize freeing resources during CPU hotunplugYosry Ahmed
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as some of the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead() (i.e. acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp). The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") when the switch to the crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was retrieved using get_cpu_ptr() which disables preemption and makes sure the CPU cannot go away from under us. Preemption cannot be disabled with the crypto_acomp API as a sleepable context is needed. Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating and freeing resources with compression/decompression paths. Make sure that acomp_ctx.req is NULL when the resources are freed. In the compression/decompression paths, check if acomp_ctx.req is NULL after acquiring the mutex (meaning the CPU was offlined) and retry on the new CPU. The initialization of acomp_ctx.mutex is moved from the CPU hotplug callback to the pool initialization where it belongs (where the mutex is allocated). In addition to adding clarity, this makes sure that CPU hotplug cannot reinitialize a mutex that is already locked by compression/decompression. Previously a fix was attempted by holding cpus_read_lock() [1]. This would have caused a potential deadlock as it is possible for code already holding the lock to fall into reclaim and enter zswap (causing a deadlock). A fix was also attempted using SRCU for synchronization, but Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug notifiers [2]. Alternative fixes that were considered/attempted and could have worked: - Refcounting the per-CPU acomp_ctx. This involves complexity in handling the race between the refcount dropping to zero in zswap_[de]compress() and the refcount being re-initialized when the CPU is onlined. - Disabling migration before getting the per-CPU acomp_ctx [3], but that's discouraged and is a much bigger hammer than needed, and could result in subtle performance issues. [1]https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/ [2]https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/ [3]https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/ [yosryahmed@google.com: remove comment] Link: https://lkml.kernel.org/r/CAJD7tkaxS1wjn+swugt8QCvQ-rVF5RZnjxwPGX17k8x9zSManA@mail.gmail.com Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/ Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/ Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12Revert "mm: zswap: fix race between [de]compression and CPU hotunplug"Yosry Ahmed
This reverts commit eaebeb93922ca6ab0dd92027b73d0112701706ef. Commit eaebeb93922c ("mm: zswap: fix race between [de]compression and CPU hotunplug") used the CPU hotplug lock in zswap compress/decompress operations to protect against a race with CPU hotunplug making some per-CPU resources go away. However, zswap compress/decompress can be reached through reclaim while the lock is held, resulting in a potential deadlock as reported by syzbot: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Not tainted ------------------------------------------------------ kswapd0/89 is trying to acquire lock: ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: acomp_ctx_get_cpu mm/zswap.c:886 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_compress mm/zswap.c:908 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store_page mm/zswap.c:1439 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 but task is already holding lock: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 __fs_reclaim_acquire mm/page_alloc.c:3853 [inline] fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3867 might_alloc include/linux/sched/mm.h:318 [inline] slab_pre_alloc_hook mm/slub.c:4070 [inline] slab_alloc_node mm/slub.c:4148 [inline] __kmalloc_cache_node_noprof+0x40/0x3a0 mm/slub.c:4337 kmalloc_node_noprof include/linux/slab.h:924 [inline] alloc_worker kernel/workqueue.c:2638 [inline] create_worker+0x11b/0x720 kernel/workqueue.c:2781 workqueue_prepare_cpu+0xe3/0x170 kernel/workqueue.c:6628 cpuhp_invoke_callback+0x48d/0x830 kernel/cpu.c:194 __cpuhp_invoke_callback_range kernel/cpu.c:965 [inline] cpuhp_invoke_callback_range kernel/cpu.c:989 [inline] cpuhp_up_callbacks kernel/cpu.c:1020 [inline] _cpu_up+0x2b3/0x580 kernel/cpu.c:1690 cpu_up+0x184/0x230 kernel/cpu.c:1722 cpuhp_bringup_mask+0xdf/0x260 kernel/cpu.c:1788 cpuhp_bringup_cpus_parallel+0xf9/0x160 kernel/cpu.c:1878 bringup_nonboot_cpus+0x2b/0x50 kernel/cpu.c:1892 smp_init+0x34/0x150 kernel/smp.c:1009 kernel_init_freeable+0x417/0x5d0 init/main.c:1569 kernel_init+0x1d/0x2b0 init/main.c:1466 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (cpu_hotplug_lock){++++}-{0:0}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(cpu_hotplug_lock); lock(fs_reclaim); rlock(cpu_hotplug_lock); *** DEADLOCK *** 1 lock held by kswapd0/89: #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 stack backtrace: CPU: 0 UID: 0 PID: 89 Comm: kswapd0 Not tainted 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Revert the change. A different fix for the race with CPU hotunplug will follow. Link: https://lkml.kernel.org/r/20250107222236.2715883-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sam Sun <samsun1006219@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>