Age        Commit message        Author
2025-01-13  btrfs: use PTR_ERR() instead of PTR_ERR_OR_ZERO() for btrfs_get_extent()  (Qu Wenruo)
The function btrfs_get_extent() will only return a PTR_ERR() or a valid extent map pointer. It will not return NULL. Thus the usage of PTR_ERR_OR_ZERO() inside submit_one_sector() is not needed; use plain PTR_ERR() instead, and that is the only usage of PTR_ERR_OR_ZERO() after btrfs_get_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
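A minimal sketch of the call-site pattern being simplified (argument list and surrounding code are illustrative, not the actual submit_one_sector() hunk):

    struct extent_map *em;

    em = btrfs_get_extent(inode, NULL, start, len);    /* arguments illustrative */
    if (IS_ERR(em))
            return PTR_ERR(em);    /* never NULL, so PTR_ERR_OR_ZERO() is redundant here */
    /* ... use em ... */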
2025-01-13  btrfs: selftests: add delayed ref self test cases  (Josef Bacik)
The recent fix for a stupid mistake I made uncovered the fact that we don't have adequate testing in the delayed refs code, as it took a pretty extensive and long running stress test to uncover something that a unit test would have uncovered right away. Fix this by adding a delayed refs self test suite. This will validate that the btrfs_ref transformation does the correct thing, that we do the correct thing when merging delayed refs, and that we get the delayed refs in the order that we expect. These are all crucial to how the delayed refs operate. I introduced various bugs (including the original bug) into the delayed refs code to validate that these tests caught all of the shenanigans that I could think of. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13  btrfs: move select_delayed_ref() and export it  (Josef Bacik)
This helper is how we select the delayed ref to run once we've selected the delayed ref head. I need this exported to add a unit test for delayed refs, and its more natural home is delayed-ref.c. Rename it to btrfs_select_delayed_ref and move it into delayed-ref.c. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags  (Chengming Zhou)
When running hackbench in a cgroup with bandwidth throttling enabled, the following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat, the following sequence was observed:

    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
    ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
    # CPU8 goes into newidle balance and releases the rq lock
    ...
    # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags for the blocked entity; however, with the introduction of DELAY_DEQUEUE, the blocked task can wake up when newidle balance drops the runqueue lock during __schedule(). If a task wakes before psi_sched_switch() adjusts the PSI flags, skip any modifications in psi_enqueue() which would still see the flags of a running task and not a blocked one. Instead, rely on psi_sched_switch() to do the right thing. Since the status returned by try_to_block_task() may no longer be true by the time schedule reaches psi_sched_switch(), check if the task is blocked or not using a combination of task_on_rq_queued() and p->se.sched_delayed checks. [ prateek: Commit message, testing, early bailout in psi_enqueue() ] Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5 Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Link: https://lore.kernel.org/r/20241227061941.2315-1-kprateek.nayak@amd.com
2025-01-13  sched, psi: Don't account irq time if sched_clock_irqtime is disabled  (Yafang Shao)
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is nothing to account. We can save iterating the whole hierarchy on every tick and context switch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com
2025-01-13  sched: Don't account irq time if sched_clock_irqtime is disabled  (Yafang Shao)
sched_clock_irqtime may be disabled due to the clock source, in which case IRQ time should not be accounted. Let's add a conditional check to avoid unnecessary logic. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com
2025-01-13  sched: Define sched_clock_irqtime as static key  (Yafang Shao)
Since CPU time accounting is a performance-critical path, let's define sched_clock_irqtime as a static key to minimize potential overhead. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com
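A minimal sketch of the static-key pattern described (the hot-path check and helpers are condensed; surrounding accounting code is omitted):

    DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);

    void enable_sched_clock_irqtime(void)
    {
            static_branch_enable(&sched_clock_irqtime);
    }

    void disable_sched_clock_irqtime(void)
    {
            static_branch_disable(&sched_clock_irqtime);
    }

    /* hot path: compiles down to a patched no-op branch when the key is disabled */
    if (!static_branch_likely(&sched_clock_irqtime))
            return;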
2025-01-13  sched/fair: Do not compute overloaded status unnecessarily during lb  (K Prateek Nayak)
Only set sg_overloaded when computing sg_lb_stats() at the highest sched domain since rd->overloaded status is updated only when load balancing at the highest domain. While at it, move setting of sg_overloaded below idle_cpu() check since an idle CPU can never be overloaded. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com
2025-01-13  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb  (K Prateek Nayak)
Aggregate nr_numa_running and nr_preferred_running when load balancing at NUMA domains only. While at it, also move the aggregation below the idle_cpu() check since an idle CPU cannot have any preferred tasks. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com
2025-01-13  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally  (K Prateek Nayak)
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is enabled by the system. Without ITMT support being enabled, it returns 0, similar to current x86_die_flags() on non-Hybrid systems (!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES).

On Intel systems that enable ITMT support, either the MC domain coincides with the PKG domain, or in case of multiple MC groups within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the processor features a Hybrid core layout (X86_HYBRID_CPU), which leads to three distinct possibilities:

    o If PKG and MC domains coincide, the PKG domain is degenerated by sd_parent_degenerate() when building the sched domain topology.
    o If SNC is enabled, the PKG domain is never added since "x86_has_numa_in_package" is set and the topology will instead contain NODE and NUMA domains.
    o On X86_HYBRID_CPU which contains multiple MC groups within the PKG, the PKG domain requires x86_sched_itmt_flags().

Thus, on Intel systems that contain multiple MC groups within the PKG and enable ITMT support, the PKG domain requires x86_sched_itmt_flags(). In all other cases the PKG domain is either never added or is degenerated. Thus, returning x86_sched_itmt_flags() unconditionally at the PKG domain on Intel systems should not lead to any functional changes.

On AMD systems with multiple LLCs (MC groups) within a PKG domain, enabling ITMT support requires setting SD_ASYM_PACKING on the PKG domain since the core rankings are assigned PKG-wide. Core rankings on AMD processors are currently set by the amd-pstate driver when the Preferred Core feature is supported. A subset of systems that support the Preferred Core feature can be detected using X86_FEATURE_AMD_HETEROGENEOUS_CORES; however, this does not cover all the systems that support Preferred Core ranking. Detecting Preferred Core support on AMD systems requires inspecting CPPC Highest Perf on all present CPUs and checking if it differs on at least one CPU. A previous suggestion to use a synthetic feature to detect Preferred Core support [1] was found to be non-trivial to implement since the BSP alone cannot detect if Preferred Core is supported and, by the time an AP comes up, alternatives are patched and setting an X86_FEATURE_* bit is no longer possible.

Since x86 processors enabling ITMT support that consist of multiple non-NUMA MC groups within a PKG require the SD_ASYM_PACKING flag set at the PKG domain, return x86_sched_itmt_flags unconditionally for the PKG domain. Since x86_die_flags() would have just returned x86_sched_itmt_flags() after the change, remove the unnecessary wrapper and pass x86_sched_itmt_flags() directly as the flags function. Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-6-kprateek.nayak@amd.com
2025-01-13  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly  (K Prateek Nayak)
x86_*_flags() wrappers were introduced with commit d3d37d850d1d ("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add x86_sched_itmt_flags() in addition to the default domain flags for SMT and MC domain. commit 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags") removed the ITMT flags for SMT domain but not the x86_smt_flags() wrappers which directly returns cpu_smt_flags(). Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the flags for SMT domain. No functional changes intended. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-5-kprateek.nayak@amd.com
2025-01-13  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs  (K Prateek Nayak)
"sched_itmt_enabled" was only introduced as a debug toggle for any funky ITMT behavior. Move the sysctl controlled from "/proc/sys/kernel/sched_itmt_enabled" to debugfs at "/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a cat on the file will return "Y" or "N" instead of "1" or "0" to indicate that feature is enabled or disabled respectively. Either "0" or "N" (or any string that kstrtobool() interprets as false) can be written to the file will disable the feature, and writing either "1" or "Y" (or any string that kstrtobool() interprets as true) will enable it back when the platform supports ITMT ranking. Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the toggle was moved to "/sys/kernel/debug/x86/" as opposed to "/sys/kernel/debug/sched/" Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-4-kprateek.nayak@amd.com
2025-01-13  x86/itmt: Use guard() for itmt_update_mutex  (K Prateek Nayak)
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock() in the bailout and return paths. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-3-kprateek.nayak@amd.com
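A simplified sketch of the guard() conversion described (function body condensed, not the actual hunk):

    #include <linux/cleanup.h>

    int sched_set_itmt_support(void)        /* simplified */
    {
            guard(mutex)(&itmt_update_mutex);    /* dropped automatically on every return path */

            if (sched_itmt_capable)
                    return 0;

            sched_itmt_capable = true;
            /* ... enable the toggle, rebuild sched domains ... */
            return 0;
    }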
2025-01-13  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean  (K Prateek Nayak)
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert the unsigned int to bool since debugfs readily exposes boolean fops primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can streamline the conversion. Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled" can only be 0 or 1 and this datatype conversion should not cause any functional changes. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-2-kprateek.nayak@amd.com
2025-01-13  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()  (Hao Jia)
When the PLACE_LAG scheduling feature is enabled and dst_cfs_rq->nr_queued is greater than 1, if a task is ineligible (lag < 0) on the source cpu runqueue, it will also be ineligible when it is migrated to the destination cpu runqueue. Because we will keep the original equivalent lag of the task in place_entity(). So if the task was ineligible before, it will still be ineligible after migration. So in sched_balance_rq(), we prioritize migrating eligible tasks, and we soft-limit ineligible tasks, allowing them to migrate only when nr_balance_failed is non-zero to avoid load-balancing trying very hard to balance the load. Below are some benchmark test results. From my test results, this patch shows a slight improvement on hackbench.

Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a clean environment with cpu turbo disabled, and test machine is: Single NUMA machine model is 13th Gen Intel(R) Core(TM) i7-13700, 12 Core/24 HT. Based on master b86545e02e8c.

Results
=======

hackbench-process-pipes
                 vanilla              patched
    Amean  1    0.5837 (  0.00%)     0.5733 (  1.77%)
    Amean  4    1.4423 (  0.00%)     1.4503 ( -0.55%)
    Amean  7    2.5147 (  0.00%)     2.4773 (  1.48%)
    Amean  12   3.9347 (  0.00%)     3.8880 (  1.19%)
    Amean  21   5.3943 (  0.00%)     5.3873 (  0.13%)
    Amean  30   6.7840 (  0.00%)     6.6660 (  1.74%)
    Amean  48   9.8313 (  0.00%)     9.6100 (  2.25%)
    Amean  79  15.4403 (  0.00%)    14.9580 (  3.12%)
    Amean  96  18.4970 (  0.00%)    17.9533 (  2.94%)

hackbench-process-sockets
                 vanilla              patched
    Amean  1    0.6297 (  0.00%)     0.6223 (  1.16%)
    Amean  4    2.1517 (  0.00%)     2.0887 (  2.93%)
    Amean  7    3.6377 (  0.00%)     3.5670 (  1.94%)
    Amean  12   6.1277 (  0.00%)     5.9290 (  3.24%)
    Amean  21  10.0380 (  0.00%)     9.7623 (  2.75%)
    Amean  30  14.1517 (  0.00%)    13.7513 (  2.83%)
    Amean  48  24.7253 (  0.00%)    24.2287 (  2.01%)
    Amean  79  43.9523 (  0.00%)    43.2330 (  1.64%)
    Amean  96  54.5310 (  0.00%)    53.7650 (  1.40%)

tbench4 Throughput
                 vanilla              patched
    Hmean  1    255.97 (  0.00%)     275.01 (  7.44%)
    Hmean  2    511.60 (  0.00%)     544.27 (  6.39%)
    Hmean  4    996.70 (  0.00%)    1006.57 (  0.99%)
    Hmean  8   1646.46 (  0.00%)    1649.15 (  0.16%)
    Hmean  16  2259.42 (  0.00%)    2274.35 (  0.66%)
    Hmean  32  4725.48 (  0.00%)    4735.57 (  0.21%)
    Hmean  64  4411.47 (  0.00%)    4400.05 ( -0.26%)
    Hmean  96  4284.31 (  0.00%)    4267.39 ( -0.39%)

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Hao Jia <jiahao1@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241223091446.90208-1-jiahao.kernel@gmail.com
2025-01-13  sched/debug: Change need_resched warnings to pr_err  (David Rientjes)
need_resched warnings, if enabled, are treated as WARNINGs. If kernel.panic_on_warn is enabled, then this causes a kernel panic. It's highly unlikely that a panic is desired for these warnings, only a stack trace is normally required to debug and resolve. Thus, switch need_resched warnings to simply be a printk with an associated stack trace so they are no longer in scope for panic_on_warn. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Josh Don <joshdon@google.com> Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com
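A simplified sketch of the resulting pattern (message text and helper body condensed): report via pr_err() plus an explicit dump_stack() so the diagnostic stays outside panic_on_warn's scope.

    static inline void resched_latency_warn(int cpu, u64 latency)    /* simplified */
    {
            /* pr_err() + dump_stack() instead of WARN(): not subject to panic_on_warn */
            pr_err("sched: CPU %d need_resched set for > %llu ns without schedule\n",
                   cpu, latency);
            dump_stack();
    }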
2025-01-13  sched/fair: Encapsulate set custom slice in a __setparam_fair() function  (Vincent Guittot)
Similarly to dl, create a __setparam_fair() function to set parameters related to fair class and move it in the fair.c file. No functional changes expected Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org
2025-01-13  sched: Fix race between yield_to() and try_to_wake_up()  (Tianchen Ding)
We met a SCHED_WARN in set_next_buddy(): __warn_printk set_next_buddy yield_to_task_fair yield_to kvm_vcpu_yield_to [kvm] ... After a short dig, we found the rq_lock held by yield_to() may not be exactly the rq that the target task belongs to. There is a race window against try_to_wake_up(). CPU0 target_task blocking on CPU1 lock rq0 & rq1 double check task_rq == p_rq, ok woken to CPU2 (lock task_pi & rq2) task_rq = rq2 yield_to_task_fair (w/o lock rq2) In this race window, yield_to() is operating the task w/o the correct lock. Fix this by taking task pi_lock first. Fixes: d95f41220065 ("sched: Add yield_to(task, preempt) functionality") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241231055020.6521-1-dtcccc@linux.alibaba.com
2025-01-13  sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE  (Peter Zijlstra)
Normally dequeue_entities() will continue to dequeue an empty group entity; except DELAY_DEQUEUE changes things -- it retains empty entities such that they might continue to compete and burn off some lag. However, doing this results in update_cfs_group() re-computing the cgroup weight 'slice' for an empty group, which it (rightly) figures isn't much at all. This in turn means that the delayed entity is not competing at the expected weight. Worse, the very low weight causes its lag to be inflated, which combined with avg_vruntime() using scale_load_down(), leads to artifacts. As such, don't adjust the weight for empty group entities and let them compete at their original weight. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250110115720.GA17405@noisy.programming.kicks-ass.net
2025-01-13  x86: Disable EXECMEM_ROX support  (Peter Zijlstra)
The whole module_writable_address() nonsense made a giant mess of alternative.c, not to mention it still contains bugs -- notably some of the CFI variants crash and burn. Mike has been working on patches to clean all this up again, but given the current state of things, this stuff just isn't ready. Disable it for now; let's try again next cycle. Fixes: 5185e7f9f3bd ("x86/module: enable ROX caches for module text on 64 bit") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250113112934.GA8385@noisy.programming.kicks-ass.net
2025-01-13  drm/tests: connector: Add ycbcr_420_allowed tests  (Cristian Ciocaltea)
Extend HDMI connector output format tests to verify its registration succeeds only when the presence of YUV420 in the supported formats matches the state of ycbcr_420_allowed flag. Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-4-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13  drm/connector: hdmi: Validate supported_formats matches ycbcr_420_allowed  (Cristian Ciocaltea)
Ensure HDMI connector initialization fails when the presence of HDMI_COLORSPACE_YUV420 in the given supported_formats bitmask doesn't match the value of drm_connector->ycbcr_420_allowed. Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-3-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13  drm/bridge-connector: Sync supported_formats with computed ycbcr_420_allowed  (Cristian Ciocaltea)
The case of having an HDMI bridge in the pipeline which advertises YUV420 capability via its ->supported_formats and a non-HDMI one that didn't enable ->ycbcr_420_allowed, is incorrectly handled because supported_formats is passed as is to the helper initializing the HDMI connector. Ensure HDMI_COLORSPACE_YUV420 is removed from the bitmask passed to drmm_connector_hdmi_init() when connector's ->ycbcr_420_allowed flag ends up not being set. Fixes: 3ced1c687512 ("drm/display: bridge_connector: handle ycbcr_420_allowed") Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-2-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13  drm/bridge: Prioritize supported_formats over ycbcr_420_allowed  (Cristian Ciocaltea)
Bridges having DRM_BRIDGE_OP_HDMI set in their ->ops are supposed to rely on the ->supported_formats bitmask to advertise the permitted colorspaces, including HDMI_COLORSPACE_YUV420. However, a new flag ->ycbcr_420_allowed has been recently introduced, which brings the necessity to require redundant and potentially inconsistent information to be provided on HDMI bridges initialization. Adjust ->ycbcr_420_allowed for HDMI bridges according to ->supported_formats, right before adding them to the global bridge list. This keeps the initialization process straightforward and unambiguous, thereby preventing any further confusion. Fixes: 3ced1c687512 ("drm/display: bridge_connector: handle ycbcr_420_allowed") Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20241224-bridge-conn-fmt-prio-v4-1-a9ceb5671379@collabora.com Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-13  pktgen: Avoid out-of-bounds access in get_imix_entries  (Artem Chernyshev)
Passing a sufficient amount of imix entries leads to invalid access to the pkt_dev->imix_entries array because of the incorrect boundary check. UBSAN: array-index-out-of-bounds in net/core/pktgen.c:874:24 index 20 is out of range for type 'imix_pkt [20]' CPU: 2 PID: 1210 Comm: bash Not tainted 6.10.0-rc1 #121 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Call Trace: <TASK> dump_stack_lvl lib/dump_stack.c:117 __ubsan_handle_out_of_bounds lib/ubsan.c:429 get_imix_entries net/core/pktgen.c:874 pktgen_if_write net/core/pktgen.c:1063 pde_write fs/proc/inode.c:334 proc_reg_write fs/proc/inode.c:346 vfs_write fs/read_write.c:593 ksys_write fs/read_write.c:644 do_syscall_64 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe arch/x86/entry/entry_64.S:130 Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 52a62f8603f9 ("pktgen: Parse internet mix (imix) input") Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> [ fp: allow to fill the array completely; minor changelog cleanup ] Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
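A hedged sketch of the boundary-check shape involved (exact placement inside get_imix_entries() may differ; the point is to reject input before writing past the fixed-size array while still allowing all slots to be filled):

    /* bail out before writing past the end of pkt_dev->imix_entries[],
     * while still allowing all MAX_IMIX_ENTRIES slots to be used */
    if (pkt_dev->n_imix_entries >= MAX_IMIX_ENTRIES)
            return -E2BIG;

    pkt_dev->imix_entries[pkt_dev->n_imix_entries].size = size;
    pkt_dev->imix_entries[pkt_dev->n_imix_entries].weight = weight;
    pkt_dev->n_imix_entries++;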
2025-01-13  ALSA: hda/realtek: Fix volume adjustment issue on Lenovo ThinkBook 16P Gen5  (Yage Geng)
This patch fixes the volume adjustment issue on the Lenovo ThinkBook 16P Gen5 by applying the necessary quirk configuration for the Realtek ALC287 codec. The issue was caused by incorrect configuration in the driver, which prevented proper volume control on certain systems. Signed-off-by: Yage Geng <icoderdev@gmail.com> Link: https://patch.msgid.link/20250113085208.15351-1-icoderdev@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2025-01-13  s390/bitops: Provide optimized arch_test_bit()  (Heiko Carstens)
Provide an optimized arch_test_bit() implementation which makes use of flag output constraint. This generates slightly better code: bloat-o-meter: add/remove: 51/19 grow/shrink: 450/2444 up/down: 25198/-49136 (-23938) Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/bitops: Switch to generic bitops  (Heiko Carstens)
The generic bitops implementation is nearly identical to the s390 implementation therefore switch to the generic variant. This results in a small kernel image size decrease. This is because for the generic variant the nr parameter for most bitops functions is of type unsigned int while the s390 variant uses unsigned long. bloat-o-meter: add/remove: 670/670 grow/shrink: 167/209 up/down: 21440/-21792 (-352) Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/ebcdic: Fix length decrement in codepage_convert()  (Sven Schnelle)
The inline assembly uses the ahi instruction to decrement and test whether more than 256 bytes are left for conversion. But the nr variable passed is of type unsigned long. Therefore use aghi. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reported-by: Jens Remus <jremus@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/ebcdic: Fix length check in codepage_convert()  (Sven Schnelle)
The current code compares whether the nr argument is less or equal to zero. As nr is of type unsigned long, this isn't correct. Fix this by just testing for zero. This is also reported by checkpatch: unsignedLessThanZero: Checking if unsigned expression 'nr--' is less than zero. Reported-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
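A small illustration of the bug class described (prototype illustrative, not the s390 assembly wrapper itself):

    static void codepage_convert(const unsigned char *codepage,
                                 unsigned char *addr, unsigned long nr)
    {
            if (!nr)        /* "nr <= 0" on an unsigned type can only ever mean nr == 0 */
                    return;
            /* ... */
    }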
2025-01-13  s390/ebcdic: Use exrl instead of ex  (Sven Schnelle)
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/amode31: Use exrl instead of ex  (Sven Schnelle)
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/stackleak: Use exrl instead of ex in __stackleak_poison()  (Sven Schnelle)
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/lib: Use exrl instead of ex in xor functions  (Sven Schnelle)
exrl is present in all machines currently supported, therefore prefer it over ex. This saves one instruction and doesn't need an additional register to hold the address of the target instruction. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-13  s390/topology: Improve topology detection  (Mete Durlu)
Add early polarization detection instead of assuming horizontal polarization. Signed-off-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-01-12  cifs: support reconnect with alternate password for SMB1  (Meetakshi Setiya)
SMB1 shares the mount and remount code paths with SMB2/3 and already supports password rotation in some scenarios. This patch extends the password rotation support to SMB1 reconnects as well. Cc: stable@vger.kernel.org Signed-off-by: Meetakshi Setiya <msetiya@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-12  fs/proc: fix softlockup in __read_vmcore (part 2)  (Rik van Riel)
Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the number of softlockups in __read_vmcore at kdump time has gone down, but they still happen sometimes. In a memory constrained environment like the kdump image, a softlockup is not just a harmless message, but it can interfere with things like RCU freeing memory, causing the crashdump to get stuck. The second loop in __read_vmcore has a lot more opportunities for natural sleep points, like scheduling out while waiting for a data write to happen, but apparently that is not always enough. Add a cond_resched() to the second loop in __read_vmcore to (hopefully) get rid of the softlockups. Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
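A minimal sketch of the kind of change described (loop body condensed):

    /* second loop in __read_vmcore(), condensed */
    while (count) {
            /* ... copy the next chunk of the dump to userspace ... */
            cond_resched();    /* explicit scheduling point for the memory-constrained kdump kernel */
    }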
2025-01-12  mm: fix assertion in folio_end_read()  (Matthew Wilcox (Oracle))
We only need to assert that the uptodate flag is clear if we're going to set it. This hasn't been a problem before now because we have only used folio_end_read() when completing with an error, but it's convenient to use it in squashfs if we discover the folio is already uptodate. Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12  mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.  (Donet Tom)
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and pgdemote_khugepaged stats in vmstat are not being updated. Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") moved the pgdemote vmstat update from demote_folio_list() to shrink_inactive_list(), which is in the normal LRU path. As a result, the pgdemote stats are updated correctly for the normal LRU but not for MGLRU. To address this, we have added the pgdemote stat update in the evict_folios() function, which is in the MGLRU path. With this patch, the pgdemote stats will now be updated correctly when MGLRU is enabled. Without this patch vmstat output when MGLRU is enabled ====================================================== pgdemote_kswapd 0 pgdemote_direct 0 pgdemote_khugepaged 0 With this patch vmstat output when MGLRU is enabled =================================================== pgdemote_kswapd 43234 pgdemote_direct 4691 pgdemote_khugepaged 0 Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12  vmstat: disable vmstat_work on vmstat_cpu_down_prep()  (Koichiro Den)
The upstream commit adcfb264c3ed ("vmstat: disable vmstat_work on vmstat_cpu_down_prep()") introduced another warning during the boot phase so was soon reverted on upstream by commit cd6313beaeae ("Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()""). This commit resolves it and reattempts the original fix. Even after mm/vmstat:online teardown, shepherd may still queue work for the dying cpu until the cpu is removed from online mask. While it's quite rare, this means that after unbind_workers() unbinds a per-cpu kworker, it potentially runs vmstat_update for the dying CPU on an irrelevant cpu before entering atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the backtrace. BUG: using smp_processor_id() in preemptible [00000000] code: \ kworker/7:3/1702 caller is refresh_cpu_vm_stats+0x235/0x5f0 CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G Tainted: [N]=TEST Workqueue: mm_percpu_wq vmstat_update Call Trace: <TASK> dump_stack_lvl+0x8d/0xb0 check_preemption_disabled+0xce/0xe0 refresh_cpu_vm_stats+0x235/0x5f0 vmstat_update+0x17/0xa0 process_one_work+0x869/0x1aa0 worker_thread+0x5e5/0x1100 kthread+0x29e/0x380 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> So, for mm/vmstat:online, disable vmstat_work reliably on teardown and symmetrically enable it on startup. For secondary CPUs during CPU hotplug scenarios, ensure the delayed work is disabled immediately after the initialization. These CPUs are not yet online when start_shepherd_timer() runs on boot CPU. vmstat_cpu_online() will enable the work for them. Link: https://lkml.kernel.org/r/20250108042807.3429745-1-koichiro.den@canonical.com Signed-off-by: Huacai Chen <chenhuacai@kernel.org> Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Suggested-by: Huacai Chen <chenhuacai@kernel.org> Tested-by: Charalampos Mitrodimas <charmitro@posteo.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
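A hedged sketch of the teardown/startup symmetry described (callback names follow the changelog; error handling and stat refresh are omitted):

    static int vmstat_cpu_down_prep(unsigned int cpu)
    {
            /* reliably stop, and block further queueing of, this CPU's work */
            disable_delayed_work_sync(&per_cpu(vmstat_work, cpu));
            return 0;
    }

    static int vmstat_cpu_online(unsigned int cpu)
    {
            enable_delayed_work(&per_cpu(vmstat_work, cpu));
            /* ... refresh zone stats, resume shepherding ... */
            return 0;
    }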
2025-01-12  zram: fix potential UAF of zram table  (Kairui Song)
If zram_meta_alloc fails early, it frees the allocated zram->table without setting it to NULL, which can cause zram_meta_free to access the table if the user resets a failed and uninitialized device. Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
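A minimal sketch of the fix shape (error path condensed, not the actual hunk):

    static bool zram_meta_alloc(struct zram *zram, u64 disksize)    /* condensed */
    {
            /* ... allocate zram->table and zram->mem_pool ... */
            if (!zram->mem_pool) {
                    vfree(zram->table);
                    zram->table = NULL;    /* a later reset must not see the stale pointer */
                    return false;
            }
            return true;
    }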
2025-01-12  selftests/mm: set allocated memory to non-zero content in cow test  (Ryan Roberts)
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp"), cow test cases involving swapping out THPs via madvise(MADV_PAGEOUT) started to be skipped due to the subsequent check via pagemap determining that the memory was not actually swapped out. Logs similar to this were emitted: ... # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB) ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB) ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB) ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled? ... The commit in question introduces the behaviour of scanning THPs and if their content is predominantly zero, it splits them and replaces the pages which are wholly zero with the zero page. These cow test cases were getting caught up in this. So let's avoid that by filling the contents of all allocated memory with a non-zero value. With this in place, the tests are passing again. Link: https://lkml.kernel.org/r/20250107142555.1870101-1-ryan.roberts@arm.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
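A sketch of the kind of change described in the selftest (helper name hypothetical):

    #include <stdlib.h>
    #include <string.h>

    static void *alloc_test_mem(size_t size)    /* hypothetical helper */
    {
            char *mem = malloc(size);

            if (!mem)
                    return NULL;
            /*
             * Fill with a non-zero pattern so paged-out THPs are not detected
             * as zero-filled and remapped to the shared zeropage on split.
             */
            memset(mem, 0x5a, size);
            return mem;
    }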
2025-01-12  mm: clear uffd-wp PTE/PMD state on mremap()  (Ryan Roberts)
When mremap()ing a memory region previously registered with userfaultfd as write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in flag clearing leads to a mismatch between the vma flags (which have uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to trigger a warning in page_table_check_pte_flags() due to setting the pte to writable while uffd-wp is still set. Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any such mremap() so that the values are consistent with the existing clearing of VM_UFFD_WP. Be careful to clear the logical flag regardless of its physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE, huge PMD and hugetlb paths. Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/ Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
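A hedged sketch of the PTE-level clearing described, covering the present and swap PTE forms (helper name hypothetical; the huge PMD and hugetlb paths follow the same idea):

    static pte_t mremap_clear_uffd_wp(pte_t pte)    /* hypothetical helper */
    {
            if (pte_present(pte))
                    return pte_clear_uffd_wp(pte);
            /* swap entry or PTE marker: clear the swap-form uffd-wp bit */
            return pte_swp_clear_uffd_wp(pte);
    }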
2025-01-12  module: fix writing of livepatch relocations in ROX text  (Petr Pavlu)
A livepatch module can contain a special relocation section .klp.rela.<objname>.<secname> to apply its relocations at the appropriate time and to additionally access local and unexported symbols. When <objname> points to another module, such relocations are processed separately from the regular module relocation process. For instance, only when the target <objname> actually becomes loaded. With CONFIG_STRICT_MODULE_RWX, when the livepatch core decides to apply these relocations, their processing results in the following bug: [ 25.827238] BUG: unable to handle page fault for address: 00000000000012ba [ 25.827819] #PF: supervisor read access in kernel mode [ 25.828153] #PF: error_code(0x0000) - not-present page [ 25.828588] PGD 0 P4D 0 [ 25.829063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 25.829742] CPU: 2 UID: 0 PID: 452 Comm: insmod Tainted: G O K 6.13.0-rc4-00078-g059dd502b263 #7820 [ 25.830417] Tainted: [O]=OOT_MODULE, [K]=LIVEPATCH [ 25.830768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 25.831651] RIP: 0010:memcmp+0x24/0x60 [ 25.832190] Code: [...] [ 25.833378] RSP: 0018:ffffa40b403a3ae8 EFLAGS: 00000246 [ 25.833637] RAX: 0000000000000000 RBX: ffff93bc81d8e700 RCX: ffffffffc0202000 [ 25.834072] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000012ba [ 25.834548] RBP: ffffa40b403a3b68 R08: ffffa40b403a3b30 R09: 0000004a00000002 [ 25.835088] R10: ffffffffffffd222 R11: f000000000000000 R12: 0000000000000000 [ 25.835666] R13: ffffffffc02032ba R14: ffffffffc007d1e0 R15: 0000000000000004 [ 25.836139] FS: 00007fecef8c3080(0000) GS:ffff93bc8f900000(0000) knlGS:0000000000000000 [ 25.836519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 25.836977] CR2: 00000000000012ba CR3: 0000000002f24000 CR4: 00000000000006f0 [ 25.837442] Call Trace: [ 25.838297] <TASK> [ 25.841083] __write_relocate_add.constprop.0+0xc7/0x2b0 [ 25.841701] apply_relocate_add+0x75/0xa0 [ 25.841973] klp_write_section_relocs+0x10e/0x140 [ 25.842304] klp_write_object_relocs+0x70/0xa0 [ 25.842682] klp_init_object_loaded+0x21/0xf0 [ 25.842972] klp_enable_patch+0x43d/0x900 [ 25.843572] do_one_initcall+0x4c/0x220 [ 25.844186] do_init_module+0x6a/0x260 [ 25.844423] init_module_from_file+0x9c/0xe0 [ 25.844702] idempotent_init_module+0x172/0x270 [ 25.845008] __x64_sys_finit_module+0x69/0xc0 [ 25.845253] do_syscall_64+0x9e/0x1a0 [ 25.845498] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 25.846056] RIP: 0033:0x7fecef9eb25d [ 25.846444] Code: [...] [ 25.847563] RSP: 002b:00007ffd0c5d6de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 25.848082] RAX: ffffffffffffffda RBX: 000055b03f05e470 RCX: 00007fecef9eb25d [ 25.848456] RDX: 0000000000000000 RSI: 000055b001e74e52 RDI: 0000000000000003 [ 25.848969] RBP: 00007ffd0c5d6ea0 R08: 0000000000000040 R09: 0000000000004100 [ 25.849411] R10: 00007fecefac7b20 R11: 0000000000000246 R12: 000055b001e74e52 [ 25.849905] R13: 0000000000000000 R14: 000055b03f05e440 R15: 0000000000000000 [ 25.850336] </TASK> [ 25.850553] Modules linked in: deku(OK+) uinput [ 25.851408] CR2: 00000000000012ba [ 25.852085] ---[ end trace 0000000000000000 ]--- The problem is that the .klp.rela.<objname>.<secname> relocations are processed after the module was already formed and mod->rw_copy was reset. However, the code in __write_relocate_add() calls module_writable_address() which translates the target address 'loc' still to 'loc + (mem->rw_copy - mem->base)', with mem->rw_copy now being 0. 
Fix the problem by returning directly 'loc' in module_writable_address() when the module is already formed. Function __write_relocate_add() knows to use text_poke() in such a case. Link: https://lkml.kernel.org/r/20250107153507.14733-1-petr.pavlu@suse.com Fixes: 0c133b1e78cd ("module: prepare to handle ROX allocations for text") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reported-by: Marek Maslanka <mmaslanka@google.com> Closes: https://lore.kernel.org/linux-modules/CAGcaFA2hdThQV6mjD_1_U+GNHThv84+MQvMWLgEuX+LVbAyDxg@mail.gmail.com/ Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12  mm: zswap: properly synchronize freeing resources during CPU hotunplug  (Yosry Ahmed)
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as some of the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead() (i.e. acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp). The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") when the switch to the crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was retrieved using get_cpu_ptr() which disables preemption and makes sure the CPU cannot go away from under us. Preemption cannot be disabled with the crypto_acomp API as a sleepable context is needed. Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating and freeing resources with compression/decompression paths. Make sure that acomp_ctx.req is NULL when the resources are freed. In the compression/decompression paths, check if acomp_ctx.req is NULL after acquiring the mutex (meaning the CPU was offlined) and retry on the new CPU. The initialization of acomp_ctx.mutex is moved from the CPU hotplug callback to the pool initialization where it belongs (where the mutex is allocated). In addition to adding clarity, this makes sure that CPU hotplug cannot reinitialize a mutex that is already locked by compression/decompression. Previously a fix was attempted by holding cpus_read_lock() [1]. This would have caused a potential deadlock as it is possible for code already holding the lock to fall into reclaim and enter zswap (causing a deadlock). A fix was also attempted using SRCU for synchronization, but Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug notifiers [2]. Alternative fixes that were considered/attempted and could have worked: - Refcounting the per-CPU acomp_ctx. This involves complexity in handling the race between the refcount dropping to zero in zswap_[de]compress() and the refcount being re-initialized when the CPU is onlined. - Disabling migration before getting the per-CPU acomp_ctx [3], but that's discouraged and is a much bigger hammer than needed, and could result in subtle performance issues. 
[1]https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/ [2]https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/ [3]https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/ [yosryahmed@google.com: remove comment] Link: https://lkml.kernel.org/r/CAJD7tkaxS1wjn+swugt8QCvQ-rVF5RZnjxwPGX17k8x9zSManA@mail.gmail.com Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/ Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/ Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
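A hedged sketch of the acquire/retry pattern described above (helper name and field layout approximate):

    static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
    {
            struct crypto_acomp_ctx *ctx;

            for (;;) {
                    ctx = raw_cpu_ptr(pool->acomp_ctx);
                    mutex_lock(&ctx->mutex);
                    if (likely(ctx->req))
                            return ctx;    /* returned with the mutex held */
                    /*
                     * The CPU was offlined and the hotplug callback freed the
                     * resources: drop the lock and retry on the current CPU.
                     */
                    mutex_unlock(&ctx->mutex);
            }
    }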
2025-01-12  Revert "mm: zswap: fix race between [de]compression and CPU hotunplug"  (Yosry Ahmed)
This reverts commit eaebeb93922ca6ab0dd92027b73d0112701706ef. Commit eaebeb93922c ("mm: zswap: fix race between [de]compression and CPU hotunplug") used the CPU hotplug lock in zswap compress/decompress operations to protect against a race with CPU hotunplug making some per-CPU resources go away. However, zswap compress/decompress can be reached through reclaim while the lock is held, resulting in a potential deadlock as reported by syzbot: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Not tainted ------------------------------------------------------ kswapd0/89 is trying to acquire lock: ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: acomp_ctx_get_cpu mm/zswap.c:886 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_compress mm/zswap.c:908 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store_page mm/zswap.c:1439 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 but task is already holding lock: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 __fs_reclaim_acquire mm/page_alloc.c:3853 [inline] fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3867 might_alloc include/linux/sched/mm.h:318 [inline] slab_pre_alloc_hook mm/slub.c:4070 [inline] slab_alloc_node mm/slub.c:4148 [inline] __kmalloc_cache_node_noprof+0x40/0x3a0 mm/slub.c:4337 kmalloc_node_noprof include/linux/slab.h:924 [inline] alloc_worker kernel/workqueue.c:2638 [inline] create_worker+0x11b/0x720 kernel/workqueue.c:2781 workqueue_prepare_cpu+0xe3/0x170 kernel/workqueue.c:6628 cpuhp_invoke_callback+0x48d/0x830 kernel/cpu.c:194 __cpuhp_invoke_callback_range kernel/cpu.c:965 [inline] cpuhp_invoke_callback_range kernel/cpu.c:989 [inline] cpuhp_up_callbacks kernel/cpu.c:1020 [inline] _cpu_up+0x2b3/0x580 kernel/cpu.c:1690 cpu_up+0x184/0x230 kernel/cpu.c:1722 cpuhp_bringup_mask+0xdf/0x260 kernel/cpu.c:1788 cpuhp_bringup_cpus_parallel+0xf9/0x160 kernel/cpu.c:1878 bringup_nonboot_cpus+0x2b/0x50 kernel/cpu.c:1892 smp_init+0x34/0x150 kernel/smp.c:1009 kernel_init_freeable+0x417/0x5d0 init/main.c:1569 kernel_init+0x1d/0x2b0 init/main.c:1466 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (cpu_hotplug_lock){++++}-{0:0}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 
mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(cpu_hotplug_lock); lock(fs_reclaim); rlock(cpu_hotplug_lock); *** DEADLOCK *** 1 lock held by kswapd0/89: #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 stack backtrace: CPU: 0 UID: 0 PID: 89 Comm: kswapd0 Not tainted 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Revert the change. A different fix for the race with CPU hotunplug will follow. Link: https://lkml.kernel.org/r/20250107222236.2715883-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sam Sun <samsun1006219@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12  hugetlb: fix NULL pointer dereference in trace_hugetlbfs_alloc_inode  (Muchun Song)
hugetlb_file_setup() will pass a NULL @dir to hugetlbfs_get_inode(), so we will access a NULL pointer for @dir. Fix it and set __entry->dr to 0 if @dir is NULL. Because ->i_ino cannot be 0 (see get_next_ino()), there is no confusion if the user sees a 0 inode number. Link: https://lkml.kernel.org/r/20250106033118.4640-1-songmuchun@bytedance.com Fixes: 318580ad7f28 ("hugetlbfs: support tracepoint") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Cheung Wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/linux-mm/02858D60-43C1-4863-A84F-3C76A8AF1F15@linux.dev/T/# Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Cc: cheung wall <zzqq0103.hey@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12  mm: fix div by zero in bdi_ratio_from_pages  (Stefan Roesch)
During testing it has been detected that it is possible to get a div by zero error in bdi_set_min_bytes. The error is caused by the function bdi_ratio_from_pages(). bdi_ratio_from_pages() calls global_dirty_limits. If the dirty threshold is 0, the div by zero is raised. This can happen if the root user is setting: echo 0 > /proc/sys/vm/dirty_ratio

The following is a test case:

    echo 0 > /proc/sys/vm/dirty_ratio
    cd /sys/class/bdi/<device>
    echo 1 > strict_limit
    echo 8192 > min_bytes
    ==> error is raised.

The problem is addressed by returning -EINVAL if dirty_ratio or dirty_bytes is set to 0. [shr@devkernel.io: check for -EINVAL in bdi_set_min_bytes() and bdi_set_max_bytes()] Link: https://lkml.kernel.org/r/20250108014723.166637-1-shr@devkernel.io [shr@devkernel.io: v3] Link: https://lkml.kernel.org/r/20250109063411.6591-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20250104012037.159386-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reported-by: cheung wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Qiang Zhang <zzqq0103.hey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
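A minimal sketch of the check described (return type and placement inside bdi_ratio_from_pages() are approximate):

    static int bdi_ratio_from_pages(unsigned long pages)    /* simplified */
    {
            unsigned long background_thresh, dirty_thresh;

            global_dirty_limits(&background_thresh, &dirty_thresh);
            if (!dirty_thresh)
                    return -EINVAL;    /* dirty_ratio/dirty_bytes set to 0: avoid div by zero */

            /* ... compute the ratio of pages relative to dirty_thresh ... */
            return 0;
    }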
2025-01-12  x86/execmem: fix ROX cache usage in Xen PV guests  (Juergen Gross)
The recently introduced ROX cache for modules is assuming large page support in 64-bit mode without testing the related feature bit. This results in breakage when running as a Xen PV guest, as in this mode large pages are not supported. Fix that by testing the X86_FEATURE_PSE capability when deciding whether to enable the ROX cache. Link: https://lkml.kernel.org/r/20250103065631.26459-1-jgross@suse.com Fixes: 2e45474ab14f ("execmem: add support for cache of large ROX pages") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
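A hedged sketch of the gating described (the helper standing in for the real execmem setup is hypothetical; only the feature check is taken from the changelog):

    /* only advertise the large-page ROX cache when PSE is actually available,
     * which it is not when running as a Xen PV guest */
    if (boot_cpu_has(X86_FEATURE_PSE))
            enable_rox_cache();    /* hypothetical stand-in for the real setup */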
2025-01-12  filemap: avoid truncating 64-bit offset to 32 bits  (Marco Nelissen)
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a 64-bit value to 32 bits, leading to a possible infinite loop when writing to an xfs filesystem. Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data") Signed-off-by: Marco Nelissen <marco.nelissen@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>