2025-01-13btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_nodeJosef Bacik
Before we were keeping all of our nodes on various lists in order to make sure everything got cleaned up correctly. We used node->lowest to indicate that node->lower was linked into the cache->leaves list. Now that we do cleanup based on the rb-tree both the list and the flag are useless, so delete them both. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify btrfs_backref_release_cache()Josef Bacik
We rely on finding all our nodes on the various lists in the backref cache, when they are all also in the rbtree. Instead just search through the rbtree and free everything. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
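The idea of freeing everything by walking the tree alone can be sketched in user space. This is only an illustration, assuming a plain binary search tree in place of the kernel rb-tree; `struct node`, `tree_insert()` and `tree_release()` are hypothetical names and not btrfs functions.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a backref node indexed by its bytenr. */
struct node {
	unsigned long long bytenr;
	struct node *left, *right;
};

/* Insert a key (duplicates are ignored); returns the root of the tree. */
static struct node *tree_insert(struct node *root, unsigned long long bytenr)
{
	if (!root) {
		struct node *n = calloc(1, sizeof(*n));

		if (n)
			n->bytenr = bytenr;
		return n;
	}
	if (bytenr < root->bytenr)
		root->left = tree_insert(root->left, bytenr);
	else if (bytenr > root->bytenr)
		root->right = tree_insert(root->right, bytenr);
	return root;
}

/*
 * Post-order walk: free both subtrees, then the node itself.
 * No auxiliary lists are consulted. Returns the number of nodes freed.
 */
static size_t tree_release(struct node *root)
{
	size_t freed;

	if (!root)
		return 0;
	freed = tree_release(root->left) + tree_release(root->right);
	free(root);
	return freed + 1;
}
```

Because every node is reachable from the root, one walk is enough, which is why separate bookkeeping lists become unnecessary for cleanup.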
2025-01-13btrfs: do not handle non-shareable roots in backref cacheJosef Bacik
Now that we handle relocation for non-shareable roots without using the backref cache, remove the ->cowonly field from the backref nodes and update the handling to throw an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't build backref tree for COW-only blocksJosef Bacik
We already determine the owner for any blocks we find when we're relocating, and for COW-only blocks (and the data reloc tree) we COW down to the block and call it good enough. However we still build a whole backref tree for them, even though we're not going to use it, and then just don't put these blocks in the cache. Rework the code to check if the block belongs to a COW-only root or the data reloc root, and then just cow down to the block, skipping the backref cache generation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove clone_backref_node() from relocationJosef Bacik
Since we no longer maintain backref cache across transactions, and this is only called when we're creating the reloc root for a newly created snapshot in the transaction critical section, we will end up doing a bunch of work that will just get thrown away when we start the transaction in the relocation loop. Delete this code as it no longer does anything for us. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify loop in select_reloc_root()Josef Bacik
We have this setup as a loop, but in reality we will never walk back up the backref tree, if we do then it's a bug. Get rid of the loop and handle the case where we have node->new_bytenr set at all. Previous check was only if node->new_bytenr != root->node->start, but if it did then we would hit the WARN_ON() and walk back up the tree. Instead we want to just return error if ->new_bytenr is set, and then do the normal updating of the node for the reloc root and carry on. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add a comment for new_bytenr in backref_cache_nodeJosef Bacik
Add a comment for this field so we know what it is used for. Previously we used it to update the backref cache, so people may mistakenly think it is useless, but in fact exists to make sure the backref cache makes sense. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the changed list for backref cacheJosef Bacik
Now that we're not updating the backref cache when we switch transids we can remove the changed list. We're going to keep the new_bytenr field because it serves as a good sanity check for the backref cache and relocation, and can prevent us from making extent tree corruption worse. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: convert BUG_ON in btrfs_reloc_cow_block() to proper error handlingJosef Bacik
This BUG_ON is meant to catch backref cache problems, but these can arise from either bugs in the backref cache or corruption in the extent tree. Fix it to be a proper error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()Hao-ran Zheng
A data race occurs when the function `insert_ordered_extent_file_extent()` and the function `btrfs_inode_safe_disk_i_size_write()` are executed concurrently. The function `insert_ordered_extent_file_extent()` reads inode->disk_i_size without holding a lock, so it can race with `btrfs_inode_safe_disk_i_size_write()` writing inode->disk_i_size, which affects the value of `modify_tree`. The specific call stack that appears during testing is as follows:

============DATA_RACE============
 btrfs_drop_extents+0x89a/0xa060 [btrfs]
 insert_reserved_file_extent+0xb54/0x2960 [btrfs]
 insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
 btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
 btrfs_finish_ordered_io+0x37/0x60 [btrfs]
 finish_ordered_fn+0x3e/0x50 [btrfs]
 btrfs_work_helper+0x9c9/0x27a0 [btrfs]
 process_scheduled_works+0x716/0xf10
 worker_thread+0xb6a/0x1190
 kthread+0x292/0x330
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30
============OTHER_INFO============
 btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
 btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
 btrfs_finish_ordered_io+0x37/0x60 [btrfs]
 finish_ordered_fn+0x3e/0x50 [btrfs]
 btrfs_work_helper+0x9c9/0x27a0 [btrfs]
 process_scheduled_works+0x716/0xf10
 worker_thread+0xb6a/0x1190
 kthread+0x292/0x330
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30
=================================

The main purpose of the check of the inode's disk_i_size is to avoid taking write locks on a btree path when we have a write at or beyond EOF, since in these cases we don't expect to find extent items in the root to drop. However, if we end up taking write locks due to a data race on disk_i_size, everything is still correct; we only add extra lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them, then everything is functionally correct as well, since if we find out we have extent items to drop and we took read locks (modify_tree set to 0), we release the path and retry with write locks. Since this data race does not affect the correctness of the function, it is a harmless data race; use data_race() when checking inode->disk_i_size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
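The argument for why the race is harmless can be modelled with a small user-space sketch. It assumes a simplified view in which a stale size only influences the first lock-mode guess; `drop_extents_sketch()` and its types are hypothetical and do not mirror the real btrfs_drop_extents() signature.

```c
#include <stdbool.h>

/* Hypothetical lock modes, recorded only for illustration. */
enum lock_mode { LOCK_READ, LOCK_WRITE };

struct drop_result {
	int attempts;          /* how many tree searches were needed */
	enum lock_mode final;  /* lock mode used by the last search */
};

/*
 * racy_disk_i_size: a value read without locking, so it may be stale.
 * has_extents: whether the search really finds extent items to drop.
 */
static struct drop_result drop_extents_sketch(unsigned long long start,
					      unsigned long long racy_disk_i_size,
					      bool has_extents)
{
	struct drop_result res = { 0, LOCK_READ };
	/* Hint only: a write at or beyond EOF is not expected to find extents. */
	bool modify_tree = start < racy_disk_i_size;

	for (;;) {
		res.attempts++;
		res.final = modify_tree ? LOCK_WRITE : LOCK_READ;
		if (has_extents && !modify_tree) {
			/* Wrong guess: drop the path and retry with write locks. */
			modify_tree = true;
			continue;
		}
		return res;
	}
}
```

Either wrong guess ends in a correct state: an unnecessary write lock costs contention only, and an insufficient read lock costs one extra search.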
2025-01-13btrfs: don't BUG_ON() in btrfs_drop_extents()Johannes Thumshirn
btrfs_drop_extents() calls BUG_ON() in case the counter of to be deleted extents is greater than 0. But all of these code paths can handle errors, so there's no need to crash the kernel. Instead WARN() that the condition has been met and gracefully bail out. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
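The general pattern of warning and bailing out instead of crashing might look like the following user-space sketch. The function name, the message text and the choice of -EINVAL are assumptions made for illustration; the commit message does not state which error code is returned.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical check: at this point no extent items should still be
 * pending deletion. Instead of aborting, report and return an error
 * that the caller can propagate.
 */
static int check_no_pending_deletes(int del_nr)
{
	if (del_nr > 0) {
		/* Stand-in for WARN(): report loudly, keep running. */
		fprintf(stderr, "warning: %d extent items still pending\n",
			del_nr);
		return -EINVAL; /* assumed error code */
	}
	return 0;
}
```

This only works when every caller already handles a negative return value, which is the precondition the commit message states.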
2025-01-13btrfs: zoned: reclaim unused zone by zone resettingNaohiro Aota
In zoned mode, a region that has been used and then freed is not reusable right after the freeing. The underlying zone needs to be reset before it can be reused. Btrfs resets a zone when it removes a block group, and then a new block group is allocated on the zones to reuse them. But this is sometimes too late to keep up with the write side. This commit introduces a new space-info reclaim method, ZONE_RESET. It picks a block group from the unused list and resets its zone to reuse the zone_unusable space. This is faster than removing the block group and re-creating a new block group on the same zones. For the first implementation, ZONE_RESET is only applied to a block group whose region is fully zone_unusable. Reclaiming a partially zone_unusable block group could be implemented later. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
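The eligibility rule for the first implementation can be written as a one-line predicate. This is a sketch under assumptions: `struct block_group_sketch` and its field names are hypothetical and only loosely modelled on the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a block group's accounting. */
struct block_group_sketch {
	uint64_t length;        /* total bytes covered by the block group */
	uint64_t zone_unusable; /* bytes freed but not reusable until reset */
};

/* Only a block group whose whole region is zone_unusable qualifies. */
static bool can_zone_reset(const struct block_group_sketch *bg)
{
	return bg->length > 0 && bg->zone_unusable == bg->length;
}
```

A partially zone_unusable block group is rejected here, matching the stated limitation that partial reclaim is left for later.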
2025-01-13btrfs: drop fs_info argument from btrfs_update_space_info_*()Naohiro Aota
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have the fs_info in a space_info. So, we can drop fs_info argument from btrfs_update_space_info_*. There is no behavior change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: factor out btrfs_return_free_space()Naohiro Aota
Factor out a part of unpin_extent_range() that returns space back to the space info, prioritizing global block reserve. Also, move the "len" variable into the loop to clarify we don't need to carry it beyond an iteration. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: handle FS_IOC_READ_VERITY_METADATA ioctlAllison Karlitskaya
Commit 146054090b08 ("btrfs: initial fsverity support") introduced fs-verity support for btrfs, but didn't add support for FS_IOC_READ_VERITY_METADATA to directly query the Merkle tree, descriptor and signature blocks for fs-verity enabled files. Add the (trivial) implementation: we just need to wire it through to the fs-verity code, the same way as is done in the other two filesystems which support this ioctl (ext4, f2fs). The fs-verity code already has access to the required data. This is also safe to backport to older stable trees (5.15+) if needed. Signed-off-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: send: remove redundant assignments to variable retColin Ian King
The variable ret is being initialized to zero and also later re-assigned to zero. In both cases the assignment is redundant since the value is never read after the assignment and hence they can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: use PTR_ERR() instead of PTR_ERR_OR_ZERO() for btrfs_get_extent()Qu Wenruo
The function btrfs_get_extent() will only return an ERR_PTR()-encoded error or a valid extent map pointer. It will not return NULL. Thus the usage of PTR_ERR_OR_ZERO() inside submit_one_sector() is not needed; use plain PTR_ERR() instead. That was the only usage of PTR_ERR_OR_ZERO() after btrfs_get_extent(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
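For readers unfamiliar with error pointers, here is a user-space re-creation of the helpers involved. It is modelled on the kernel's err.h but is not the kernel code itself; `get_extent_sketch()` and `submit_sketch()` are hypothetical stand-ins for a function that never returns NULL and its caller.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Maps any non-error pointer (valid or NULL) to 0. */
static inline long PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? PTR_ERR(ptr) : 0;
}

static int dummy_extent_map;

/* Hypothetical lookup: returns an error pointer or a valid pointer. */
static void *get_extent_sketch(bool fail)
{
	return fail ? ERR_PTR(-ENOMEM) : (void *)&dummy_extent_map;
}

static long submit_sketch(bool fail)
{
	void *em = get_extent_sketch(fail);

	/* IS_ERR() was already checked, so plain PTR_ERR() is enough. */
	if (IS_ERR(em))
		return PTR_ERR(em);
	return 0;
}
```

Once a caller has branched on IS_ERR(), the extra test inside PTR_ERR_OR_ZERO() is redundant, which is the point of the change.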
2025-01-13btrfs: selftests: add delayed ref self test casesJosef Bacik
The recent fix for a stupid mistake I made uncovered the fact that we don't have adequate testing in the delayed refs code, as it took a pretty extensive and long running stress test to uncover something that a unit test would have uncovered right away. Fix this by adding a delayed refs self test suite. This will validate that the btrfs_ref transformation does the correct thing, that we do the correct thing when merging delayed refs, and that we get the delayed refs in the order that we expect. These are all crucial to how the delayed refs operate. I introduced various bugs (including the original bug) into the delayed refs code to validate that these tests caught all of the shenanigans that I could think of. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move select_delayed_ref() and export itJosef Bacik
This helper is how we select the delayed ref to run once we've selected the delayed ref head. I need this exported to add a unit test for delayed refs, and its more natural home is in delayed-ref.c. Rename it to btrfs_select_delayed_ref and move it into delayed-ref.c. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13psi: Fix race when task wakes up before psi_sched_switch() adjusts flagsChengming Zhou
When running hackbench in a cgroup with bandwidth throttling enabled, the following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat, the following sequence was observed:

    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
        ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
    # CPU8 goes into newidle balance and releases the rq lock
        ...
    # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)
    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags for the blocked entity. However, with the introduction of DELAY_DEQUEUE, the blocked task can wake up when newidle balance drops the runqueue lock during __schedule(). If a task wakes before psi_sched_switch() adjusts the PSI flags, skip any modifications in psi_enqueue(), which would still see the flags of a running task and not a blocked one. Instead, rely on psi_sched_switch() to do the right thing. Since the status returned by try_to_block_task() may no longer be true by the time schedule reaches psi_sched_switch(), check if the task is blocked or not using a combination of task_on_rq_queued() and p->se.sched_delayed checks.
[ prateek: Commit message, testing, early bailout in psi_enqueue() ] Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5 Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Link: https://lore.kernel.org/r/20241227061941.2315-1-kprateek.nayak@amd.com
2025-01-13sched, psi: Don't account irq time if sched_clock_irqtime is disabledYafang Shao
sched_clock_irqtime may be disabled due to the clock source. When disabled, irq_time_read() won't change over time, so there is nothing to account. We can save iterating the whole hierarchy on every tick and context switch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20250103022409.2544-4-laoar.shao@gmail.com
2025-01-13sched: Don't account irq time if sched_clock_irqtime is disabledYafang Shao
sched_clock_irqtime may be disabled due to the clock source, in which case IRQ time should not be accounted. Let's add a conditional check to avoid unnecessary logic. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-3-laoar.shao@gmail.com
2025-01-13sched: Define sched_clock_irqtime as static keyYafang Shao
Since CPU time accounting is a performance-critical path, let's define sched_clock_irqtime as a static key to minimize potential overhead. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250103022409.2544-2-laoar.shao@gmail.com
2025-01-13sched/fair: Do not compute overloaded status unnecessarily during lbK Prateek Nayak
Only set sg_overloaded when computing sg_lb_stats() at the highest sched domain since rd->overloaded status is updated only when load balancing at the highest domain. While at it, move setting of sg_overloaded below idle_cpu() check since an idle CPU can never be overloaded. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241223043407.1611-8-kprateek.nayak@amd.com
2025-01-13sched/fair: Do not compute NUMA Balancing stats unnecessarily during lbK Prateek Nayak
Aggregate nr_numa_running and nr_preferred_running when load balancing at NUMA domains only. While at it, also move the aggregation below the idle_cpu() check since an idle CPU cannot have any preferred tasks. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20241223043407.1611-7-kprateek.nayak@amd.com
2025-01-13x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionallyK Prateek Nayak
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is enabled by the system. Without ITMT support being enabled, it returns 0 similar to current x86_die_flags() on non-Hybrid systems (!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES) On Intel systems that enable ITMT support, either the MC domain coincides with the PKG domain, or in case of multiple MC groups within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the processor features Hybrid core layout (X86_HYBRID_CPU) which leads to three distinct possibilities: o If PKG and MC domains coincide, PKG domain is degenerated by sd_parent_degenerate() when building sched domain topology. o If SNC is enabled, PKG domain is never added since "x86_has_numa_in_package" is set and the topology will instead contain NODE and NUMA domains. o On X86_HYBRID_CPU which contains multiple MC groups within the PKG, the PKG domain requires x86_sched_itmt_flags(). Thus, on Intel systems that contains multiple MC groups within the PKG and enables ITMT support, the PKG domain requires x86_sched_itmt_flags(). In all other cases PKG domain is either never added or is degenerated. Thus, returning x86_sched_itmt_flags() unconditionally at PKG domain on Intel systems should not lead to any functional changes. On AMD systems with multiple LLCs (MC groups) within a PKG domain, enabling ITMT support requires setting SD_ASYM_PACKING to the PKG domain since the core rankings are assigned PKG-wide. Core rankings on AMD processors is currently set by the amd-pstate driver when Preferred Core feature is supported. A subset of systems that support Preferred Core feature can be detected using X86_FEATURE_AMD_HETEROGENEOUS_CORES however, this does not cover all the systems that support Preferred Core ranking. Detecting Preferred Core support on AMD systems requires inspecting CPPC Highest Perf on all present CPUs and checking if it differs on at least one CPU. 
A previous suggestion to use a synthetic feature to detect Preferred Core support [1] was found to be non-trivial to implement, since the BSP alone cannot detect if Preferred Core is supported and, by the time the APs come up, alternatives are patched and setting an X86_FEATURE_* then is not possible. Since x86 processors enabling ITMT support that consist of multiple non-NUMA MC groups within a PKG require the SD_ASYM_PACKING flag set at the PKG domain, return x86_sched_itmt_flags unconditionally for the PKG domain. Since x86_die_flags() would have just returned x86_sched_itmt_flags() after the change, remove the unnecessary wrapper and pass x86_sched_itmt_flags() directly as the flags function. Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-6-kprateek.nayak@amd.com
2025-01-13x86/topology: Remove x86_smt_flags and use cpu_smt_flags directlyK Prateek Nayak
x86_*_flags() wrappers were introduced with commit d3d37d850d1d ("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add x86_sched_itmt_flags() in addition to the default domain flags for SMT and MC domain. commit 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags") removed the ITMT flags for SMT domain but not the x86_smt_flags() wrappers which directly returns cpu_smt_flags(). Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the flags for SMT domain. No functional changes intended. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-5-kprateek.nayak@amd.com
2025-01-13x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfsK Prateek Nayak
"sched_itmt_enabled" was only introduced as a debug toggle for any funky ITMT behavior. Move the toggle from the sysctl at "/proc/sys/kernel/sched_itmt_enabled" to debugfs at "/sys/kernel/debug/x86/sched_itmt_enabled", with the notable change that a cat on the file will return "Y" or "N" instead of "1" or "0" to indicate that the feature is enabled or disabled respectively. Writing either "0" or "N" (or any string that kstrtobool() interprets as false) to the file will disable the feature, and writing either "1" or "Y" (or any string that kstrtobool() interprets as true) will enable it again when the platform supports ITMT ranking. Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the toggle was moved to "/sys/kernel/debug/x86/" as opposed to "/sys/kernel/debug/sched/". Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-4-kprateek.nayak@amd.com
2025-01-13x86/itmt: Use guard() for itmt_update_mutexK Prateek Nayak
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock() in the bailout and return paths. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-3-kprateek.nayak@amd.com
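The scope-based unlocking that guard() provides rests on the compiler's cleanup attribute. The following user-space sketch shows the mechanism with a toy lock so it runs without threads; `TOY_GUARD`, `struct toy_lock` and `update_sketch()` are hypothetical names, and the macro is much simpler than the kernel's guard() infrastructure. It requires GCC or Clang.

```c
/* Toy lock: a counter stands in for a real mutex. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { l->held++; }

/* Cleanup handlers receive a pointer to the guarded variable. */
static void toy_lock_release(struct toy_lock **l) { (*l)->held--; }

/* Acquire now; release automatically when the variable leaves scope. */
#define TOY_GUARD(lockp)						\
	struct toy_lock *toy_guard_					\
		__attribute__((cleanup(toy_lock_release), unused)) =	\
		(toy_lock_acquire(lockp), (lockp))

static struct toy_lock itmt_lock_sketch;

static int update_sketch(int value)
{
	TOY_GUARD(&itmt_lock_sketch);

	if (value < 0)
		return -1; /* early bailout: no explicit unlock needed */
	return value * 2;
}
```

Both return paths release the lock without an explicit unlock call, which is what removes the extra mutex_unlock() calls from bailout paths.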
2025-01-13x86/itmt: Convert "sysctl_sched_itmt_enabled" to booleanK Prateek Nayak
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert the unsigned int to bool since debugfs readily exposes boolean fops primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can streamline the conversion. Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled" can only be 0 or 1 and this datatype conversion should not cause any functional changes. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20241223043407.1611-2-kprateek.nayak@amd.com
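How a boolean toggle maps between strings and a bool can be sketched as follows. This is a deliberately reduced stand-in, not the kernel's kstrtobool() or the debugfs helpers: it accepts only the spellings named in this series ("1"/"Y" and "0"/"N", plus lowercase), and both function names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Parse a user-supplied string into a bool; -EINVAL if unrecognised. */
static int parse_bool_sketch(const char *s, bool *res)
{
	if (!s || !res)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	default:
		return -EINVAL;
	}
}

/* Reads report 'Y' or 'N' rather than '1' or '0'. */
static char show_bool_sketch(bool value)
{
	return value ? 'Y' : 'N';
}
```

With a bool as the backing type, no range clamping to 0..1 is needed, which is the simplification the type conversion prepares for.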
2025-01-13sched/core: Prioritize migrating eligible tasks in sched_balance_rq()Hao Jia
When the PLACE_LAG scheduling feature is enabled and dst_cfs_rq->nr_queued is greater than 1, if a task is ineligible (lag < 0) on the source cpu runqueue, it will also be ineligible when it is migrated to the destination cpu runqueue, because we keep the original equivalent lag of the task in place_entity(). So if the task was ineligible before, it will still be ineligible after migration. So in sched_balance_rq(), we prioritize migrating eligible tasks, and we soft-limit ineligible tasks, allowing them to migrate only when nr_balance_failed is non-zero to avoid load-balancing trying very hard to balance the load. Below are some benchmark test results. From my test results, this patch shows a slight improvement on hackbench.

Benchmark
=========
All of the benchmarks are done inside a normal cpu cgroup in a clean environment with cpu turbo disabled. The test machine is a single NUMA machine, model 13th Gen Intel(R) Core(TM) i7-13700, 12 Core/24 HT. Based on master b86545e02e8c.

Results
=======
hackbench-process-pipes
                 vanilla               patched
Amean 1       0.5837 (  0.00%)      0.5733 (  1.77%)
Amean 4       1.4423 (  0.00%)      1.4503 ( -0.55%)
Amean 7       2.5147 (  0.00%)      2.4773 (  1.48%)
Amean 12      3.9347 (  0.00%)      3.8880 (  1.19%)
Amean 21      5.3943 (  0.00%)      5.3873 (  0.13%)
Amean 30      6.7840 (  0.00%)      6.6660 (  1.74%)
Amean 48      9.8313 (  0.00%)      9.6100 (  2.25%)
Amean 79     15.4403 (  0.00%)     14.9580 (  3.12%)
Amean 96     18.4970 (  0.00%)     17.9533 (  2.94%)

hackbench-process-sockets
                 vanilla               patched
Amean 1       0.6297 (  0.00%)      0.6223 (  1.16%)
Amean 4       2.1517 (  0.00%)      2.0887 (  2.93%)
Amean 7       3.6377 (  0.00%)      3.5670 (  1.94%)
Amean 12      6.1277 (  0.00%)      5.9290 (  3.24%)
Amean 21     10.0380 (  0.00%)      9.7623 (  2.75%)
Amean 30     14.1517 (  0.00%)     13.7513 (  2.83%)
Amean 48     24.7253 (  0.00%)     24.2287 (  2.01%)
Amean 79     43.9523 (  0.00%)     43.2330 (  1.64%)
Amean 96     54.5310 (  0.00%)     53.7650 (  1.40%)

tbench4 Throughput
                 vanilla               patched
Hmean 1       255.97 (  0.00%)      275.01 (  7.44%)
Hmean 2       511.60 (  0.00%)      544.27 (  6.39%)
Hmean 4       996.70 (  0.00%)     1006.57 (  0.99%)
Hmean 8      1646.46 (  0.00%)     1649.15 (  0.16%)
Hmean 16     2259.42 (  0.00%)     2274.35 (  0.66%)
Hmean 32     4725.48 (  0.00%)     4735.57 (  0.21%)
Hmean 64     4411.47 (  0.00%)     4400.05 ( -0.26%)
Hmean 96     4284.31 (  0.00%)     4267.39 ( -0.39%)

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Hao Jia <jiahao1@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241223091446.90208-1-jiahao.kernel@gmail.com
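The soft limit described in this commit message reduces to a small predicate. The sketch below is a simplification under assumptions: it considers only the task's lag and the balance-failure counter, leaves out the PLACE_LAG and destination-queue conditions, and uses the hypothetical name `may_migrate_sketch()`.

```c
#include <stdbool.h>

/*
 * lag >= 0 means the task is eligible. Ineligible tasks are held back
 * until at least one balance attempt has already failed.
 */
static bool may_migrate_sketch(long long lag, unsigned int nr_balance_failed)
{
	bool eligible = lag >= 0;

	return eligible || nr_balance_failed != 0;
}
```

Eligible tasks always pass, so they are naturally picked first; ineligible ones become candidates only after the balancer has struggled.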
2025-01-13sched/debug: Change need_resched warnings to pr_errDavid Rientjes
need_resched warnings, if enabled, are treated as WARNINGs. If kernel.panic_on_warn is enabled, then this causes a kernel panic. It's highly unlikely that a panic is desired for these warnings, only a stack trace is normally required to debug and resolve. Thus, switch need_resched warnings to simply be a printk with an associated stack trace so they are no longer in scope for panic_on_warn. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Josh Don <joshdon@google.com> Link: https://lkml.kernel.org/r/e8d52023-5291-26bd-5299-8bb9eb604929@google.com
2025-01-13sched/fair: Encapsulate set custom slice in a __setparam_fair() functionVincent Guittot
Similarly to dl, create a __setparam_fair() function to set parameters related to the fair class and move it into the fair.c file. No functional changes expected. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20250110144656.484601-1-vincent.guittot@linaro.org
2025-01-13sched: Fix race between yield_to() and try_to_wake_up()Tianchen Ding
We met a SCHED_WARN in set_next_buddy():

    __warn_printk
    set_next_buddy
    yield_to_task_fair
    yield_to
    kvm_vcpu_yield_to [kvm]
    ...

After a short dig, we found the rq_lock held by yield_to() may not be exactly the rq that the target task belongs to. There is a race window against try_to_wake_up().

    CPU0                                 target_task

                                         blocking on CPU1
    lock rq0 & rq1
    double check task_rq == p_rq, ok
                                         woken to CPU2 (lock task_pi & rq2)
                                         task_rq = rq2
    yield_to_task_fair (w/o lock rq2)

In this race window, yield_to() is operating the task w/o the correct lock. Fix this by taking task pi_lock first. Fixes: d95f41220065 ("sched: Add yield_to(task, preempt) functionality") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241231055020.6521-1-dtcccc@linux.alibaba.com
2025-01-13wifi: iwlwifi: mvm: rename iwl_dev_tx_power_common::mac_context_idEmmanuel Grumbach
This is becoming the link_id. Since this makes no difference on non-MLD devices, just rename to link_id for all the APIs that use the common structure. Starting from command 9, feed the link_id to the firmware instead of the mac id. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241227095718.f1155e713201.I753900d10e82f339cf9679ed403027d38dc1fd58@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: skip short statistics window when updating EMLSRBenjamin Berg
The statistics are not synchronized with the time that we enter EMLSR. This means that we can receive the statistic notification just after having cleared the counters, causing us to immediately exit EMLSR again. Fix this by checking that most of the time for the window has passed. If that is not the case, ignore this window and wait for the next notification. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241227095718.0eb0f2044535.Ic2af92737ccfc873f3b6c228704238ebb9f983ca@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
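The check that "most of the time for the window has passed" could be sketched like this. The three-quarters threshold and the function name are assumptions chosen for illustration; the commit message does not state the actual fraction.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Accept a statistics window only if at least 3/4 of its nominal
 * length has elapsed (assumed threshold). Shorter windows are skipped
 * and the caller waits for the next notification.
 */
static bool window_long_enough(uint32_t elapsed_ms, uint32_t window_ms)
{
	/* Widen to 64 bits so the multiplications cannot overflow. */
	return (uint64_t)elapsed_ms * 4 >= (uint64_t)window_ms * 3;
}
```

A notification that arrives just after the counters were cleared yields a very small elapsed time, so it is ignored instead of triggering an immediate EMLSR exit.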
2025-01-13wifi: iwlwifi: mvm: remove warning on unallocated BAIDJohannes Berg
Due to the firmware allocating the BAID, we can only install the data structure after the BAID is valid from the firmware's point of view. As a result, the firmware can start sending frame release notifications to the driver immediately. This isn't supposed to happen by protocol, since the peer STA is not expected to use the blockack session until the AddBA has a response. However, firmware doesn't know that, our RX path can't know when it was, so simply don't WARN in this case but only have a debug message. Since the BAID comes from firmware, also use IWL_FW_CHECK() instead of a warning for the validity check. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241227095718.4360f2b9e185.I447f9a5fc6dfdc78ec238200338e2da040ee7e61@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: differentiate NIC error typesJohannes Berg
Instead of differentiating only sync/async, differentiate the type of error, and document that only reset handshake timeout (IWL_ERR_TYPE_RESET_HS_TIMEOUT) needs sync handling. The special sync handling is somewhat temporary, the idea is to later split the nic_error() method into error dump, synchronizing the dump, and SW reset methods, and the type is mostly in order to unify command queue full handling into that new architecture as well. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241227095718.aed9c9e4fac0.I2288042bec4728a75b61cb7f6ded5214bfa3ce85@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: mac80211: Remove unused ieee80211_smps_is_restrictiveDr. David Alan Gilbert
The last use of ieee80211_smps_is_restrictive() was removed in 2020 by commit 52b4810bed83 ("mac80211: Remove support for changing AP SMPS mode"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20241226170119.108947-1-linux@treblig.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: Move TSO code to shared utilityDaniel Gabay
Move TSO segment logic from mvm to the iwlwifi level, as this code is not opmode-dependent and can be shared with the mld driver. Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Link: https://patch.msgid.link/20250102163748.56efefb9566e.Ib7188572f18afb31840d193a348c17c9b292c7af@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: add UHB canada support in GET_TAS_STATUS cmd respAnjaneyulu
Dump whether UHB Canada is enabled or not, based on firmware capability. Signed-off-by: Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.dfd6b8893322.I196393dc3c9c28882f90b43a821a2d76a5c9a046@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: add UHB canada support in TAS_CONFIG cmdAnjaneyulu
Extend TAS table support to revision 2 to get the UHB Canada enablement from the BIOS and send it to the firmware via the TAS_CONFIG cmd, based on firmware capability. While at it, fix the kernel-doc for struct iwl_tas_config_cmd_v4. Signed-off-by: Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.0b1d92ad59b8.Ib80f8514a64fc2800a2a20131e730c2bd9c4c4af@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: Use IWL_FW_CHECK() for BAR notif size validationDaniel Gabay
Use IWL_FW_CHECK() for BAR notification size validation, improving diagnostics with a clear error message on failure. Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.913d5d476929.I8cd62f45bacc088c309b0152fc392dc2579e82e0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: pcie: Add support for new device idsSomashekhar(Som)
Add support for new device-ids 0x2730 and 0x272F. Signed-off-by: Somashekhar(Som) <somashekhar.puttagangaiah@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20241226174257.6a0db60436e7.I50a66544dde6c88acd9abe4b31badab96ef04cfc@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: add a new NMI typeEmmanuel Grumbach
0x88 is not a regular firmware crash but a PREG NMI, which means that we accessed a place we're not supposed to. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.596dfc97f6b1.Iec765d5fe12ac74c6ee0035e9cb62b98c11639cb@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: Check BAR packet size before accessing dataDaniel Gabay
Validate the BAR frame release size before using its fields to avoid potential invalid memory access. Signed-off-by: Daniel Gabay <daniel.gabay@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.72161a6c07c3.I4887bad2355213b201fca2da1836c9a3203ab42d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
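The general pattern of checking a notification's size before touching its fields might look like the sketch below. The struct layout and the function name `parse_bar_release()` are hypothetical and chosen only for illustration; they are not the driver's actual definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical notification layout, for illustration only. */
struct bar_release_notif {
	uint32_t sta_tid;
	uint32_t ba_info;
};

/*
 * Copy the notification out of a raw payload, but only after confirming
 * that the payload is large enough to hold the whole structure. Returns
 * false (and leaves *out untouched) when the payload is too short.
 */
static bool parse_bar_release(const uint8_t *payload, size_t payload_len,
			      struct bar_release_notif *out)
{
	if (payload_len < sizeof(*out))
		return false; /* too short: do not read any field */

	memcpy(out, payload, sizeof(*out));
	return true;
}
```

The point of the ordering is that no field is read until the length check has passed, so a truncated payload cannot cause an out-of-bounds read.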
2025-01-13wifi: iwlwifi: support BIOS override for 5G9 in CA also in LARI version 8Miri Korenblit
Commit 6b3e87cc0ca5 ("iwlwifi: Add support for LARI_CONFIG_CHANGE_CMD cmd v9") added a few bits to iwl_lari_config_change_cmd::oem_unii4_allow_bitmap if the FW has LARI version >= 9. But we also need to send those bits for version 8 if the FW is capable of this feature (indicated with capability bits). Add the FW capability bit, and set the additional bits in the cmd when the version is 8 and the FW capability bit is set. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20241226174257.dc5836f84514.I1e38f94465a36731034c94b9811de10cb6ee5921@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
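The gating described above, where the extra bits are sent either for a new enough command version or for version 8 with a capability bit, could be sketched as follows. The mask values and the function names are hypothetical placeholders, not the driver's real definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit masks, for illustration only. */
#define UNII4_BASE_BITS   0x03u /* bits every supported version may send */
#define UNII4_EXTRA_BITS  0x0Cu /* bits originally tied to version >= 9 */

/* Select which bits of the BIOS value may be passed to the firmware. */
static uint32_t unii4_allowed_mask(uint8_t cmd_ver, bool fw_has_capa)
{
	uint32_t mask = UNII4_BASE_BITS;

	/* Version 9 and later always take the extra bits; version 8 only
	 * when the firmware advertises the capability. */
	if (cmd_ver >= 9 || (cmd_ver == 8 && fw_has_capa))
		mask |= UNII4_EXTRA_BITS;

	return mask;
}

/* Drop any BIOS bits that the firmware would not understand. */
static uint32_t filter_bios_bitmap(uint32_t bios_value, uint8_t cmd_ver,
				   bool fw_has_capa)
{
	return bios_value & unii4_allowed_mask(cmd_ver, fw_has_capa);
}
```

Under these assumptions, a version 8 firmware without the capability bit still receives only the base bits, so behavior for older firmware is unchanged.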
2025-01-13wifi: iwlwifi: support BIOS override for UNII4 in CA/US also in LARI versions < 12Miri Korenblit
Commit ef7ddf4e2f94 ("iwlwifi: Add support for LARI_CONFIG_CHANGE_CMD v12") added a few bits to iwl_lari_config_change_cmd::chan_state_active_bitmap if the FW has LARI version >= 12. But we also need to send those bits for version 8-11 if the FW is capable of this feature (indicated with capability bits). Add the FW capability bit, and set the additional bits in the cmd when the version is 8 and the FW capability bit is set. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20241226174257.672651ad849c.I67a00d9544c48ad964f8e998ebe8c168071c3d01@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: bump FW API to 95 for BZ/SC devicesMiri Korenblit
Start supporting API version 95 for new devices. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.d5b73c1e9e17.I121e155b0c1fdfb7fbac934bb2f84fe0e1d13ba0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-01-13wifi: iwlwifi: mvm: log error for failures after D3Benjamin Berg
We only logged an error in the fast resume path. However, as the hardware is being restarted it makes sense to log an error to make it easier to understand what is happening. Add a new error message into the normal resume path and update the error in the fast resume path to match. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20241226174257.df1e451d4928.Ibe286bc010ad7fecebba5650097e16ed22a654e4@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>