summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-09-05docs: netdev: document guidance on cleanup.hJakub Kicinski
Document what was discussed multiple times on list and various virtual / in-person conversations. guard() being okay in functions <= 20 LoC is a bit of my own invention. If the function is trivial it should be fine, but feel free to disagree :) We'll obviously revisit this guidance as time passes and we and other subsystems get more experience. Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20240830171443.3532077-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-05HID: picoLCD: Use backlight power constantsThomas Zimmermann
Replace FB_BLANK_ constants with their counterparts from the backlight subsystem. The values are identical, so there's no change in functionality or semantics. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2024-09-05x86/bugs: Add missing NO_SSB flagDaniel Sneddon
The Moorefield and Lightning Mountain Atom processors are missing the NO_SSB flag in the vulnerabilities whitelist. This will cause unaffected parts to incorrectly be reported as vulnerable. Add the missing flag. These parts are currently out of service and were verified internally with archived documentation that they need the NO_SSB flag. Closes: https://lore.kernel.org/lkml/CAEJ9NQdhh+4GxrtG1DuYgqYhvc0hi-sKZh-2niukJ-MyFLntAA@mail.gmail.com/ Reported-by: Shanavas.K.S <shanavasks@gmail.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240829192437.4074196-1-daniel.sneddon@linux.intel.com
2024-09-05drm/panthor: flush FW AS caches in slow reset pathAdrián Larumbe
In the off-chance that waiting for the firmware to signal its booted status timed out in the fast reset path, one must flush the cache lines for the entire FW VM address space before reloading the regions, otherwise stale values eventually lead to a scheduler job timeout. Fixes: 647810ec2476 ("drm/panthor: Add the MMU/VM logical block") Cc: stable@vger.kernel.org Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com> Acked-by: Liviu Dudau <liviu.dudau@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240902130237.3440720-1-adrian.larumbe@collabora.com
2024-09-05drm: panel: nv3052c: Correct WL-355608-A8 panel compatibleRyan Walklin
As per the previous dt-binding commit, update the WL-355608-A8 panel compatible to reflect the the integrating device vendor and name as the panel OEM is unknown. Fixes: 62ea2eeba7bf ("drm: panel: nv3052c: Add WL-355608-A8 panel") Signed-off-by: Ryan Walklin <ryan@testtoast.com> Signed-off-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240904012456.35429-3-ryan@testtoast.com
2024-09-05dt-bindings: display: panel: Rename WL-355608-A8 panel to rg35xx-*-panelRyan Walklin
The WL-355608-A8 is a 3.5" 640x480@60Hz RGB LCD display from an unknown OEM used in a number of handheld gaming devices made by Anbernic. Previously committed using the OEM serial without a vendor prefix, however following subsequent discussion the preference is to use the integrating device vendor and name where the OEM is unknown. There are 4 RG35XX series devices from Anbernic based on an Allwinner H700 SoC using this panel, with the -Plus variant introduced first. Therefore the -Plus is used as the fallback for the subsequent -H, -2024, and -SP devices. Alter the filename and compatible string to reflect the convention. Fixes: 45b888a8980a ("dt-bindings: display: panel: Add WL-355608-A8 panel") Signed-off-by: Ryan Walklin <ryan@testtoast.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Maxime Ripard <mripard@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240904012456.35429-2-ryan@testtoast.com
2024-09-05gpio: sama5d2-piobu: convert comma to semicolonChen Ni
Replace comma between expressions with semicolons. Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it is seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Link: https://lore.kernel.org/r/20240905024245.1642989-1-nichen@iscas.ac.cn Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2024-09-05drm/panthor: Restrict high priorities on group_createMary Guillemard
We were allowing any users to create a high priority group without any permission checks. As a result, this was allowing possible denial of service. We now only allow the DRM master or users with the CAP_SYS_NICE capability to set higher priorities than PANTHOR_GROUP_PRIORITY_MEDIUM. As the sole user of that uAPI lives in Mesa and hardcode a value of MEDIUM [1], this should be safe to do. Additionally, as those checks are performed at the ioctl level, panthor_group_create now only check for priority level validity. [1]https://gitlab.freedesktop.org/mesa/mesa/-/blob/f390835074bdf162a63deb0311d1a6de527f9f89/src/gallium/drivers/panfrost/pan_csf.c#L1038 Signed-off-by: Mary Guillemard <mary.guillemard@collabora.com> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") Cc: stable@vger.kernel.org Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240903144955.144278-2-mary.guillemard@collabora.com
2024-09-05tools: hv: rm .*.cmd when make cleanzhang jiao
rm .*.cmd when make clean Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Link: https://lore.kernel.org/r/20240902042103.5867-1-zhangjiao2@cmss.chinamobile.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240902042103.5867-1-zhangjiao2@cmss.chinamobile.com>
2024-09-05x86/hyperv: fix kexec crash due to VP assist page corruptionAnirudh Rayabharam (Microsoft)
commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") introduces a new cpuhp state for hyperv initialization. cpuhp_setup_state() returns the state number if state is CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states. For the hyperv case, since a new cpuhp state was introduced it would return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and so hv_cpu_die() won't be called on all CPUs. This means the VP assist page won't be reset. When the kexec kernel tries to setup the VP assist page again, the hypervisor corrupts the memory region of the old VP assist page causing a panic in case the kexec kernel is using that memory elsewhere. This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec panic/hang issues"). Get rid of hyperv_init_cpuhp entirely since we are no longer using a dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with cpuhp_remove_state(). Cc: stable@vger.kernel.org Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20240828112158.3538342-1-anirudh@anirudhrb.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240828112158.3538342-1-anirudh@anirudhrb.com>
2024-09-05arm64: dts: amlogic: gxlx-s905l-p271: drop saradc gxlx compatibleNeil Armstrong
Drop the undocumented amlogic,meson-gxlx-saradc compatible by dropping the compatible override, and fix the following DTBs check: /soc/bus@c1100000/adc@8680: failed to match any schema with compatible: ['amlogic,meson-gxlx-saradc', 'amlogic,meson-saradc'] Fixes: f6386b5afa81 ("arm64: dts: meson: add GXLX/S905L/p271 support") Link: https://lore.kernel.org/r/20240905-topic-amlogic-upstream-gxlx-drop-iio-compat-v2-1-7a690eb95bc2@linaro.org [narmstrong: fix commit message] Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
2024-09-05wifi: rtw89: avoid reading out of bounds when loading TX power FW elementsZong-Zhe Yang
Because the loop-expression will do one more time before getting false from cond-expression, the original code copied one more entry size beyond valid region. Fix it by moving the entry copy to loop-body. Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://patch.msgid.link/20240902015803.20420-1-pkshih@realtek.com
2024-09-05wifi: rtw89: use frequency domain RSSIEric Huang
To get more accurate RSSI, this commit includes frequency domain RSSI info in RSSI calculation. Add correspond physts parsing and macro to get frequency domain RSSI information for supported IC. Signed-off-by: Eric Huang <echuang@realtek.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://patch.msgid.link/20240828055217.10263-3-pkshih@realtek.com
2024-09-05wifi: rtw89: adjust DIG threshold to reduce false alarmEric Huang
Use RSSI without subtracting offset as packet detection lower bound and set an absolute minimal threshold. It's equivalent to setting a higher noise floor, thereby reducing false alarm and improving interference endurance. Signed-off-by: Eric Huang <echuang@realtek.com> Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://patch.msgid.link/20240828055217.10263-2-pkshih@realtek.com
2024-09-04Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== ice: fix synchronization between .ndo_bpf() and reset Larysa Zaremba says: PF reset can be triggered asynchronously, by tx_timeout or by a user. With some unfortunate timings both ice_vsi_rebuild() and .ndo_bpf will try to access and modify XDP rings at the same time, causing system crash. The first patch factors out rtnl-locked code from VSI rebuild code to avoid deadlock. The following changes lock rebuild and .ndo_bpf() critical sections with an internal mutex as well and provide complementary fixes. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: do not bring the VSI up, if it was down before the XDP setup ice: remove ICE_CFG_BUSY locking from AF_XDP code ice: check ICE_VSI_DOWN under rtnl_lock when preparing for reset ice: check for XDP rings instead of bpf program when unconfiguring ice: protect XDP configuration with a mutex ice: move netif_queue_set_napi to rtnl-protected sections ==================== Link: https://patch.msgid.link/20240903183034.3530411-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04Merge tag 'wireless-next-2024-09-04' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== pull-request: wireless-next-2024-09-04 here's a pull request to net-next tree, more info below. Please let me know if there are any problems. ==================== Conflicts: drivers/net/wireless/ath/ath12k/hw.c 38055789d151 ("wifi: ath12k: use 128 bytes aligned iova in transmit path for WCN7850") 8be12629b428 ("wifi: ath12k: restore ASPM for supported hardwares only") https://lore.kernel.org/87msldyj97.fsf@kernel.org Link: https://patch.msgid.link/20240904153205.64C11C4CEC2@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04Merge tag 'wireless-2024-09-04' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Kalle Valo says: ==================== wireless fixes for v6.11 Hopefully final fixes for v6.11 and this time only fixes to ath11k driver. We need to revert hibernation support due to reported regressions and we have a fix for kernel crash introduced in v6.11-rc1. * tag 'wireless-2024-09-04' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: MAINTAINERS: wifi: cw1200: add net-cw1200.h Revert "wifi: ath11k: support hibernation" Revert "wifi: ath11k: restore country code during resume" wifi: ath11k: fix NULL pointer dereference in ath11k_mac_get_eirp_power() ==================== Link: https://patch.msgid.link/20240904135906.5986EC4CECA@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04net: cadence: macb: Enable software IRQ coalescing by defaultSean Anderson
This NIC doesn't have hardware IRQ coalescing. Under high load, interrupts can adversely affect performance. To mitigate this, enable software IRQ coalescing by default. On my system this increases receive throughput with iperf3 from 853 MBit/sec to 934 MBit/s, decreases interrupts from 69489/sec to 2016/sec, and decreases CPU utilization from 27% (4x Cortex-A53) to 14%. Latency is not affected (as far as I can tell). Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240903184912.4151926-1-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04Merge branch 'fix-accessing-first-syscall-argument-on-rv64'Andrii Nakryiko
Pu Lehui says: ==================== Fix accessing first syscall argument on RV64 On RV64, as Ilya mentioned before [0], the first syscall parameter should be accessed through orig_a0 (see arch/riscv64/include/asm/syscall.h), otherwise it will cause selftests like bpf_syscall_macro, vmlinux, test_lsm, etc. to fail on RV64. Link: https://lore.kernel.org/bpf/20220209021745.2215452-1-iii@linux.ibm.com [0] v3: - Fix test case error. v2: https://lore.kernel.org/all/20240831023646.1558629-1-pulehui@huaweicloud.com/ - Access first syscall argument with CO-RE direct read. (Andrii) v1: https://lore.kernel.org/all/20240829133453.882259-1-pulehui@huaweicloud.com/ ==================== Link: https://lore.kernel.org/r/20240831041934.1629216-1-pulehui@huaweicloud.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-09-04net: xilinx: axienet: Fix race in axienet_stopSean Anderson
axienet_dma_err_handler can race with axienet_stop in the following manner: CPU 1 CPU 2 ====================== ================== axienet_stop() napi_disable() axienet_dma_stop() axienet_dma_err_handler() napi_disable() axienet_dma_stop() axienet_dma_start() napi_enable() cancel_work_sync() free_irq() Fix this by setting a flag in axienet_stop telling axienet_dma_err_handler not to bother doing anything. I chose not to use disable_work_sync to allow for easier backporting. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Fixes: 8a3b7a252dca ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver") Link: https://patch.msgid.link/20240903175141.4132898-1-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04libbpf: Fix accessing first syscall argument on RV64Pu Lehui
On RV64, as Ilya mentioned before [0], the first syscall parameter should be accessed through orig_a0 (see arch/riscv64/include/asm/syscall.h), otherwise it will cause selftests like bpf_syscall_macro, vmlinux, test_lsm, etc. to fail on RV64. Let's fix it by using the struct pt_regs style CO-RE direct access. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220209021745.2215452-1-iii@linux.ibm.com [0] Link: https://lore.kernel.org/bpf/20240831041934.1629216-5-pulehui@huaweicloud.com
2024-09-04selftests/bpf: Enable test_bpf_syscall_macro: Syscall_arg1 on s390 and arm64Pu Lehui
Considering that CO-RE direct read access to the first system call argument is already available on s390 and arm64, let's enable test_bpf_syscall_macro:syscall_arg1 on these architectures. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240831041934.1629216-4-pulehui@huaweicloud.com
2024-09-04libbpf: Access first syscall argument with CO-RE direct read on arm64Pu Lehui
Currently PT_REGS_PARM1 SYSCALL(x) is consistent with PT_REGS_PARM1_CORE SYSCALL(x), which will introduce the overhead of BPF_CORE_READ(), taking into account the read pt_regs comes directly from the context, let's use CO-RE direct read to access the first system call argument. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/bpf/20240831041934.1629216-3-pulehui@huaweicloud.com
2024-09-04libbpf: Access first syscall argument with CO-RE direct read on s390Pu Lehui
Currently PT_REGS_PARM1 SYSCALL(x) is consistent with PT_REGS_PARM1_CORE SYSCALL(x), which will introduce the overhead of BPF_CORE_READ(), taking into account the read pt_regs comes directly from the context, let's use CO-RE direct read to access the first system call argument. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240831041934.1629216-2-pulehui@huaweicloud.com
2024-09-04net: phy: Check for read errors in SIOCGMIIREGNiklas Söderlund
When reading registers from the PHY using the SIOCGMIIREG IOCTL any errors returned from either mdiobus_read() or mdiobus_c45_read() are ignored, and parts of the returned error is passed as the register value back to user-space. For example, if mdiobus_c45_read() is used with a bus that do not implement the read_c45() callback -EOPNOTSUPP is returned. This is however directly stored in mii_data->val_out and returned as the registers content. As val_out is a u16 the error code is truncated and returned as a plausible register value. Fix this by first checking the return value for errors before returning it as the register content. Before this patch, # phytool read eth0/0:1/0 0xffa1 After this change, $ phytool read eth0/0:1/0 error: phy_read (-95) Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Tested-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://patch.msgid.link/20240903171536.628930-1-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04pds_core: Remove redundant null pointer checksLi Zetao
Since the debugfs_create_dir() never returns a null pointer, checking the return value for a null pointer is redundant, and using IS_ERR is safe enough. Signed-off-by: Li Zetao <lizetao1@huawei.com> Link: https://patch.msgid.link/20240903143343.2004652-1-lizetao1@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04ionic: Remove redundant null pointer checks in ionic_debugfs_add_qcq()Li Zetao
Since the debugfs_create_dir() never returns a null pointer, checking the return value for a null pointer is redundant, and using IS_ERR is safe enough. Signed-off-by: Li Zetao <lizetao1@huawei.com> Link: https://patch.msgid.link/20240903143149.2004530-1-lizetao1@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04Merge branch 'unmask-upper-dscp-bits-part-3'Jakub Kicinski
Ido Schimmel says: ==================== Unmask upper DSCP bits - part 3 tl;dr - This patchset continues to unmask the upper DSCP bits in the IPv4 flow key in preparation for allowing IPv4 FIB rules to match on DSCP. No functional changes are expected. The TOS field in the IPv4 flow key ('flowi4_tos') is used during FIB lookup to match against the TOS selector in FIB rules and routes. It is currently impossible for user space to configure FIB rules that match on the DSCP value as the upper DSCP bits are either masked in the various call sites that initialize the IPv4 flow key or along the path to the FIB core. In preparation for adding a DSCP selector to IPv4 and IPv6 FIB rules, we need to make sure the entire DSCP value is present in the IPv4 flow key. This patchset continues to unmask the upper DSCP bits, but this time in the output route path, specifically in the callers of ip_route_output_ports(). The next patchset (last) will handle the callers of ip_route_output_key(). Split from this patchset to avoid going over the 15 patches limit. No functional changes are expected as commit 1fa3314c14c6 ("ipv4: Centralize TOS matching") moved the masking of the upper DSCP bits to the core where 'flowi4_tos' is matched against the TOS selector. ==================== Link: https://patch.msgid.link/20240903135327.2810535-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04ipv6: sit: Unmask upper DSCP bits in ipip6_tunnel_bind_dev()Ido Schimmel
Unmask the upper DSCP bits when calling ip_route_output_ports() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240903135327.2810535-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04ip6_tunnel: Unmask upper DSCP bits in ip4ip6_err()Ido Schimmel
Unmask the upper DSCP bits when calling ip_route_output_ports() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240903135327.2810535-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04ipv4: ipmr: Unmask upper DSCP bits in ipmr_queue_xmit()Ido Schimmel
Unmask the upper DSCP bits when calling ip_route_output_ports() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240903135327.2810535-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04ipv4: Unmask upper DSCP bits in __ip_queue_xmit()Ido Schimmel
The function is passed the full DS field in its 'tos' argument by its two callers. It then masks the upper DSCP bits using RT_TOS() when passing it to ip_route_output_ports(). Unmask the upper DSCP bits when passing 'tos' to ip_route_output_ports() so that in the future it could perform the FIB lookup according to the full DSCP value. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240903135327.2810535-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04selftests: net: convert comma to semicolonChen Ni
Replace comma between expressions with semicolons. Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it is seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240904014441.1065753-1-nichen@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04selftests/bpf: Add a selftest for x86 jit convergence issuesYonghong Song
The core part of the selftest, i.e., the je <-> jmp cycle, mimics the original sched-ext bpf program. The test will fail without the previous patch. I tried to create some cases for other potential cycles (je <-> je, jmp <-> je and jmp <-> jmp) with similar pattern to the test in this patch, but failed. So this patch only contains one test for je <-> jmp cycle. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240904221256.37389-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-04bpf, x64: Fix a jit convergence issueYonghong Song
Daniel Hodges reported a jit error when playing with a sched-ext program. The error message is: unexpected jmp_cond padding: -4 bytes But further investigation shows the error is actual due to failed convergence. The following are some analysis: ... pass4, final_proglen=4391: ... 20e: 48 85 ff test rdi,rdi 211: 74 7d je 0x290 213: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 289: 48 85 ff test rdi,rdi 28c: 74 17 je 0x2a5 28e: e9 7f ff ff ff jmp 0x212 293: bf 03 00 00 00 mov edi,0x3 Note that insn at 0x211 is 2-byte cond jump insn for offset 0x7d (-125) and insn at 0x28e is 5-byte jmp insn with offset -129. pass5, final_proglen=4392: ... 20e: 48 85 ff test rdi,rdi 211: 0f 84 80 00 00 00 je 0x297 217: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 28d: 48 85 ff test rdi,rdi 290: 74 1a je 0x2ac 292: eb 84 jmp 0x218 294: bf 03 00 00 00 mov edi,0x3 Note that insn at 0x211 is 6-byte cond jump insn now since its offset becomes 0x80 based on previous round (0x293 - 0x213 = 0x80). At the same time, insn at 0x292 is a 2-byte insn since its offset is -124. pass6 will repeat the same code as in pass4. pass7 will repeat the same code as in pass5, and so on. This will prevent eventual convergence. Passes 1-14 are with padding = 0. At pass15, padding is 1 and related insn looks like: 211: 0f 84 80 00 00 00 je 0x297 217: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 24d: 48 85 d2 test rdx,rdx The similar code in pass14: 211: 74 7d je 0x290 213: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 249: 48 85 d2 test rdx,rdx 24c: 74 21 je 0x26f 24e: 48 01 f7 add rdi,rsi ... Before generating the following insn, 250: 74 21 je 0x273 "padding = 1" enables some checking to ensure nops is either 0 or 4 where #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp))) nops = INSN_SZ_DIFF - 2 In this specific case, addrs[i] = 0x24e // from pass14 addrs[i-1] = 0x24d // from pass15 prog - temp = 3 // from 'test rdx,rdx' in pass15 so nops = -4 and this triggers the failure. To fix the issue, we need to break cycles of je <-> jmp. For example, in the above case, we have 211: 74 7d je 0x290 the offset is 0x7d. If 2-byte je insn is generated only if the offset is less than 0x7d (<= 0x7c), the cycle can be break and we can achieve the convergence. I did some study on other cases like je <-> je, jmp <-> je and jmp <-> jmp which may cause cycles. Those cases are not from actual reproducible cases since it is pretty hard to construct a test case for them. the results show that the offset <= 0x7b (0x7b = 123) should be enough to cover all cases. This patch added a new helper to generate 8-bit cond/uncond jmp insns only if the offset range is [-128, 123]. Reported-by: Daniel Hodges <hodgesd@meta.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240904221251.37109-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-04ipv4: Fix user space build failure due to header changeIdo Schimmel
RT_TOS() from include/uapi/linux/in_route.h is defined using IPTOS_TOS_MASK from include/uapi/linux/ip.h. This is problematic for files such as include/net/ip_fib.h that want to use RT_TOS() as without including both header files kernel compilation fails: In file included from ./include/net/ip_fib.h:25, from ./include/net/route.h:27, from ./include/net/lwtunnel.h:9, from net/core/dst.c:24: ./include/net/ip_fib.h: In function ‘fib_dscp_masked_match’: ./include/uapi/linux/in_route.h:31:32: error: ‘IPTOS_TOS_MASK’ undeclared (first use in this function) 31 | #define RT_TOS(tos) ((tos)&IPTOS_TOS_MASK) | ^~~~~~~~~~~~~~ ./include/net/ip_fib.h:440:45: note: in expansion of macro ‘RT_TOS’ 440 | return dscp == inet_dsfield_to_dscp(RT_TOS(fl4->flowi4_tos)); Therefore, cited commit changed linux/in_route.h to include linux/ip.h. However, as reported by David, this breaks iproute2 compilation due overlapping definitions between linux/ip.h and /usr/include/netinet/ip.h: In file included from ../include/uapi/linux/in_route.h:5, from iproute.c:19: ../include/uapi/linux/ip.h:25:9: warning: "IPTOS_TOS" redefined 25 | #define IPTOS_TOS(tos) ((tos)&IPTOS_TOS_MASK) | ^~~~~~~~~ In file included from iproute.c:17: /usr/include/netinet/ip.h:222:9: note: this is the location of the previous definition 222 | #define IPTOS_TOS(tos) ((tos) & IPTOS_TOS_MASK) Fix by changing include/net/ip_fib.h to include linux/ip.h. Note that usage of RT_TOS() should not spread further in the kernel due to recent work in this area. Fixes: 1fa3314c14c6 ("ipv4: Centralize TOS matching") Reported-by: David Ahern <dsahern@kernel.org> Closes: https://lore.kernel.org/netdev/2f5146ff-507d-4cab-a195-b28c0c9e654e@kernel.org/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20240903133554.2807343-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04l2tp: remove unneeded null check in l2tp_v2_session_get_nextJames Chapman
Commit aa92c1cec92b ("l2tp: add tunnel/session get_next helpers") uses idr_get_next APIs to iterate over l2tp session IDR lists. Sessions in l2tp_v2_session_idr always have a non-null session->tunnel pointer since l2tp_session_register sets it before inserting the session into the IDR. Therefore the null check on session->tunnel in l2tp_v2_session_get_next is redundant and can be removed. Removing the check avoids a warning from lkp. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202408111407.HtON8jqa-lkp@intel.com/ CC: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: James Chapman <jchapman@katalix.com> Acked-by: Tom Parkin <tparkin@katalix.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240903113547.1261048-1-jchapman@katalix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04net: bridge: br_fdb_external_learn_add(): always set EXT_LEARNJonas Gorski
When userspace wants to take over a fdb entry by setting it as EXTERN_LEARNED, we set both flags BR_FDB_ADDED_BY_EXT_LEARN and BR_FDB_ADDED_BY_USER in br_fdb_external_learn_add(). If the bridge updates the entry later because its port changed, we clear the BR_FDB_ADDED_BY_EXT_LEARN flag, but leave the BR_FDB_ADDED_BY_USER flag set. If userspace then wants to take over the entry again, br_fdb_external_learn_add() sees that BR_FDB_ADDED_BY_USER and skips setting the BR_FDB_ADDED_BY_EXT_LEARN flags, thus silently ignores the update. Fix this by always allowing to set BR_FDB_ADDED_BY_EXT_LEARN regardless if this was a user fdb entry or not. Fixes: 710ae7287737 ("net: bridge: Mark FDB entries that were added by user as such") Signed-off-by: Jonas Gorski <jonas.gorski@bisdn.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20240903081958.29951-1-jonas.gorski@bisdn.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04r8152: fix the firmware doesn't workHayes Wang
generic_ocp_write() asks the parameter "size" must be 4 bytes align. Therefore, write the bp would fail, if the mac->bp_num is odd. Align the size to 4 for fixing it. The way may write an extra bp, but the rtl8152_is_fw_mac_ok() makes sure the value must be 0 for the bp whose index is more than mac->bp_num. That is, there is no influence for the firmware. Besides, I check the return value of generic_ocp_write() to make sure everything is correct. Fixes: e5c266a61186 ("r8152: set bp in bulk") Signed-off-by: Hayes Wang <hayeswang@realtek.com> Link: https://patch.msgid.link/20240903063333.4502-1-hayeswang@realtek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04fou: Fix null-ptr-deref in GRO.Kuniyuki Iwashima
We observed a null-ptr-deref in fou_gro_receive() while shutting down a host. [0] The NULL pointer is sk->sk_user_data, and the offset 8 is of protocol in struct fou. When fou_release() is called due to netns dismantle or explicit tunnel teardown, udp_tunnel_sock_release() sets NULL to sk->sk_user_data. Then, the tunnel socket is destroyed after a single RCU grace period. So, in-flight udp4_gro_receive() could find the socket and execute the FOU GRO handler, where sk->sk_user_data could be NULL. Let's use rcu_dereference_sk_user_data() in fou_from_sock() and add NULL checks in FOU GRO handlers. [0]: BUG: kernel NULL pointer dereference, address: 0000000000000008 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 80000001032f4067 P4D 80000001032f4067 PUD 103240067 PMD 0 SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.216-204.855.amzn2.x86_64 #1 Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017 RIP: 0010:fou_gro_receive (net/ipv4/fou.c:233) [fou] Code: 41 5f c3 cc cc cc cc e8 e7 2e 69 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 49 89 f8 41 54 48 89 f7 48 89 d6 49 8b 80 88 02 00 00 <0f> b6 48 08 0f b7 42 4a 66 25 fd fd 80 cc 02 66 89 42 4a 0f b6 42 RSP: 0018:ffffa330c0003d08 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff93d9e3a6b900 RCX: 0000000000000010 RDX: ffff93d9e3a6b900 RSI: ffff93d9e3a6b900 RDI: ffff93dac2e24d08 RBP: ffff93d9e3a6b900 R08: ffff93dacbce6400 R09: 0000000000000002 R10: 0000000000000000 R11: ffffffffb5f369b0 R12: ffff93dacbce6400 R13: ffff93dac2e24d08 R14: 0000000000000000 R15: ffffffffb4edd1c0 FS: 0000000000000000(0000) GS:ffff93daee800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000102140001 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259) ? __die_body.cold (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:420) ? no_context (arch/x86/mm/fault.c:752) ? exc_page_fault (arch/x86/include/asm/irqflags.h:49 arch/x86/include/asm/irqflags.h:89 arch/x86/mm/fault.c:1435 arch/x86/mm/fault.c:1483) ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:571) ? fou_gro_receive (net/ipv4/fou.c:233) [fou] udp_gro_receive (include/linux/netdevice.h:2552 net/ipv4/udp_offload.c:559) udp4_gro_receive (net/ipv4/udp_offload.c:604) inet_gro_receive (net/ipv4/af_inet.c:1549 (discriminator 7)) dev_gro_receive (net/core/dev.c:6035 (discriminator 4)) napi_gro_receive (net/core/dev.c:6170) ena_clean_rx_irq (drivers/amazon/net/ena/ena_netdev.c:1558) [ena] ena_io_poll (drivers/amazon/net/ena/ena_netdev.c:1742) [ena] napi_poll (net/core/dev.c:6847) net_rx_action (net/core/dev.c:6917) __do_softirq (arch/x86/include/asm/jump_label.h:25 include/linux/jump_label.h:200 include/trace/events/irq.h:142 kernel/softirq.c:299) asm_call_irq_on_stack (arch/x86/entry/entry_64.S:809) </IRQ> do_softirq_own_stack (arch/x86/include/asm/irq_stack.h:27 arch/x86/include/asm/irq_stack.h:77 arch/x86/kernel/irq_64.c:77) irq_exit_rcu (kernel/softirq.c:393 kernel/softirq.c:423 kernel/softirq.c:435) common_interrupt (arch/x86/kernel/irq.c:239) asm_common_interrupt (arch/x86/include/asm/idtentry.h:626) RIP: 0010:acpi_idle_do_entry (arch/x86/include/asm/irqflags.h:49 arch/x86/include/asm/irqflags.h:89 drivers/acpi/processor_idle.c:114 drivers/acpi/processor_idle.c:575) Code: 8b 15 d1 3c c4 02 ed c3 cc cc cc cc 65 48 8b 04 25 40 ef 01 00 48 8b 00 a8 08 75 eb 0f 1f 44 00 00 0f 00 2d d5 09 55 00 fb f4 <fa> c3 cc cc cc cc e9 be fc ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 RSP: 0018:ffffffffb5603e58 EFLAGS: 00000246 RAX: 0000000000004000 RBX: ffff93dac0929c00 RCX: ffff93daee833900 RDX: ffff93daee800000 RSI: ffff93daee87dc00 RDI: ffff93daee87dc64 RBP: 0000000000000001 R08: ffffffffb5e7b6c0 R09: 0000000000000044 R10: ffff93daee831b04 R11: 00000000000001cd R12: 0000000000000001 R13: ffffffffb5e7b740 R14: 0000000000000001 R15: 0000000000000000 ? sched_clock_cpu (kernel/sched/clock.c:371) acpi_idle_enter (drivers/acpi/processor_idle.c:712 (discriminator 3)) cpuidle_enter_state (drivers/cpuidle/cpuidle.c:237) cpuidle_enter (drivers/cpuidle/cpuidle.c:353) cpuidle_idle_call (kernel/sched/idle.c:158 kernel/sched/idle.c:239) do_idle (kernel/sched/idle.c:302) cpu_startup_entry (kernel/sched/idle.c:395 (discriminator 1)) start_kernel (init/main.c:1048) secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:310) Modules linked in: udp_diag tcp_diag inet_diag nft_nat ipip tunnel4 dummy fou ip_tunnel nft_masq nft_chain_nat nf_nat wireguard nft_ct curve25519_x86_64 libcurve25519_generic nf_conntrack libchacha20poly1305 nf_defrag_ipv6 nf_defrag_ipv4 nft_objref chacha_x86_64 nft_counter nf_tables nfnetlink poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper mousedev psmouse button ena ptp pps_core crc32c_intel CR2: 0000000000000008 Fixes: d92283e338f6 ("fou: change to use UDP socket GRO") Reported-by: Alphonse Kurian <alkurian@amazon.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240902173927.62706-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04net: mana: Improve mana_set_channels() in low mem conditionsShradha Gupta
The mana_set_channels() function requires detaching the mana driver and reattaching it with changed channel values. During this operation if the system is low on memory, the reattach might fail, causing the network device being down. To avoid this we pre-allocate buffers at the beginning of set operation, to prevent complete network loss Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1725248734-21760-1-git-send-email-shradhagupta@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04md: Report failed arrays as broken in mdstatMateusz Kusiak
Depending on if array has personality, it is either reported as active or inactive. This patch adds third status "broken" for arrays with personality that became inoperative. The reason is end users tend to assume that "active" indicates array is operational. Add "broken" state for inoperative arrays with personality and refactor the code. Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com> Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com Signed-off-by: Song Liu <song@kernel.org>
2024-09-04bareudp: Fix device stats updates.Guillaume Nault
Bareudp devices update their stats concurrently. Therefore they need proper atomic increments. Fixes: 571912c69f0e ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/04b7b9d0b480158eb3ab4366ec80aa2ab7e41fcb.1725031794.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04Merge branch 'bpf/master' into for-6.12Tejun Heo
Pull bpf/master to receive baebe9aaba1e ("bpf: allow passing struct bpf_iter_<type> as kfunc arguments") and related changes in preparation for the DSQ iterator patchset. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04selftests/ftrace: Fix eventfs ownership testcase to find mount pointMasami Hiramatsu (Google)
Fix eventfs ownership testcase to find mount point if stat -c "%m" failed. This can happen on the system based on busybox. In this case, this will try to use the current working directory, which should be a tracefs top directory (and eventfs is mounted as a part of tracefs.) If it does not work, the test is skipped as UNRESOLVED because of the environmental problem. Fixes: ee9793be08b1 ("tracing/selftests: Add ownership modification tests for eventfs") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-09-04Merge tag 'bcachefs-2024-09-04' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: - Fix a typo in the rebalance accounting changes - BCH_SB_MEMBER_INVALID: small on disk format feature which will be needed for full erasure coding support; this is only the minimum so that 6.11 can handle future versions without barfing. * tag 'bcachefs-2024-09-04' of git://evilpiepirate.org/bcachefs: bcachefs: BCH_SB_MEMBER_INVALID bcachefs: fix rebalance accounting
2024-09-04cgroup/cpuset: Move cpu.h include to cpuset-internal.hWaiman Long
The newly created cpuset-v1.c file uses cpus_read_lock/unlock() functions which are defined in cpu.h but not included in cpuset-internal.h yet leading to compilation error under certain kernel configurations. Fix it by moving the cpu.h include from cpuset.c to cpuset-internal.h. While at it, sort the include files in alphabetic order. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408311612.mQTuO946-lkp@intel.com/ Fixes: 047b83097448 ("cgroup/cpuset: move relax_domain_level to cpuset-v1.c") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04sched_ext: Add a cgroup scheduler which uses flattened hierarchyTejun Heo
This patch adds scx_flatcg example scheduler which implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer by compounding the active weight share at each level. This flattening of hierarchy can bring a substantial performance gain when the cgroup hierarchy is nested multiple levels. in a simple benchmark using wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two apache instances competing with 2:1 weight ratio nested four level deep. However, the gain comes at the cost of not being able to properly handle thundering herd of cgroups. For example, if many cgroups which are nested behind a low priority parent cgroup wake up around the same time, they may be able to consume more CPU cycles than they are entitled to. In many use cases, this isn't a real concern especially given the performance gain. Also, there are ways to mitigate the problem further by e.g. introducing an extra scheduling layer on cgroup delegation boundaries. v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of SCX_OPS_KNOB_CGROUP_WEIGHT. v4: - Revert reference counted kptr for cgv_node as the change caused easily reproducible stalls. v3: - Updated to reflect the core API changes including ops.init/exit_task() and direct dispatch from ops.select_cpu(). Fixes and improvements including additional statistics. - Use reference counted kptr for cgv_node instead of xchg'ing against stash location. - Dropped '-p' option. v2: - Use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
2024-09-04sched_ext: Add cgroup supportTejun Heo
Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement or implement only subset of cgroup features. The implemented features can be indicated using %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor just warning messages. Dynamically changing interface visiblity caused more confusion than helping. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com>
2024-09-04sched: Introduce CONFIG_GROUP_SCHED_WEIGHTTejun Heo
sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or SCX may implement the feature, put the interface code behind the new CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED. This allows either sched class to enable the itnerface code without ading more complex CONFIG tests. When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares() is added to support later CONFIG_CGROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>