path: root/drivers
2024-11-12  leds: pwm: Add optional DT property default-brightness  [George Stark]
If the default LED state is on at probe time, the default brightness is applied instead of the max brightness. Signed-off-by: George Stark <gnstark@salutedevices.com> Link: https://lore.kernel.org/r/20241105185006.1380166-3-gnstark@salutedevices.com Signed-off-by: Lee Jones <lee@kernel.org>
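A minimal C sketch of how such an optional property might be handled at probe time (the helper name and the structure access are illustrative, not the actual patch):

	#include <linux/leds.h>
	#include <linux/minmax.h>
	#include <linux/property.h>

	/* Hypothetical helper: pick the brightness to apply when the LED is
	 * configured to be on at probe time; fall back to max_brightness
	 * when "default-brightness" is absent. */
	static unsigned int led_pwm_default_brightness(struct device *dev,
						       struct led_classdev *cdev)
	{
		u32 val;

		if (device_property_read_u32(dev, "default-brightness", &val))
			return cdev->max_brightness;

		return min_t(unsigned int, val, cdev->max_brightness);
	}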
2024-11-12  intel-hid: fix volume buttons on Thinkpad X12 Detachable Tablet Gen 1  [Cole Stowell]
Volume buttons on Lenovo Thinkpad X12 Detachable Tablet Gen 1 did not send any input events when pressed. When loading intel-hid with the 5 Button Array explicitly enabled, the buttons functioned normally. Adds the X12 Detachable Tablet Gen 1 to the `button_array_table`. However, the driver is unable to call INTEL_HID_DSM_BTNE_FN and prints the warning "failed to set button capability" when attempting to enable or disable the 5 Button Array. The warning should be harmless and adding more special handling to avoid it is not worth it. Co-developed-by: Mary Strodl <mstrodl@csh.rit.edu> Signed-off-by: Mary Strodl <mstrodl@csh.rit.edu> Signed-off-by: Cole Stowell <cole@stowell.pro> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20241107205908.69279-1-cole@stowell.pro Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-11-12  platform/x86: asus-wmi: Use platform_profile_cycle()  [Armin Wolf]
Replace throttle_thermal_policy_switch_next() with platform_profile_cycle() to reduce code duplication and avoid a potential race condition with the platform profile handler. Suggested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Armin Wolf <W_Armin@gmx.de> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20241107003811.615574-3-W_Armin@gmx.de Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-11-12  platform/x86: asus-wmi: Fix inconsistent use of thermal policies  [Armin Wolf]
When changing the thermal policy using the platform profile API, a Vivobook thermal policy is stored in throttle_thermal_policy_mode. However, everywhere else a normal thermal policy is stored inside this variable, potentially confusing the platform profile. Fix this by always storing normal thermal policy values inside throttle_thermal_policy_mode and only doing the conversion when writing the thermal policy to hardware. This also fixes the order in which throttle_thermal_policy_switch_next() steps through the thermal modes on Vivobook machines. Tested-by: Casey G Bowman <casey.g.bowman@intel.com> Fixes: bcbfcebda2cb ("platform/x86: asus-wmi: add support for vivobook fan profiles") Signed-off-by: Armin Wolf <W_Armin@gmx.de> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20241107003811.615574-2-W_Armin@gmx.de Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-11-12  platform/x86: hp: hp-bioscfg: remove redundant if statement  [Colin Ian King]
The if statement performs the same action whether the strcmp result is true or false, since there is identical code on both branches. The if statement is redundant and can be replaced with just one call to sysfs_remove_group. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20241107113543.17137-1-colin.i.king@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
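An illustrative sketch of this kind of simplification (names are placeholders, not the driver's actual identifiers):

	#include <linux/string.h>
	#include <linux/sysfs.h>

	/* Before: both branches call sysfs_remove_group() with the same
	 * arguments, so the strcmp() decides nothing. */
	static void cleanup_before(struct kobject *kobj, const char *name,
				   const struct attribute_group *grp)
	{
		if (!strcmp(name, "some-attribute"))
			sysfs_remove_group(kobj, grp);
		else
			sysfs_remove_group(kobj, grp);
	}

	/* After: the comparison is dropped and a single call remains. */
	static void cleanup_after(struct kobject *kobj,
				  const struct attribute_group *grp)
	{
		sysfs_remove_group(kobj, grp);
	}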
2024-11-12  HID: hyperv: streamline driver probe to avoid devres issues  [Vitaly Kuznetsov]
It was found that unloading 'hid_hyperv' module results in a devres complaint: ... hv_vmbus: unregistering driver hid_hyperv ------------[ cut here ]------------ WARNING: CPU: 2 PID: 3983 at drivers/base/devres.c:691 devres_release_group+0x1f2/0x2c0 ... Call Trace: <TASK> ? devres_release_group+0x1f2/0x2c0 ? __warn+0xd1/0x1c0 ? devres_release_group+0x1f2/0x2c0 ? report_bug+0x32a/0x3c0 ? handle_bug+0x53/0xa0 ? exc_invalid_op+0x18/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? devres_release_group+0x1f2/0x2c0 ? devres_release_group+0x90/0x2c0 ? rcu_is_watching+0x15/0xb0 ? __pfx_devres_release_group+0x10/0x10 hid_device_remove+0xf5/0x220 device_release_driver_internal+0x371/0x540 ? klist_put+0xf3/0x170 bus_remove_device+0x1f1/0x3f0 device_del+0x33f/0x8c0 ? __pfx_device_del+0x10/0x10 ? cleanup_srcu_struct+0x337/0x500 hid_destroy_device+0xc8/0x130 mousevsc_remove+0xd2/0x1d0 [hid_hyperv] device_release_driver_internal+0x371/0x540 driver_detach+0xc5/0x180 bus_remove_driver+0x11e/0x2a0 ? __mutex_unlock_slowpath+0x160/0x5e0 vmbus_driver_unregister+0x62/0x2b0 [hv_vmbus] ... And the issue seems to be that the corresponding devres group is not allocated. Normally, devres_open_group() is called from __hid_device_probe() but Hyper-V HID driver overrides 'hid_dev->driver' with 'mousevsc_hid_driver' stub and basically re-implements __hid_device_probe() by calling hid_parse() and hid_hw_start() but not devres_open_group(). hid_device_probe() does not call __hid_device_probe() for it. Later, when the driver is removed, hid_device_remove() calls devres_release_group() as it doesn't check whether hdev->driver was initially overridden or not. The issue seems to be related to the commit 62c68e7cee33 ("HID: ensure timely release of driver-allocated resources") but the commit itself seems to be correct. Fix the issue by dropping the 'hid_dev->driver' override and using hid_register_driver()/hid_unregister_driver() instead. Alternatively, it would have been possible to rely on the default handling but HID_CONNECT_DEFAULT implies HID_CONNECT_HIDRAW and it doesn't seem to work for mousevsc as-is. Fixes: 62c68e7cee33 ("HID: ensure timely release of driver-allocated resources") Suggested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2024-11-12  drm/i915: Grab intel_display from the encoder to avoid potential oopsies  [Ville Syrjälä]
Grab the intel_display from 'encoder' rather than 'state' in the encoder hooks to avoid the massive footgun that is intel_sanitize_encoder(), which passes NULL as the 'state' argument to encoder .disable() and .post_disable(). TODO: figure out how to actually fix intel_sanitize_encoder()... Fixes: ab0b0eb5c85c ("drm/i915/tv: convert to struct intel_display") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107161123.16269-2-ville.syrjala@linux.intel.com Reviewed-by: Jani Nikula <jani.nikula@intel.com> (cherry picked from commit dc3806d9eb66d0105f8d55d462d4ef681d9eac59) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2024-11-12  gpiolib: cdev: use !mem_is_zero() instead of memchr_inv(s, 0, n)  [Andy Shevchenko]
Use the mem_is_zero() helper where possible. Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20241110201706.16614-1-andy.shevchenko@gmail.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
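For reference, a small sketch of the equivalent checks (buf/len are placeholders); mem_is_zero() states the intent of !memchr_inv(..., 0, ...) directly:

	#include <linux/string.h>

	static bool padding_is_clear(const void *buf, size_t len)
	{
		/* Old pattern: the range is all-zero when memchr_inv()
		 * finds no non-zero byte. */
		/* return !memchr_inv(buf, 0, len); */

		/* New helper expresses the same check directly. */
		return mem_is_zero(buf, len);
	}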
2024-11-12  mtd: spi-nor: core: replace dummy buswidth from addr to data  [Cheng Ming Lin]
The default dummy cycle count for Macronix SPI NOR flash in Octal Output Read mode (1-1-8) is 20. Currently, the dummy buswidth is set according to the address bus width. In the 1-1-8 mode, this means the dummy buswidth is 1. When converting dummy cycles to bytes, this results in 20 x 1 / 8 = 2 bytes, causing the host to read data 4 cycles too early. Since the protocol data buswidth is always greater than or equal to the address buswidth, setting the dummy buswidth to match the data buswidth increases the likelihood that the dummy cycle-to-byte conversion will be divisible, preventing the host from reading data prematurely. Fixes: 0e30f47232ab ("mtd: spi-nor: add support for DTR protocol") Cc: stable@vger.kernel.org Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Cheng Ming Lin <chengminglin@mxic.com.tw> Link: https://lore.kernel.org/r/20241112075242.174010-2-linchengming884@gmail.com Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
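A worked example of the cycles-to-bytes conversion described above (nbytes = ncycles * buswidth / 8), using the 20 dummy cycles from the text:

	unsigned int ncycles = 20;

	/* Address buswidth (1 line): 20 * 1 / 8 = 2 bytes, which only
	 * clocks 16 cycles, so the host samples data 4 cycles too early. */
	unsigned int nbytes_addr = ncycles * 1 / 8;	/* = 2 */

	/* Data buswidth (8 lines): 20 * 8 / 8 = 20 bytes, which clocks the
	 * full 20 dummy cycles. */
	unsigned int nbytes_data = ncycles * 8 / 8;	/* = 20 */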
2024-11-12  RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters  [Kalesh AP]
Enable set_func_resources for P5 and P7 adapters to handle VF resource distribution. Remove setting max resources per VF during PF initialization. This change is required for firmware that does not support RoCE VF resource management by the NIC driver. The code is now the same for all adapters. Reviewed-by: Stephen Shi <stephen.shi@broadcom.com> Reviewed-by: Rukhsana Ansari <rukhsana.ansari@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1730882676-24434-4-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-12  RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design  [Bhargava Chenna Marreddy]
Refine the RoCE SRIOV resource configuration design, using the INITIALIZE_FW flag as an indication of the new design to the firmware. The RoCE driver does not have to provision resources to VFs when the firmware advertises support for RoCE resource management by the NIC driver. Signed-off-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com> CC: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1730882676-24434-3-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-12  bnxt_en: Add support for RoCE sriov configuration  [Vikas Gupta]
During driver load, the PF RDMA driver provisions resources to the RDMA VFs. This logic takes into consideration the total number of VFs supported on the PF while allocating resources. Firmware now advertises a capability where the NIC driver can allocate resources for RDMA VFs when the user actually creates a VF, so this resource distribution can be based on the number of active VFs. This patch adds the support to check for the firmware capability and follow the new RDMA VF resource allocation strategy. The current logic in the RDMA driver will be removed for newer firmware versions in a subsequent patch in this series. Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1730882676-24434-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-12  drm/i915/gsc: ARL-H and ARL-U need a newer GSC FW.  [Daniele Ceraolo Spurio]
All MTL and ARL SKUs share the same GSC FW, but the newer platforms are only supported in newer blobs. In particular, ARL-S is supported starting from 102.0.10.1878 (which is already the minimum required version for ARL in the code), while ARL-H and ARL-U are supported from 102.1.15.1926. Therefore, the driver needs to check which specific ARL subplatform it's running on when verifying that the GSC FW is new enough for it. Fixes: 2955ae8186c8 ("drm/i915: ARL requires a newer GSC firmware") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241028233132.149745-1-daniele.ceraolospurio@intel.com (cherry picked from commit 3c1d5ced18db8a67251c8436cf9bdc061f972bdb) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2024-11-11  soc: qcom: ice: Remove the device_link field in qcom_ice  [Joe Hattori]
The struct device_link *link field in struct qcom_ice is only used to store the result of a device_link_add call with the DL_FLAG_AUTOREMOVE_SUPPLIER flag. With this flag, the resulting value can only be used to check whether the link is present or not, as per the device_link_add description, hence this commit removes the field. Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp> Link: https://lore.kernel.org/r/20241030025046.303342-1-joe@pf.is.s.u-tokyo.ac.jp Signed-off-by: Bjorn Andersson <andersson@kernel.org>
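A minimal sketch of the pattern described above (the function name is hypothetical): with DL_FLAG_AUTOREMOVE_SUPPLIER the core removes the link on its own, so only the NULL check on the return value matters and nothing needs to be stored:

	#include <linux/device.h>

	static int link_to_supplier(struct device *consumer,
				    struct device *supplier)
	{
		/* No need to keep the returned link pointer around. */
		if (!device_link_add(consumer, supplier,
				     DL_FLAG_AUTOREMOVE_SUPPLIER))
			return -EINVAL;

		return 0;
	}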
2024-11-11  clk: qcom: remove unused data from gcc-ipq5424.c  [Arnd Bergmann]
The newly added driver causes warnings when enabling -Wunused-const-variable: drivers/clk/qcom/gcc-ipq5424.c:1064:30: error: 'ftbl_gcc_q6_axi_clk_src' defined but not used [-Werror=unused-const-variable=] 1064 | static const struct freq_tbl ftbl_gcc_q6_axi_clk_src[] = { | ^~~~~~~~~~~~~~~~~~~~~~~ drivers/clk/qcom/gcc-ipq5424.c:957:30: error: 'ftbl_gcc_qpic_clk_src' defined but not used [-Werror=unused-const-variable=] 957 | static const struct freq_tbl ftbl_gcc_qpic_clk_src[] = { | ^~~~~~~~~~~~~~~~~~~~~ drivers/clk/qcom/gcc-ipq5424.c:497:30: error: 'ftbl_gcc_qupv3_2x_core_clk_src' defined but not used [-Werror=unused-const-variable=] 497 | static const struct freq_tbl ftbl_gcc_qupv3_2x_core_clk_src[] = { | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In order to hopefully enable this warning by default in the future, remove the data for now. If it gets used in the future, it can trivially get added back. Fixes: 21b5d5a4a311 ("clk: qcom: add Global Clock controller (GCC) driver for IPQ5424 SoC") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20241111103258.3336183-1-arnd@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2024-11-11  net/mlx5e: SHAMPO, Rework header allocation loop  [Dragos Tatulea]
The current loop code was based on the assumption that there can be page leftovers from previous function calls. This patch changes the allocation loop to make it clearer how pages get allocated every MLX5E_SHAMPO_WQ_HEADER_PER_PAGE headers. This change has no functional implications. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5e: SHAMPO, Drop info array  [Dragos Tatulea]
The info array is used to store a pointer to the dma address of the header and to the frag page. However, this array is not really required:
 - The frag page can be calculated from the header index: frag page index = header index / headers per page.
 - The dma address can be calculated through a formula: dma page address + header offset.
This series gets rid of the info array and uses the above formulas instead. The current_page_index was used in conjunction with the info array to store page fragment indices. This variable is dropped as well. There was no performance regression observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
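A sketch of the two formulas with hypothetical names (hdrs_per_page stands in for MLX5E_SHAMPO_WQ_HEADER_PER_PAGE, hdr_stride for the per-header stride within a page):

	#include <linux/types.h>

	static u32 shampo_hdr_to_frag_page(u32 hdr_idx, u32 hdrs_per_page)
	{
		/* frag page index = header index / headers per page */
		return hdr_idx / hdrs_per_page;
	}

	static dma_addr_t shampo_hdr_dma(dma_addr_t page_dma, u32 hdr_idx,
					 u32 hdrs_per_page, u32 hdr_stride)
	{
		/* dma address = dma page address + header offset */
		return page_dma + (hdr_idx % hdrs_per_page) * hdr_stride;
	}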
2024-11-11  net/mlx5e: SHAMPO, Change frag page setup order during allocation  [Dragos Tatulea]
Now that the UMR allocation has been simplified, it is no longer possible to have a leftover page from a previous call to mlx5e_build_shampo_hd_umr(). This patch simplifies the code by switching the order of operations: first take the frag page and then increment the index. This is more straightforward and it also paves the way for dropping the info array. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5e: SHAMPO, Fix page_index calculation inconsistency  [Dragos Tatulea]
When calculating the index for the next frag page slot, the divisor is incorrect: it should be the number of pages per queue, not the number of headers per queue. This is currently harmless because frag pages are not used directly, but they are intermediated through the info array. But it needs to be fixed as an upcoming patch will get rid of the info array. This patch introduces a new pages-per-queue variable and plugs it into the formula. Now that this variable exists, additional code can be simplified in the SHAMPO initialization code. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5e: SHAMPO, Simplify UMR allocation for headers  [Dragos Tatulea]
Allocating page fragments for header data split is currently more complicated than it should be. That's because the number of KSM entries allocated is not aligned to the number of headers per page. This leads to having leftovers in the next allocation which require additional accounting and needlessly complicated code. This patch aligns (down) the number of KSM entries in the UMR WQE to the number of headers per page by: 1) Aligning the max number of entries allocated per UMR WQE (max_ksm_entries) to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE. 2) Aligning the total number of free headers to MLX5E_SHAMPO_WQ_HEADER_PER_PAGE. ... and then it drops the extra accounting code from mlx5e_build_shampo_hd_umr(). Although the number of entries allocated per UMR WQE is slightly smaller due to aligning down, no performance impact was observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
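A sketch of the align-down step with hypothetical names (hdrs_per_page stands in for MLX5E_SHAMPO_WQ_HEADER_PER_PAGE and is assumed to be a power of two):

	#include <linux/align.h>
	#include <linux/minmax.h>
	#include <linux/types.h>

	static u16 ksm_entries_for_wqe(u16 max_ksm_entries, u16 free_headers,
				       u16 hdrs_per_page)
	{
		/* Align both limits down to whole pages worth of headers so
		 * one UMR WQE never leaves a partially used header page. */
		max_ksm_entries = ALIGN_DOWN(max_ksm_entries, hdrs_per_page);
		free_headers = ALIGN_DOWN(free_headers, hdrs_per_page);

		return min(max_ksm_entries, free_headers);
	}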
2024-11-11  net/mlx5: Make vport QoS enablement more flexible for future extensions  [Carolina Jubran]
Refactor esw_qos_vport_enable to support more generic configurations, allowing it to be reused for new vport node types in future patches. This refactor includes a new way to change the vport parent node by disabling the current setup and re-enabling it with the new parent. This change sets the foundation for adapting configuration based on the parent type in future patches. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: Integrate esw_qos_vport_enable logic into rate operations  [Carolina Jubran]
Fold the esw_qos_vport_enable function into operations for configuring maximum and minimum rates, simplifying QoS logic. This change consolidates enabling and updating the scheduling element configuration, streamlining how vport QoS is initialized and adjusted. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: Generalize scheduling element operations  [Carolina Jubran]
Introduce helper functions to create and destroy scheduling elements, allowing flexible configuration for different scheduling element types. The new helper functions streamline the process by centralizing error handling and logging through esw_qos_sched_elem_op_warn, which now accepts the operation type (create, destroy, or modify). The changes also adjust the esw_qos_vport_enable and mlx5_esw_qos_vport_disable functions to leverage the new generalized create/destroy helpers. The destroy functions now log errors with esw_warn without returning them. This prevents unnecessary error handling since the node was already destroyed and no further action is required from callers. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: Refactor scheduling element configuration bitmasks  [Carolina Jubran]
Refactor esw_qos_sched_elem_config to set bitmasks only when max_rate or bw_share values change, allowing the function to configure nodes with only one of these parameters. This enables more flexible usage for nodes where only one parameter requires configuration. Remove scattered assignments and checks to centralize them within this function, removing the now redundant esw_qos_set_node_max_rate entirely. With this refactor, also remove the assignment of the vport scheduling node max rate to the parent max rate for unlimited vports (where max rate is set to zero), as firmware already handles this behavior. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: Generalize max_rate and min_rate setting for nodes  [Carolina Jubran]
Refactor max_rate and min_rate setting functions to operate on mlx5_esw_sched_node, allowing for generalized handling of both vports and nodes. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: Simplify QoS normalization by removing error handling  [Carolina Jubran]
This change updates esw_qos_normalize_min_rate to not return errors, significantly simplifying the code. Normalization failures are software bugs, and it's unnecessary to handle them with rollback mechanisms. Instead, `esw_qos_update_sched_node_bw_share` and `esw_qos_normalize_min_rate` now return void, with any errors logged as warnings to indicate potential software issues. This approach avoids compensating for hidden bugs and removes error handling from all places that perform normalization, streamlining future patches. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: E-switch, refactor eswitch mode change  [Patrisious Haddad]
The E-switch mode was previously updated before removing and re-adding the IB device, which could cause a temporary mismatch between the E-switch mode and the IB device configuration. To prevent this discrepancy, the IB device is now removed first, then the E-switch mode is updated, and finally, the IB device is re-added. This sequence ensures consistent alignment between the E-switch mode and the IB device whenever the mode changes, regardless of the new mode value. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5e: Disable loopback self-test on multi-PF netdev  [Carolina Jubran]
In Multi-PF (Socket Direct) configurations, when a loopback packet is sent through one of the secondary devices, it will always be received on the primary device. This causes the loopback layer to fail in identifying the loopback packet as the devices are different. To avoid false test failures, disable the loopback self-test in Multi-PF configurations. Fixes: ed29705e4ed1 ("net/mlx5: Enable SD feature") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5e: CT: Fix null-ptr-deref in add rule err flow  [Moshe Shemesh]
In the error flow of mlx5_tc_ct_entry_add_rule(), in case the ct_rule_add() callback returns an error, zone_rule->attr is used uninitialized. Fix it to use attr, which has the needed pointer value. Kernel log: BUG: kernel NULL pointer dereference, address: 0000000000000110 RIP: 0010:mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] … Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x140 ? asm_exc_page_fault+0x22/0x30 ? mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core] ? mlx5_tc_ct_entry_add_rule+0x1d5/0x2f0 [mlx5_core] mlx5_tc_ct_block_flow_offload+0xc6a/0xf90 [mlx5_core] ? nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table] flow_offload_work_handler+0x142/0x320 [nf_flow_table] ? finish_task_switch.isra.0+0x15b/0x2b0 process_one_work+0x16c/0x320 worker_thread+0x28c/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xb8/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2d/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 7fac5c2eced3 ("net/mlx5: CT: Avoid reusing modify header context for natted entries") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5e: clear xdp features on non-uplink representors  [William Tu]
Non-uplink representor ports do not support XDP. The patch clears the xdp feature by checking whether net_device_ops.ndo_bpf is set. Verify using the netlink tool: $ tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump dev-get Representor netdev before the patch: {'ifindex': 8, 'xdp-features': {'basic', 'ndo-xmit', 'ndo-xmit-sg', 'redirect', 'rx-sg', 'xsk-zerocopy'}, 'xdp-rx-metadata-features': set(), 'xdp-zc-max-segs': 1, 'xsk-features': set()}, With the patch: {'ifindex': 8, 'xdp-features': set(), 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Signed-off-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
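A rough sketch of the check (xdp_set_features_flag()/xdp_clear_features_flag() are the in-kernel helpers for the netdev XDP feature flags; the feature bits shown are only illustrative):

	#include <linux/netdevice.h>
	#include <net/xdp.h>

	static void update_xdp_features(struct net_device *netdev)
	{
		/* Representors without an XDP hook advertise nothing. */
		if (!netdev->netdev_ops->ndo_bpf) {
			xdp_clear_features_flag(netdev);
			return;
		}

		xdp_set_features_flag(netdev, NETDEV_XDP_ACT_BASIC |
					      NETDEV_XDP_ACT_REDIRECT);
	}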
2024-11-11  net/mlx5e: kTLS, Fix incorrect page refcounting  [Dragos Tatulea]
The kTLS tx handling code is using a mix of get_page() and page_ref_inc() APIs to increment the page reference. But on the release path (mlx5e_ktls_tx_handle_resync_dump_comp()), only put_page() is used. This is an issue when using pages from large folios: the get_page() references are stored on the folio page while the page_ref_inc() references are stored directly in the given page. On release the folio page will be dereferenced too many times. This was found while doing kTLS testing with sendfile() + ZC when the served file was read from NFS on a kernel with NFS large folios support (commit 49b29a573da8 ("nfs: add support for large folios")). Fixes: 84d1bb2b139e ("net/mlx5e: kTLS, Limit DUMP wqe size") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: fs, lock FTE when checking if active  [Mark Bloch]
The referenced commits introduced a two-step process for deleting FTEs: - Lock the FTE, delete it from hardware, set the hardware deletion function to NULL and unlock the FTE. - Lock the parent flow group, delete the software copy of the FTE, and remove it from the xarray. However, this approach encounters a race condition if a rule with the same match value is added simultaneously. In this scenario, fs_core may set the hardware deletion function to NULL prematurely, causing a panic during subsequent rule deletions. To prevent this, ensure the active flag of the FTE is checked under a lock, which will prevent the fs_core layer from attaching a new steering rule to an FTE that is in the process of deletion. [ 438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func [ 438.968205] ------------[ cut here ]------------ [ 438.968654] refcount_t: decrement hit 0; leaking memory. [ 438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110 [ 438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower] [ 438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8 [ 438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110 [ 438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 [ 438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286 [ 438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000 [ 438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0 [ 438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0 [ 438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0 [ 438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0 [ 438.980607] FS: 00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 [ 438.983984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0 [ 438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 438.986507] Call Trace: [ 438.986799] <TASK> [ 438.987070] ? __warn+0x7d/0x110 [ 438.987426] ? refcount_warn_saturate+0xfb/0x110 [ 438.987877] ? report_bug+0x17d/0x190 [ 438.988261] ? prb_read_valid+0x17/0x20 [ 438.988659] ? handle_bug+0x53/0x90 [ 438.989054] ? exc_invalid_op+0x14/0x70 [ 438.989458] ? asm_exc_invalid_op+0x16/0x20 [ 438.989883] ? refcount_warn_saturate+0xfb/0x110 [ 438.990348] mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core] [ 438.990932] __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core] [ 438.991519] ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core] [ 438.992054] ? 
xas_load+0x9/0xb0 [ 438.992407] mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core] [ 438.993037] mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core] [ 438.993623] mlx5e_flow_put+0x29/0x60 [mlx5_core] [ 438.994161] mlx5e_delete_flower+0x261/0x390 [mlx5_core] [ 438.994728] tc_setup_cb_destroy+0xb9/0x190 [ 438.995150] fl_hw_destroy_filter+0x94/0xc0 [cls_flower] [ 438.995650] fl_change+0x11a4/0x13c0 [cls_flower] [ 438.996105] tc_new_tfilter+0x347/0xbc0 [ 438.996503] ? ___slab_alloc+0x70/0x8c0 [ 438.996929] rtnetlink_rcv_msg+0xf9/0x3e0 [ 438.997339] ? __netlink_sendskb+0x4c/0x70 [ 438.997751] ? netlink_unicast+0x286/0x2d0 [ 438.998171] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 438.998625] netlink_rcv_skb+0x54/0x100 [ 438.999020] netlink_unicast+0x203/0x2d0 [ 438.999421] netlink_sendmsg+0x1e4/0x420 [ 438.999820] __sock_sendmsg+0xa1/0xb0 [ 439.000203] ____sys_sendmsg+0x207/0x2a0 [ 439.000600] ? copy_msghdr_from_user+0x6d/0xa0 [ 439.001072] ___sys_sendmsg+0x80/0xc0 [ 439.001459] ? ___sys_recvmsg+0x8b/0xc0 [ 439.001848] ? generic_update_time+0x4d/0x60 [ 439.002282] __sys_sendmsg+0x51/0x90 [ 439.002658] do_syscall_64+0x50/0x110 [ 439.003040] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: Fix msix vectors to respect platform limit  [Parav Pandit]
The number of PCI vectors allocated by the platform (which may be fewer than requested) is currently not honored when creating the SF pool; only the PCI MSI-X capability is considered. As a result, when a platform allocates fewer vectors (in non-dynamic mode) than requested, the PF and SF pools end up with an invalid vector range. This causes incorrect SF vector accounting, which leads to the following call trace when an invalid IRQ vector is allocated. This issue is resolved by ensuring that the platform's vector limit is respected for both the SF and PF pools. Workqueue: mlx5_vhca_event0 mlx5_sf_dev_add_active_work [mlx5_core] RIP: 0010:pci_irq_vector+0x23/0x80 RSP: 0018:ffffabd5cebd7248 EFLAGS: 00010246 RAX: ffff980880e7f308 RBX: ffff9808932fb880 RCX: 0000000000000001 RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff980880e7f308 RBP: 0000000000000200 R08: 0000000000000010 R09: ffff97a9116f0860 R10: 0000000000000002 R11: 0000000000000228 R12: ffff980897cd0160 R13: 0000000000000000 R14: ffff97a920fec0c0 R15: ffffabd5cebd72d0 FS: 0000000000000000(0000) GS:ffff97c7ff9c0000(0000) knlGS:0000000000000000 ? rescuer_thread+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core.sf mlx5_core.sf.6: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.7: firmware version: 32.43.356 mlx5_core.sf mlx5_core.sf.6 enpa1s0f0s4: renamed from eth0 mlx5_core.sf mlx5_core.sf.7: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22 Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation") Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net/mlx5: E-switch, unload IB representors when unloading ETH representors  [Chiara Meiohas]
IB representors depend on ETH representors, so the IB representors should not exist without the ETH ones. When unloading the ETH representors, the corresponding IB representors should also be unloaded. The commit 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") introduced the use of the ib_device_set_netdev API in IB representors. ib_device_set_netdev() increments the refcount of the representor's netdev when loading an IB representor and decrements it when unloading. Without the unloading of the IB representor, the refcount of the representor's netdev remains greater than 0, preventing it from being unregistered. The patch uncovered an underlying bug where the ETH representor is unloaded without unloading the IB representor. This issue happened when using multiport E-switch and rebooting, causing the shutdown to hang when unloading the ETH representor because the refcount of the representor's netdevice was greater than 0. Call trace: unregister_netdevice: waiting for eth3 to become free. Usage count = 2 ref_tracker: eth%d@00000000661d60f7 has 1/1 users at ib_device_set_netdev+0x160/0x2d0 [ib_core] mlx5_ib_vport_rep_load+0x104/0x3f0 [mlx5_ib] mlx5_eswitch_reload_ib_reps+0xfc/0x110 [mlx5_core] mlx5_mpesw_work+0x236/0x330 [mlx5_core] process_one_work+0x169/0x320 worker_thread+0x288/0x3a0 kthread+0xb8/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net: phylink: clean up phylink_resolve()  [Russell King (Oracle)]
Now that we have reduced the indentation level, clean up the code formatting. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQz-002Ff5-EA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net: phylink: remove switch() statement in resolve handling  [Russell King (Oracle)]
The switch() statement doesn't sit very well with the preceding if() statements, so let's just convert everything to if()s. As a result of the two preceding commits, there is now only one case in the switch() statement. Remove the switch statement and reduce the code indentation. Code reformatting will be in the following commit. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQu-002Fez-AA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net: phylink: move MLO_AN_PHY resolve handling to if() statement  [Russell King (Oracle)]
The switch() statement doesn't sit very well with the preceding if() statements, and results in excessive indentation that spoils code readability. Continue cleaning this up by converting the MLO_AN_PHY case to use an if() statement. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQp-002Fet-5W@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net: phylink: move MLO_AN_FIXED resolve handling to if() statement  [Russell King (Oracle)]
The switch() statement doesn't sit very well with the preceding if() statements, and results in excessive indentation that spoils code readability. Begin cleaning this up by converting the MLO_AN_FIXED case to an if() statement. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQk-002Fen-1A@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  net: phylink: move manual flow control setting  [Russell King (Oracle)]
Move the handling of manual flow control configuration to a common location during resolve. We currently evaluate this for all but fixed links. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1t9RQe-002Feh-T1@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  bnxt_en: add unlocked version of bnxt_refclk_read  [Vadim Fedorenko]
Serialization of the PHC read with the FW reset mechanism uses ptp_lock, which also protects timecounter updates. This means we cannot grab it when called from bnxt_cc_read(). Let's move the locking into a different function. Fixes: 6c0828d00f07 ("bnxt_en: replace PTP spinlock with seqlock") Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20241107214917.2980976-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
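A generic sketch of the locked/unlocked split (types and names are hypothetical; the driver itself uses a seqlock): the double-underscore variant assumes the caller already holds the lock, so paths such as the cyclecounter read can use it without re-taking it:

	#include <linux/spinlock.h>
	#include <linux/types.h>

	struct phc_sketch {
		spinlock_t lock;
		u64 last_ns;	/* stand-in for the real PHC registers */
	};

	/* Caller must hold phc->lock. */
	static u64 __refclk_read(struct phc_sketch *phc)
	{
		return phc->last_ns;
	}

	static u64 refclk_read(struct phc_sketch *phc)
	{
		unsigned long flags;
		u64 ns;

		spin_lock_irqsave(&phc->lock, flags);
		ns = __refclk_read(phc);
		spin_unlock_irqrestore(&phc->lock, flags);

		return ns;
	}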
2024-11-11  netkit: Set IFLA_NETKIT_PEER_INFO to netkit_link_ops.peer_type.  [Kuniyuki Iwashima]
For per-netns RTNL, we need to prefetch the peer device's netns. Let's set rtnl_link_ops.peer_type and accordingly remove duplicated validation in ->newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  vxcan: Set VXCAN_INFO_PEER to vxcan_link_ops.peer_type.  [Kuniyuki Iwashima]
For per-netns RTNL, we need to prefetch the peer device's netns. Let's set rtnl_link_ops.peer_type and accordingly remove duplicated validation in ->newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  veth: Set VETH_INFO_PEER to veth_link_ops.peer_type.  [Kuniyuki Iwashima]
For per-netns RTNL, we need to prefetch the peer device's netns. Let's set rtnl_link_ops.peer_type and accordingly remove duplicated validation in ->newlink(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20241108004823.29419-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
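A sketch of what setting the field looks like (most members elided; the peer_type field is the one added by this series):

	#include <net/rtnetlink.h>
	#include <linux/veth.h>

	static struct rtnl_link_ops veth_link_ops_sketch = {
		.kind      = "veth",
		/* Tells rtnetlink which attribute carries the peer's IFLA_*
		 * payload so the peer netns can be prefetched before
		 * ->newlink() runs. */
		.peer_type = VETH_INFO_PEER,
	};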
2024-11-11  rtnetlink: Remove __rtnl_link_register()  [Kuniyuki Iwashima]
link_ops is protected by link_ops_mutex and no longer needs RTNL, so we have no reason to have __rtnl_link_register() separately. Let's remove it and call rtnl_link_register() from ifb.ko and dummy.ko. Note that both modules' init() work on init_net only, so we need not export pernet_ops_rwsem and can use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  rtnetlink: Remove __rtnl_link_unregister().  [Kuniyuki Iwashima]
rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(), where we call synchronize_srcu() to wait for inflight RTM_NEWLINK requests for per-netns RTNL. We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko and dummy.ko. However, rtnl_newlink() will acquire SRCU before RTNL later in this series. Then, lockdep will detect the deadlock:

  rtnl_link_unregister()       rtnl_newlink()
  ----                         ----
  lock(rtnl_mutex);
                               lock(&ops->srcu);
                               lock(rtnl_mutex);
  sync(&ops->srcu);

To avoid the problem, we must call synchronize_srcu() before RTNL in rtnl_link_unregister(). As a preparation, let's remove __rtnl_link_unregister(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11  zram: ZRAM_DEF_COMP should depend on ZRAM  [Geert Uytterhoeven]
When Compressed RAM block device support is disabled, the CONFIG_ZRAM_DEF_COMP symbol still ends up in the generated config file: CONFIG_ZRAM_DEF_COMP="unset-value" While this causes no real harm, avoid polluting the config file by adding a dependency on ZRAM. Link: https://lkml.kernel.org/r/64e05bad68a9bd5cc322efd114a04d25de525940.1730807319.git.geert@linux-m68k.org Fixes: 917a59e81c34 ("zram: introduce custom comp backends API") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11  kmemleak: iommu/iova: fix transient kmemleak false positive  [Catalin Marinas]
The introduction of iova_depot_pop() in 911aa1245da8 ("iommu/iova: Make the rcache depot scale better") confused kmemleak by moving a struct iova_magazine object from a singly linked list to rcache->depot and resetting the 'next' pointer referencing it. Unlike doubly linked lists, the content of the object being referred is never changed on removal from a singly linked list and the kmemleak checksum heuristics do not detect such scenario. This leads to false positives like: unreferenced object 0xffff8881a5301000 (size 1024): comm "softirq", pid 0, jiffies 4306297099 (age 462.991s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 e7 7d 05 00 00 00 00 00 .........}...... 0f b4 05 00 00 00 00 00 b4 96 05 00 00 00 00 00 ................ backtrace: [<ffffffff819f5f08>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff818a239a>] kmalloc_trace+0x2a/0x60 [<ffffffff8231d31e>] free_iova_fast+0x28e/0x4e0 [<ffffffff82310860>] fq_ring_free_locked+0x1b0/0x310 [<ffffffff8231225d>] fq_flush_timeout+0x19d/0x2e0 [<ffffffff813e95ba>] call_timer_fn+0x19a/0x5c0 [<ffffffff813ea16b>] __run_timers+0x78b/0xb80 [<ffffffff813ea5bd>] run_timer_softirq+0x5d/0xd0 [<ffffffff82f1d915>] __do_softirq+0x205/0x8b5 Introduce kmemleak_transient_leak() which resets the object checksum requiring another scan pass before it is reported (if still unreferenced). Call this new API in iova_depot_pop(). Link: https://lkml.kernel.org/r/20241104111944.2207155-1-catalin.marinas@arm.com Link: https://lore.kernel.org/r/ZY1osaGLyT-sdKE8@shredder/ Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Ido Schimmel <idosch@idosch.org> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
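A sketch of the resulting usage (structure details reduced to a bare minimum): popping from the singly linked depot does not modify the object, so the new API asks kmemleak to re-checksum it and report it only if it is still unreferenced on the next scan:

	#include <linux/kmemleak.h>

	struct magazine_sketch {
		struct magazine_sketch *next;
	};

	static struct magazine_sketch *depot_pop_sketch(struct magazine_sketch **depot)
	{
		struct magazine_sketch *mag = *depot;

		kmemleak_transient_leak(mag);	/* new API from this patch */
		*depot = mag->next;
		mag->next = NULL;

		return mag;
	}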
2024-11-11  mm/list_lru: simplify the list_lru walk callback function  [Kairui Song]
Now isolation no longer takes the list_lru global node lock; only the per-cgroup lock is used instead. And since this lock is inside the list_lru_one being walked, it no longer needs to be passed explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11  mm/list_lru: split the lock to per-cgroup scope  [Kairui Song]
Currently, every list_lru has a per-node lock that protects adding, deletion, isolation, and reparenting of all list_lru_one instances belonging to this list_lru on this node. This lock contention is heavy when multiple cgroups modify the same list_lru. This lock can be split into per-cgroup scope to reduce contention. To achieve this, we need a stable list_lru_one for every cgroup. This commit adds a lock to each list_lru_one and introduces a helper function lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg. The reparenting process is then reworked. Reparenting will switch the list_lru_one instances one by one. By locking each instance and marking it dead using the nr_items counter, reparenting ensures that all items in the corresponding cgroup (on-list or not, because items have a stable cgroup, see below) will see the list_lru_one switch synchronously. Objcg reparent is also moved after list_lru reparent so items will have a stable mem cgroup until all list_lru_one instances are drained. The only callers that don't use the *_obj interfaces are direct calls to list_lru_{add,del}. But it's only used by zswap and that's also based on objcg, so it's fine. This also changes the behaviour of the isolation function when LRU_RETRY or LRU_REMOVED_RETRY is returned, because now releasing the lock could unblock reparenting and free the list_lru_one, so the isolation function will have to return without re-locking the lru.

prepare() {
    mkdir /tmp/test-fs
    modprobe brd rd_nr=1 rd_size=33554432
    mkfs.xfs -f /dev/ram0
    mount -t xfs /dev/ram0 /tmp/test-fs
    for i in $(seq 1 512); do
        mkdir "/tmp/test-fs/$i"
        for j in $(seq 1 10240); do
            echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
        done &
    done; wait
}

do_test() {
    read_worker() {
        sleep 1
        tar -cv "$1" &>/dev/null
    }
    read_in_all() {
        cd "/tmp/test-fs" && ls
        for i in $(seq 1 512); do
            (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
            read_worker "$i" &
        done; wait
    }
    for i in $(seq 1 512); do
        mkdir -p "/sys/fs/cgroup/benchmark/$i"
    done
    echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
    echo 512M > /sys/fs/cgroup/benchmark/memory.max
    echo 3 > /proc/sys/vm/drop_caches
    time read_in_all
}

The above script simulates compression of small files in multiple cgroups with memory pressure. Run prepare(), then do_test() 6 times:

Before:
real 0m7.762s user 0m11.340s sys 3m11.224s
real 0m8.123s user 0m11.548s sys 3m2.549s
real 0m7.736s user 0m11.515s sys 3m11.171s
real 0m8.539s user 0m11.508s sys 3m7.618s
real 0m7.928s user 0m11.349s sys 3m13.063s
real 0m8.105s user 0m11.128s sys 3m14.313s

After this commit (about ~15% faster):
real 0m6.953s user 0m11.327s sys 2m42.912s
real 0m7.453s user 0m11.343s sys 2m51.942s
real 0m6.916s user 0m11.269s sys 2m43.957s
real 0m6.894s user 0m11.528s sys 2m45.346s
real 0m6.911s user 0m11.095s sys 2m43.168s
real 0m6.773s user 0m11.518s sys 2m40.774s

Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11  r8169: use helper r8169_mod_reg8_cond to simplify rtl_jumbo_config  [Heiner Kallweit]
Use the recently added helper r8169_mod_reg8_cond() to simplify the jumbo mode configuration. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/3df1d484-a02e-46e7-8f75-db5b428e422e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
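A hypothetical shape for such a helper (the driver's real signature may differ): set the given bits in an 8-bit register when the condition is true, clear them otherwise, so the jumbo configuration collapses to single conditional calls instead of separate enable/disable branches:

	#include <linux/io.h>
	#include <linux/types.h>

	static void mod_reg8_cond_sketch(void __iomem *addr, u8 bits, bool cond)
	{
		u8 val = readb(addr);

		writeb(cond ? val | bits : val & ~bits, addr);
	}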