summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-02-16net: bridge: switchdev: Ensure deferred event delivery on unoffloadTobias Waldekranz
When unoffloading a device, it is important to ensure that all relevant deferred events are delivered to it before it disassociates itself from the bridge. Before this change, this was true for the normal case when a device maps 1:1 to a net_bridge_port, i.e. br0 / swp0 When swp0 leaves br0, the call to switchdev_deferred_process() in del_nbp() makes sure to process any outstanding events while the device is still associated with the bridge. In the case when the association is indirect though, i.e. when the device is attached to the bridge via an intermediate device, like a LAG... br0 / lag0 / swp0 ...then detaching swp0 from lag0 does not cause any net_bridge_port to be deleted, so there was no guarantee that all events had been processed before the device disassociated itself from the bridge. Fix this by always synchronously processing all deferred events before signaling completion of unoffloading back to the driver. Fixes: 4e51bf44a03a ("net: bridge: move the switchdev object replay helpers to "push" mode") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: bridge: switchdev: Skip MDB replays of deferred events on offloadTobias Waldekranz
Before this change, generation of the list of MDB events to replay would race against the creation of new group memberships, either from the IGMP/MLD snooping logic or from user configuration. While new memberships are immediately visible to walkers of br->mdb_list, the notification of their existence to switchdev event subscribers is deferred until a later point in time. So if a replay list was generated during a time that overlapped with such a window, it would also contain a replay of the not-yet-delivered event. The driver would thus receive two copies of what the bridge internally considered to be one single event. On destruction of the bridge, only a single membership deletion event was therefore sent. As a consequence of this, drivers which reference count memberships (at least DSA), would be left with orphan groups in their hardware database when the bridge was destroyed. This is only an issue when replaying additions. While deletion events may still be pending on the deferred queue, they will already have been removed from br->mdb_list, so no duplicates can be generated in that scenario. To a user this meant that old group memberships, from a bridge in which a port was previously attached, could be reanimated (in hardware) when the port joined a new bridge, without the new bridge's knowledge. For example, on an mv88e6xxx system, create a snooping bridge and immediately add a port to it: root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \ > ip link set dev x3 up master br0 And then destroy the bridge: root@infix-06-0b-00:~$ ip link del dev br0 root@infix-06-0b-00:~$ mvls atu ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a DEV:0 Marvell 88E6393X 33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . . 33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . . ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a root@infix-06-0b-00:~$ The two IPv6 groups remain in the hardware database because the port (x3) is notified of the host's membership twice: once via the original event and once via a replay. Since only a single delete notification is sent, the count remains at 1 when the bridge is destroyed. Then add the same port (or another port belonging to the same hardware domain) to a new bridge, this time with snooping disabled: root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \ > ip link set dev x3 up master br1 All multicast, including the two IPv6 groups from br0, should now be flooded, according to the policy of br1. But instead the old memberships are still active in the hardware database, causing the switch to only forward traffic to those groups towards the CPU (port 0). Eliminate the race in two steps: 1. Grab the write-side lock of the MDB while generating the replay list. This prevents new memberships from showing up while we are generating the replay list. But it leaves the scenario in which a deferred event was already generated, but not delivered, before we grabbed the lock. Therefore: 2. Make sure that no deferred version of a replay event is already enqueued to the switchdev deferred queue, before adding it to the replay list, when replaying additions. Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net/iucv: fix the allocation size of iucv_path_table arrayAlexander Gordeev
iucv_path_table is a dynamically allocated array of pointers to struct iucv_path items. Yet, its size is calculated as if it was an array of struct iucv_path items. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16Merge branch 'ionic-xdp-support'David S. Miller
Shannon Nelson says: ==================== ionic: add XDP support This patchset is new support in ionic for XDP processing, including basic XDP on Rx packets, TX and REDIRECT, and frags for jumbo frames. Since ionic has not yet been converted to use the page_pool APIs, this uses the simple MEM_TYPE_PAGE_ORDER0 buffering. There are plans to convert the driver in the near future. v4: - removed "inline" from short utility functions - changed to use "goto err_out" in ionic_xdp_register_rxq_info() - added "continue" to reduce nesting in ionic_xdp_queues_config() - used xdp_prog in ionic_rx_clean() to flag whether or not to sync the rx buffer after calling ionix_xdp_run() - swapped order of XDP_TX and XDP_REDIRECT cases in ionic_xdp_run() to make patch 6 a little cleaner v3: https://lore.kernel.org/netdev/20240210004827.53814-1-shannon.nelson@amd.com/ - removed budget==0 patch, sent it separately to net v2: https://lore.kernel.org/netdev/20240208005725.65134-1-shannon.nelson@amd.com/ - added calls to txq_trans_cond_update() - added a new patch to catch NAPI budget==0 v1: https://lore.kernel.org/netdev/20240130013042.11586-1-shannon.nelson@amd.com/ RFC: https://lore.kernel.org/netdev/20240118192500.58665-1-shannon.nelson@amd.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: implement xdp frags supportShannon Nelson
Add support for using scatter-gather / frags in XDP in both Rx and Tx paths. Co-developed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: add ndo_xdp_xmitShannon Nelson
When our ndo_xdp_xmit is called we mark the buffer with XDP_REDIRECT so we know to return it to the XDP stack for cleaning. Co-developed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: Add XDP_REDIRECT supportShannon Nelson
The XDP_REDIRECT packets are given to the XDP stack and we drop the use of the related page: it will get freed by the driver that ends up doing the Tx. Because we have some hardware configurations with limited queue resources, we use the existing datapath Tx queues rather than creating and managing a separate set of xdp_tx queues. Co-developed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: Add XDP_TX supportShannon Nelson
The XDP_TX packets get fed back into the Rx queue's partnered Tx queue as an xdp_frame. Co-developed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: Add XDP packet headroomShannon Nelson
If an xdp program is loaded, add headroom at the beginning of the frame to allow for editing and insertions that an XDP program might need room for, and tailroom used later for XDP frame tracking. These are only needed in the first Rx buffer in a packet, not for any trailing frags. Co-developed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: add initial framework for XDP supportShannon Nelson
Set up the basics for running Rx packets through XDP programs. Add new queue setup and teardown steps for adding/removing an XDP program, and add the call to run the XDP on a packet. The XDP frame size needs to be the MTU plus standard ethernet header, plus head room for XDP scribblings and tail room for a struct skb_shared_info. Also, at this point, we don't support XDP frags, only a single contiguous Rx buffer. This means that our page splitting is not very useful, so when XDP is in use we need to use the full Rx buffer size and not do sharing. Co-developed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: use dma range APIsShannon Nelson
Convert Rx datapath handling to use the DMA range APIs in preparation for adding XDP handling. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: add helpers for accessing buffer infoShannon Nelson
These helpers clean up some of the code around DMA mapping and other buffer references, and will be used in the next few patches for the XDP support. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16ionic: set adminq irq affinityShannon Nelson
We claim to have the AdminQ on our irq0 and thus cpu id 0, but we need to be sure we set the affinity hint to try to keep it there. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net/iucv: fix virtual vs physical address confusionAlexander Gordeev
Fix virtual vs physical address confusion. This does not fix a bug since virtual and physical address spaces are currently the same. Acked-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16Merge branch 'ravb-rutime-PM-support'David S. Miller
Claudiu Beznea says: ==================== net: ravb: Add runtime PM support (part 2) Series adds runtime PM support for the ravb driver. This is a continuation of [1]. There are 5 more preparation patches (patches 1-5) and patch 6 adds runtime PM support. Patches in this series were part of [2]. Changes in v4: - remove unnecessary code from patch 4/6 - improve the code in patch 5/6 Changes in v3: - fixed typos - added patch "net: ravb: Move the update of ndev->features to ravb_set_features()" - changes title of patch "net: ravb: Do not apply RX checksum settings to hardware if the interface is down" from v2 into "net: ravb: Do not apply features to hardware if the interface is down", changed patch description and updated the patch - collected tags Changes in v2: - address review comments - in patch 4/5 take into account the latest changes introduced in ravb_set_features_gbeth() Changes since [2]: - patch 1/5 is new - use pm_runtime_get_noresume() and pm_runtime_active() in patches 3/5, 4/5 - fixed higlighted typos in patch 4/5 [1] https://lore.kernel.org/all/20240202084136.3426492-1-claudiu.beznea.uj@bp.renesas.com/ [2] https://lore.kernel.org/all/20240105082339.1468817-1-claudiu.beznea.uj@bp.renesas.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: ravb: Add runtime PM supportClaudiu Beznea
Add runtime PM support for the ravb driver. As the driver is used by different IP variants, with different behaviors, to be able to have the runtime PM support available for all devices, the preparatory commits moved all the resources parsing and allocations in the driver's probe function and kept the settings for ravb_open(). This is due to the fact that on some IP variants-platforms tuples disabling/enabling the clocks will switch the IP to the reset operation mode where register contents is lost and reconfiguration needs to be done. For this the rabv_open() function enables the clocks, switches the IP to configuration mode, applies all the register settings and switches the IP to the operational mode. At the end of ravb_open() IP is ready to send/receive data. In ravb_close() necessary reverts are done (compared with ravb_open()), the IP is switched to reset mode and clocks are disabled. The ethtool APIs or IOCTLs that might execute while the interface is down are either cached (and applied in ravb_open()) or rejected (as at that time the IP is in reset mode). Keeping the IP in the reset mode also increases the power saved (according to the hardware manual). Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: ravb: Do not apply features to hardware if the interface is downClaudiu Beznea
Do not apply features to hardware if the interface is down. In case runtime PM is enabled, and while the interface is down, the IP will be in reset mode (as for some platforms disabling the clocks will switch the IP to reset mode, which will lead to losing register contents) and applying settings in reset mode is not an option. Instead, cache the features and apply them in ravb_open() through ravb_emac_init(). To avoid accessing the hardware while the interface is down pm_runtime_active() check was introduced. Along with it the device runtime PM usage counter has been incremented to avoid disabling the device clocks while the check is in progress (if any). Commit prepares for the addition of runtime PM. Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: ravb: Move the update of ndev->features to ravb_set_features()Claudiu Beznea
Commit c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") introduced support for setting GbEth features. With this the IP-specific features update functions update the ndev->features individually. Next commits add runtime PM support for the ravb driver. The runtime PM implementation will enable/disable the IP clocks on the ravb_open()/ravb_close() functions. Accessing the IP registers with clocks disabled blocks the system. The ravb_set_features() function could be executed when the Ethernet interface is closed so we need to ensure we don't access IP registers while the interface is down when runtime PM support will be in place. For these, move the update of ndev->features to ravb_set_features(). In this way we update the ndev->features only when the IP-specific features set function returns success and we can avoid code duplication when introducing runtime PM registers protection. Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: ravb: Return cached statistics if the interface is downClaudiu Beznea
Return the cached statistics in case the interface is down. There should be no drawback to this, as cached statistics are updated in ravb_close(). In order to avoid accessing the IP registers while the IP is runtime suspended pm_runtime_active() check was introduced. The device runtime PM usage counter has been incremented to avoid disabling the device clocks while the check is in progress (if any). The commit prepares the code for the addition of runtime PM support. Suggested-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: ravb: Keep the reverse order of operations in ravb_close()Claudiu Beznea
Keep the reverse order of operations in ravb_close() when compared with ravb_open(). This is the recommended configuration sequence. Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net: ravb: Get rid of the temporary variable irqClaudiu Beznea
The 4th argument of ravb_setup_irq() is used to save the IRQ number that will be further used by the driver code. Not all ravb_setup_irqs() calls need to save the IRQ number. The previous code used to pass a dummy variable as the 4th argument in case the IRQ is not needed for further usage. That is not necessary as the code from ravb_setup_irq() can detect by itself if the IRQ needs to be saved. Thus, get rid of the code that is not needed. Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16selftests: bonding: make sure new active is not nullHangbin Liu
One of Jakub's tests[1] shows that there may be period all ports are down and no active slave. This makes the new_active_slave null and the test fails. Add a check to make sure the new active is not null. [ 189.051966] br0: port 2(s1) entered disabled state [ 189.317881] bond0: (slave eth1): link status definitely down, disabling slave [ 189.318487] bond0: (slave eth2): making interface the new active one [ 190.435430] br0: port 4(s2) entered disabled state [ 190.773786] bond0: (slave eth0): link status definitely down, disabling slave [ 190.774204] bond0: (slave eth2): link status definitely down, disabling slave [ 190.774715] bond0: now running without any active interface! [ 190.877760] bond0: (slave eth0): link status definitely up [ 190.878098] bond0: (slave eth0): making interface the new active one [ 190.878495] bond0: active interface up! [ 191.802872] br0: port 4(s2) entered blocking state [ 191.803157] br0: port 4(s2) entered forwarding state [ 191.813756] bond0: (slave eth2): link status definitely up [ 192.847095] br0: port 2(s1) entered blocking state [ 192.847396] br0: port 2(s1) entered forwarding state [ 192.853740] bond0: (slave eth1): link status definitely up # TEST: prio (active-backup ns_ip6_target primary_reselect 1) [FAIL] # Current active slave is null but not eth0 [1] https://netdev-3.bots.linux.dev/vmksft-bonding/results/464481/1-bond-options-sh/stdout Fixes: 45bf79bc56c4 ("selftests: bonding: reduce garp_test/arp_validate test time") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16Merge tag 'drm-msm-fixes-2024-02-15' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/msm into drm-fixes Fixes for v6.8-rc5 GPU: - dmabuf vmap fix - a610 UBWC corruption fix (incorrect hbb) - revert a commit that was making GPU recovery unreliable - tlb invalidation fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rob Clark <robdclark@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGszDSiw66+a=ttBr-hat+zrcBtfc_cZ4LQqXu89DJ0UeQ@mail.gmail.com
2024-02-16Merge tag 'amd-drm-fixes-6.8-2024-02-15-2' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.8-2024-02-15-2: amdgpu: - PSR fixes - Suspend/resume fixes - Link training fix - Aspect ratio fix - DCN 3.5 fixes - VCN 4.x fix - GFX 11 fix - Misc display fixes - Misc small fixes amdkfd: - Cache size reporting fix - SIMD distribution fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240215192452.11805-1-alexander.deucher@amd.com
2024-02-16Merge tag 'drm-xe-fixes-2024-02-15' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes Driver Changes: - Fix an out-of-bounds shift. - Fix the display code thinking xe uses shmem - Fix a warning about index out-of-bound - Fix a clang-16 compilation warning Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/Zc4GpcrbFVqdK9Ws@fedora
2024-02-16Merge tag 'drm-intel-fixes-2024-02-15' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes Fix for #10172: Blank screen on JSL Chromebooks. Stable fix to limit DP SST link rate to <=8.1Gbps. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/Zc37W27F5OvoeSkG@jlahtine-mobl.ger.corp.intel.com
2024-02-15net: dsa: remove OF-based MDIO bus registration from DSA coreArınç ÜNAL
The code block under the "!ds->user_mii_bus && ds->ops->phy_read" check under dsa_switch_setup() populates ds->user_mii_bus. The use of ds->user_mii_bus is inappropriate when the MDIO bus of the switch is described on the device tree [1]. For this reason, use this code block only for switches [with MDIO bus] probed on platform_data, and OF which the switch MDIO bus isn't described on the device tree. Therefore, remove OF-based MDIO bus registration as it's useless for these cases. These subdrivers which control switches [with MDIO bus] probed on OF, will lose the ability to register the MDIO bus OF-based: drivers/net/dsa/b53/b53_common.c drivers/net/dsa/lan9303-core.c drivers/net/dsa/vitesse-vsc73xx-core.c These subdrivers let the DSA core driver register the bus: - ds->ops->phy_read() and ds->ops->phy_write() are present. - ds->user_mii_bus is not populated. The commit fe7324b93222 ("net: dsa: OF-ware slave_mii_bus") which brought OF-based MDIO bus registration on the DSA core driver is reasonably recent and, in this time frame, there have been no device trees in the Linux repository that started describing the MDIO bus, or dt-bindings defining the MDIO bus for the switches these subdrivers control. So I don't expect any devices to be affected. The logic we encourage is that all subdrivers should register the switch MDIO bus on their own [2]. And, for subdrivers which control switches [with MDIO bus] probed on OF, this logic must be followed to support all cases properly: No switch MDIO bus defined: Populate ds->user_mii_bus, register the MDIO bus, set the interrupts for PHYs if "interrupt-controller" is defined at the switch node. This case should only be covered for the switches which their dt-bindings documentation didn't document the MDIO bus from the start. This is to keep supporting the device trees that do not describe the MDIO bus on the device tree but the MDIO bus is being used nonetheless. Switch MDIO bus defined: Don't populate ds->user_mii_bus, register the MDIO bus, set the interrupts for PHYs if ["interrupt-controller" is defined at the switch node and "interrupts" is defined at the PHY nodes under the switch MDIO bus node]. Switch MDIO bus defined but explicitly disabled: If the device tree says status = "disabled" for the MDIO bus, we shouldn't need an MDIO bus at all. Instead, just exit as early as possible and do not call any MDIO API. After all subdrivers that control switches with MDIO buses are made to register the MDIO buses on their own, we will be able to get rid of dsa_switch_ops :: phy_read() and :: phy_write(), and the code block for registering the MDIO bus on the DSA core driver. Link: https://lore.kernel.org/netdev/20231213120656.x46fyad6ls7sqyzv@skbuf/ [1] Link: https://lore.kernel.org/netdev/20240103184459.dcbh57wdnlox6w7d@skbuf/ [2] Suggested-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Acked-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20240213-for-netnext-dsa-mdio-bus-v2-1-0ff6f4823a9e@arinc9.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-16Merge tag 'drm-misc-fixes-2024-02-15' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes A suspend/resume error fix for ivpu, a couple of scheduler fixes for nouveau, a patch to support large page arrays in prime, a uninitialized variable fix in crtc, a locking fix in rockchip/vop2 and a buddy allocator error reporting fix. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/b4ffqzigtfh6cgzdpwuk6jlrv3dnk4hu6etiizgvibysqgtl2p@42n2gdfdd5eu
2024-02-15smb: Fix regression in writes when non-standard maximum write size negotiatedSteve French
The conversion to netfs in the 6.3 kernel caused a regression when maximum write size is set by the server to an unexpected value which is not a multiple of 4096 (similarly if the user overrides the maximum write size by setting mount parm "wsize", but sets it to a value that is not a multiple of 4096). When negotiated write size is not a multiple of 4096 the netfs code can skip the end of the final page when doing large sequential writes, causing data corruption. This section of code is being rewritten/removed due to a large netfs change, but until that point (ie for the 6.3 kernel until now) we can not support non-standard maximum write sizes. Add a warning if a user specifies a wsize on mount that is not a multiple of 4096 (and round down), also add a change where we round down the maximum write size if the server negotiates a value that is not a multiple of 4096 (we also have to check to make sure that we do not round it down to zero). Reported-by: R. Diez" <rdiez-2006@rd10.de> Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list") Suggested-by: Ronnie Sahlberg <ronniesahlberg@gmail.com> Acked-by: Ronnie Sahlberg <ronniesahlberg@gmail.com> Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org # v6.3+ Cc: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-02-15Merge branch 'fix-the-read-of-vsyscall-page-through-bpf'Alexei Starovoitov
Hou Tao says: ==================== Fix the read of vsyscall page through bpf From: Hou Tao <houtao1@huawei.com> Hi, As reported by syzboot [1] and [2], when trying to read vsyscall page by using bpf_probe_read_kernel() or bpf_probe_read(), oops may happen. Thomas Gleixner had proposed a test patch [3], but it seems that no formal patch is posted after about one month [4], so I post it instead and add an Originally-by tag in patch #2. Patch #1 makes is_vsyscall_vaddr() being a common helper. Patch #2 fixes the problem by disallowing vsyscall page read for copy_from_kernel_nofault(). Patch #3 adds one test case to ensure the read of vsyscall page through bpf is rejected. Please see individual patches for more details. Comments are always welcome. [1]: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com/ [2]: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com/ [3]: https://lore.kernel.org/bpf/87r0jwquhv.ffs@tglx/ [4]: https://lore.kernel.org/bpf/e24b125c-8ff4-9031-6c53-67ff2e01f316@huaweicloud.com/ Change Log: v3: * rephrase commit message for patch #1 & #2 (Sohil) * reword comments in copy_from_kernel_nofault_allowed() (Sohil) * add Rvb tag for patch #1 and Acked-by tag for patch #3 (Sohil, Yonghong) v2: https://lore.kernel.org/bpf/20240126115423.3943360-1-houtao@huaweicloud.com/ * move is_vsyscall_vaddr to asm/vsyscall.h instead (Sohil) * elaborate on the reason for disallowing of vsyscall page read in copy_from_kernel_nofault_allowed() (Sohil) * update the commit message of patch #2 to more clearly explain how the oops occurs. (Sohil) * update the commit message of patch #3 to explain the expected return values of various bpf helpers (Yonghong) v1: https://lore.kernel.org/bpf/20240119073019.1528573-1-houtao@huaweicloud.com/ ==================== Link: https://lore.kernel.org/r/20240202103935.3154011-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15selftest/bpf: Test the read of vsyscall page under x86-64Hou Tao
Under x86-64, when using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read vsyscall page, the read may trigger oops, so add one test case to ensure that the problem is fixed. Beside those four bpf helpers mentioned above, testing the read of vsyscall page by using bpf_probe_read_user{_str} and bpf_copy_from_user{_task}() as well. The test case passes the address of vsyscall page to these six helpers and checks whether the returned values are expected: 1) For bpf_probe_read_kernel{_str}()/bpf_probe_read{_str}(), the expected return value is -ERANGE as shown below: bpf_probe_read_kernel_common copy_from_kernel_nofault // false, return -ERANGE copy_from_kernel_nofault_allowed 2) For bpf_probe_read_user{_str}(), the expected return value is -EFAULT as show below: bpf_probe_read_user_common copy_from_user_nofault // false, return -EFAULT __access_ok 3) For bpf_copy_from_user(), the expected return value is -EFAULT: // return -EFAULT bpf_copy_from_user copy_from_user _copy_from_user // return false access_ok 4) For bpf_copy_from_user_task(), the expected return value is -EFAULT: // return -EFAULT bpf_copy_from_user_task access_process_vm // return 0 vma_lookup() // return 0 expand_stack() The occurrence of oops depends on the availability of CPU SMAP [1] feature and there are three possible configurations of vsyscall page in the boot cmd-line: vsyscall={xonly|none|emulate}, so there are a total of six possible combinations. Under all these combinations, the test case runs successfully. [1]: https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240202103935.3154011-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()Hou Tao
When trying to use copy_from_kernel_nofault() to read vsyscall page through a bpf program, the following oops was reported: BUG: unable to handle page fault for address: ffffffffff600000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...... RIP: 0010:copy_from_kernel_nofault+0x6f/0x110 ...... Call Trace: <TASK> ? copy_from_kernel_nofault+0x6f/0x110 bpf_probe_read_kernel+0x1d/0x50 bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d trace_call_bpf+0xc5/0x1c0 perf_call_bpf_enter.isra.0+0x69/0xb0 perf_syscall_enter+0x13e/0x200 syscall_trace_enter+0x188/0x1c0 do_syscall_64+0xb5/0xe0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> ...... ---[ end trace 0000000000000000 ]--- The oops is triggered when: 1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall page and invokes copy_from_kernel_nofault() which in turn calls __get_user_asm(). 2) Because the vsyscall page address is not readable from kernel space, a page fault exception is triggered accordingly. 3) handle_page_fault() considers the vsyscall page address as a user space address instead of a kernel space address. This results in the fix-up setup by bpf not being applied and a page_fault_oops() is invoked due to SMAP. Considering handle_page_fault() has already considered the vsyscall page address as a userspace address, fix the problem by disallowing vsyscall page read for copy_from_kernel_nofault(). Originally-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com Reported-by: xingwei lee <xrivendell7@gmail.com> Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240202103935.3154011-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.hHou Tao
Move is_vsyscall_vaddr() into asm/vsyscall.h to make it available for copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c. Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240202103935.3154011-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15Merge branch 'for-thermal-genetlink-family-bind-unbind-callbacks'Jakub Kicinski
Stanislaw Gruszka says: ==================== thermal/netlink/intel_hfi: Enable HFI feature only when required The patchset introduces a new genetlink family bind/unbind callbacks and thermal/netlink notifications, which allow drivers to send netlink multicast events based on the presence of actual user-space consumers. This functionality optimizes resource usage by allowing disabling of features when not needed. v1: https://lore.kernel.org/linux-pm/20240131120535.933424-1-stanislaw.gruszka@linux.intel.com// v2: https://lore.kernel.org/linux-pm/20240206133605.1518373-1-stanislaw.gruszka@linux.intel.com/ v3: https://lore.kernel.org/linux-pm/20240209120625.1775017-1-stanislaw.gruszka@linux.intel.com/ ==================== Link: https://lore.kernel.org/r/20240212161615.161935-1-stanislaw.gruszka@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15genetlink: Add per family bind/unbind callbacksStanislaw Gruszka
Add genetlink family bind()/unbind() callbacks when adding/removing multicast group to/from netlink client socket via setsockopt() or bind() syscall. They can be used to track if consumers of netlink multicast messages emerge or disappear. Thus, a client implementing callbacks, can now send events only when there are active consumers, preventing unnecessary work when none exist. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240212161615.161935-2-stanislaw.gruszka@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15configs/debug: add NET debug configMatthieu Baerts (NGI0)
The debug.config file is really great to easily enable a bunch of general debugging features on a CI-like setup. But it would be great to also include core networking debugging config. A few CI's validating features from the Net tree also enable a few other debugging options on top of debug.config. A small selection is quite generic for the whole net tree. They validate some assumptions in different parts of the core net tree. As suggested by Jakub Kicinski in [1], having them added to this debug.config file would help other CIs using network features to find bugs in this area. Note that the two REFCNT configs also select REF_TRACKER, which doesn't seem to be an issue. Link: https://lore.kernel.org/netdev/20240202093148.33bd2b14@kernel.org/T/ [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240212-kconfig-debug-enable-net-v1-1-fb026de8174c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-16zonefs: Improve error handlingDamien Le Moal
Write error handling is racy and can sometime lead to the error recovery path wrongly changing the inode size of a sequential zone file to an incorrect value which results in garbage data being readable at the end of a file. There are 2 problems: 1) zonefs_file_dio_write() updates a zone file write pointer offset after issuing a direct IO with iomap_dio_rw(). This update is done only if the IO succeed for synchronous direct writes. However, for asynchronous direct writes, the update is done without waiting for the IO completion so that the next asynchronous IO can be immediately issued. However, if an asynchronous IO completes with a failure right before the i_truncate_mutex lock protecting the update, the update may change the value of the inode write pointer offset that was corrected by the error path (zonefs_io_error() function). 2) zonefs_io_error() is called when a read or write error occurs. This function executes a report zone operation using the callback function zonefs_io_error_cb(), which does all the error recovery handling based on the current zone condition, write pointer position and according to the mount options being used. However, depending on the zoned device being used, a report zone callback may be executed in a context that is different from the context of __zonefs_io_error(). As a result, zonefs_io_error_cb() may be executed without the inode truncate mutex lock held, which can lead to invalid error processing. Fix both problems as follows: - Problem 1: Perform the inode write pointer offset update before a direct write is issued with iomap_dio_rw(). This is safe to do as partial direct writes are not supported (IOMAP_DIO_PARTIAL is not set) and any failed IO will trigger the execution of zonefs_io_error() which will correct the inode write pointer offset to reflect the current state of the one on the device. - Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error() and call this function directly from __zonefs_io_error() after obtaining the zone information using blkdev_report_zones() with a simple callback function that copies to a local stack variable the struct blk_zone obtained from the device. This ensures that error handling is performed holding the inode truncate mutex. This change also simplifies error handling for conventional zone files by bypassing the execution of report zones entirely. This is safe to do because the condition of conventional zones cannot be read-only or offline and conventional zone files are always fully mapped with a constant file size. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2024-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") 723de3ebef03 ("net: free altname using an RCU callback") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c bdd70eb68913 ("mptcp: drop the push_pending field") 28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15md: Don't suspend the array for interrupted reshapeYu Kuai
md_start_sync() will suspend the array if there are spares that can be added or removed from conf, however, if reshape is still in progress, this won't happen at all or data will be corrupted(remove_and_add_spares won't be called from md_choose_sync_action for reshape), hence there is no need to suspend the array if reshape is not done yet. Meanwhile, there is a potential deadlock for raid456: 1) reshape is interrupted; 2) set one of the disk WantReplacement, and add a new disk to the array, however, recovery won't start until the reshape is finished; 3) then issue an IO across reshpae position, this IO will wait for reshape to make progress; 4) continue to reshape, then md_start_sync() found there is a spare disk that can be added to conf, mddev_suspend() is called; Step 4 and step 3 is waiting for each other, deadlock triggered. Noted this problem is found by code review, and it's not reporduced yet. Fix this porblem by don't suspend the array for interrupted reshape, this is safe because conf won't be changed until reshape is done. Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-6-yukuai1@huaweicloud.com
2024-02-15md: Don't register sync_thread for reshape directlyYu Kuai
Currently, if reshape is interrupted, then reassemble the array will register sync_thread directly from pers->run(), in this case 'MD_RECOVERY_RUNNING' is set directly, however, there is no guarantee that md_do_sync() will be executed, hence stop_sync_thread() will hang because 'MD_RECOVERY_RUNNING' can't be cleared. Last patch make sure that md_do_sync() will set MD_RECOVERY_DONE, however, following hang can still be triggered by dm-raid test shell/lvconvert-raid-reshape.sh occasionally: [root@fedora ~]# cat /proc/1982/stack [<0>] stop_sync_thread+0x1ab/0x270 [md_mod] [<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod] [<0>] raid_presuspend+0x1e/0x70 [dm_raid] [<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod] [<0>] __dm_destroy+0x2a5/0x310 [dm_mod] [<0>] dm_destroy+0x16/0x30 [dm_mod] [<0>] dev_remove+0x165/0x290 [dm_mod] [<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod] [<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod] [<0>] vfs_ioctl+0x21/0x60 [<0>] __x64_sys_ioctl+0xb9/0xe0 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Meanwhile mddev->recovery is: MD_RECOVERY_RUNNING | MD_RECOVERY_INTR | MD_RECOVERY_RESHAPE | MD_RECOVERY_FROZEN Fix this problem by remove the code to register sync_thread directly from raid10 and raid5. And let md_check_recovery() to register sync_thread. Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-5-yukuai1@huaweicloud.com
2024-02-15md: Make sure md_do_sync() will set MD_RECOVERY_DONEYu Kuai
stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must set MD_RECOVERY_DONE, so that follow up md_check_recovery() will unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up stop_sync_thread(). If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will return without setting MD_RECOVERY_DONE, and after commit f52f5c71f3d4 ("md: fix stopping sync thread"), dm-raid switch from md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread from md_stop() and md_stop_writes(), causing the test shell/lvconvert-raid-reshape.sh hang. We shouldn't switch back to md_reap_sync_thread() because it's problematic in the first place. Fix the problem by making sure md_do_sync() will set MD_RECOVERY_DONE. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/ece2b06f-d647-6613-a534-ff4c9bec1142@redhat.com/ Fixes: d5d885fd514f ("md: introduce new personality funciton start()") Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-4-yukuai1@huaweicloud.com
2024-02-15md: Don't ignore read-only array in md_check_recovery()Yu Kuai
Usually if the array is not read-write, md_check_recovery() won't register new sync_thread in the first place. And if the array is read-write and sync_thread is registered, md_set_readonly() will unregister sync_thread before setting the array read-only. md/raid follow this behavior hence there is no problem. After commit f52f5c71f3d4 ("md: fix stopping sync thread"), following hang can be triggered by test shell/integrity-caching.sh: 1) array is read-only. dm-raid update super block: rs_update_sbs ro = mddev->ro mddev->ro = 0 -> set array read-write md_update_sb 2) register new sync thread concurrently. 3) dm-raid set array back to read-only: rs_update_sbs mddev->ro = ro 4) stop the array: raid_dtr md_stop stop_sync_thread set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_wakeup_thread_directly(mddev->sync_thread); wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 5) sync thread done: md_do_sync set_bit(MD_RECOVERY_DONE, &mddev->recovery); md_wakeup_thread(mddev->thread); 6) daemon thread can't unregister sync thread: md_check_recovery if (!md_is_rdwr(mddev) && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) return; -> -> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hang; The root cause is that dm-raid manipulate 'mddev->ro' by itself, however, dm-raid really should stop sync thread before setting the array read-only. Unfortunately, I need to read more code before I can refacter the handler of 'mddev->ro' in dm-raid, hence let's fix the problem the easy way for now to prevent dm-raid regression. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/9801e40-8ac7-e225-6a71-309dcf9dc9aa@redhat.com/ Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-3-yukuai1@huaweicloud.com
2024-02-15md: Don't ignore suspended array in md_check_recovery()Yu Kuai
mddev_suspend() never stop sync_thread, hence it doesn't make sense to ignore suspended array in md_check_recovery(), which might cause sync_thread can't be unregistered. After commit f52f5c71f3d4 ("md: fix stopping sync thread"), following hang can be triggered by test shell/integrity-caching.sh: 1) suspend the array: raid_postsuspend mddev_suspend 2) stop the array: raid_dtr md_stop __md_stop_writes stop_sync_thread set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_wakeup_thread_directly(mddev->sync_thread); wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 3) sync thread done: md_do_sync set_bit(MD_RECOVERY_DONE, &mddev->recovery); md_wakeup_thread(mddev->thread); 4) daemon thread can't unregister sync thread: md_check_recovery if (mddev->suspended) return; -> return directly md_read_sync_thread clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hang; This problem is not just related to dm-raid, fix it by ignoring suspended array in md_check_recovery(). And follow up patches will improve dm-raid better to frozen sync thread during suspend. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@redhat.com/ Fixes: 68866e425be2 ("MD: no sync IO while suspended") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-2-yukuai1@huaweicloud.com
2024-02-15bpf: Fix test verif_scale_strobemeta_subprogs failure due to llvm19Yonghong Song
With latest llvm19, I hit the following selftest failures with $ ./test_progs -j libbpf: prog 'on_event': BPF program load failed: Permission denied libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG -- combined stack size of 4 calls is 544. Too large verification time 1344153 usec stack depth 24+440+0+32 processed 51008 insns (limit 1000000) max_states_per_insn 19 total_states 1467 peak_states 303 mark_read 146 -- END PROG LOAD LOG -- libbpf: prog 'on_event': failed to load: -13 libbpf: failed to load object 'strobemeta_subprogs.bpf.o' scale_test:FAIL:expect_success unexpected error: -13 (errno 13) #498 verif_scale_strobemeta_subprogs:FAIL The verifier complains too big of the combined stack size (544 bytes) which exceeds the maximum stack limit 512. This is a regression from llvm19 ([1]). In the above error log, the original stack depth is 24+440+0+32. To satisfy interpreter's need, in verifier the stack depth is adjusted to 32+448+32+32=544 which exceeds 512, hence the error. The same adjusted stack size is also used for jit case. But the jitted codes could use smaller stack size. $ egrep -r stack_depth | grep round_up arm64/net/bpf_jit_comp.c: ctx->stack_size = round_up(prog->aux->stack_depth, 16); loongarch/net/bpf_jit.c: bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16); powerpc/net/bpf_jit_comp.c: cgctx.stack_size = round_up(fp->aux->stack_depth, 16); riscv/net/bpf_jit_comp32.c: round_up(ctx->prog->aux->stack_depth, STACK_ALIGN); riscv/net/bpf_jit_comp64.c: bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16); s390/net/bpf_jit_comp.c: u32 stack_depth = round_up(fp->aux->stack_depth, 8); sparc/net/bpf_jit_comp_64.c: stack_needed += round_up(stack_depth, 16); x86/net/bpf_jit_comp.c: EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8)); x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8); x86/net/bpf_jit_comp.c: round_up(stack_depth, 8)); x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8); x86/net/bpf_jit_comp.c: EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8)); In the above, STACK_ALIGN in riscv/net/bpf_jit_comp32.c is defined as 16. So stack is aligned in either 8 or 16, x86/s390 having 8-byte stack alignment and the rest having 16-byte alignment. This patch calculates total stack depth based on 16-byte alignment if jit is requested. For the above failing case, the new stack size will be 32+448+0+32=512 and no verification failure. llvm19 regression will be discussed separately in llvm upstream. The verifier change caused three test failures as these tests compared messages with stack size. More specifically, - test_global_funcs/global_func1: fail with interpreter mode and success with jit mode. Adjusted stack sizes so both jit and interpreter modes will fail. - async_stack_depth/{pseudo_call_check, async_call_root_check}: since jit and interpreter will calculate different stack sizes, the failure msg is adjusted to omit those specific stack size numbers. [1] https://lore.kernel.org/bpf/32bde0f0-1881-46c9-931a-673be566c61d@linux.dev/ Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240214232951.4113094-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15bpf: improve duplicate source code line detectionAndrii Nakryiko
Verifier log avoids printing the same source code line multiple times when a consecutive block of BPF assembly instructions are covered by the same original (C) source code line. This greatly improves verifier log legibility. Unfortunately, this check is imperfect and in production applications it quite often happens that verifier log will have multiple duplicated source lines emitted, for no apparently good reason. E.g., this is excerpt from a real-world BPF application (with register states omitted for clarity): BEFORE ====== ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5369: (07) r8 += 2 ; 5370: (07) r7 += 16 ; ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5371: (07) r9 += 1 ; 5372: (79) r4 = *(u64 *)(r10 -32) ; ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5373: (55) if r9 != 0xf goto pc+2 ; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396 5376: (79) r1 = *(u64 *)(r10 -40) ; 5377: (79) r1 = *(u64 *)(r1 +8) ; ; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396 5378: (dd) if r1 s<= r9 goto pc-5 ; ; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398 5379: (b4) w1 = 0 ; 5380: (6b) *(u16 *)(r8 -30) = r1 ; ; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400 5381: (79) r3 = *(u64 *)(r7 -8) ; 5382: (7b) *(u64 *)(r10 -24) = r6 ; ; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400 5383: (bc) w6 = w6 ; ; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280 5384: (bf) r2 = r6 ; 5385: (bf) r1 = r4 ; As can be seen, line 394 is emitted thrice, 396 is emitted twice, and line 400 is duplicated as well. Note that there are no intermingling other lines of source code in between these duplicates, so the issue is not compiler reordering assembly instruction such that multiple original source code lines are in effect. It becomes more obvious what's going on if we look at *full* original line info information (using btfdump for this, [0]): #2764: line: insn #5363 --> 394:3 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2765: line: insn #5373 --> 394:21 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2766: line: insn #5375 --> 394:47 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2767: line: insn #5377 --> 394:3 @ ./././strobemeta_probe.bpf.c for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { #2768: line: insn #5378 --> 414:10 @ ./././strobemeta_probe.bpf.c return off; We can see that there are four line info records covering instructions #5363 through #5377 (instruction indices are shifted due to subprog instruction being appended to main program), all of them are pointing to the same C source code line #394. But each of them points to a different part of that line, which is denoted by differing column numbers (3, 21, 47, 3). But verifier log doesn't distinguish between parts of the same source code line and doesn't emit this column number information, so for end user it's just a repetitive visual noise. So let's improve the detection of repeated source code line and avoid this. With the changes in this patch, we get this output for the same piece of BPF program log: AFTER ===== ; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394 5369: (07) r8 += 2 ; 5370: (07) r7 += 16 ; 5371: (07) r9 += 1 ; 5372: (79) r4 = *(u64 *)(r10 -32) ; 5373: (55) if r9 != 0xf goto pc+2 ; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396 5376: (79) r1 = *(u64 *)(r10 -40) ; 5377: (79) r1 = *(u64 *)(r1 +8) ; 5378: (dd) if r1 s<= r9 goto pc-5 ; ; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398 5379: (b4) w1 = 0 ; 5380: (6b) *(u16 *)(r8 -30) = r1 ; ; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400 5381: (79) r3 = *(u64 *)(r7 -8) ; 5382: (7b) *(u64 *)(r10 -24) = r6 ; 5383: (bc) w6 = w6 ; ; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280 5384: (bf) r2 = r6 ; 5385: (bf) r1 = r4 ; All the duplication is gone and the log is cleaner and less distracting. [0] https://github.com/anakryiko/btfdump Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240214174100.2847419-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-15scsi: jazz_esp: Only build if SCSI core is builtinRandy Dunlap
JAZZ_ESP is a bool kconfig symbol that selects SCSI_SPI_ATTRS. When CONFIG_SCSI=m, this results in SCSI_SPI_ATTRS=m while JAZZ_ESP=y, which causes many undefined symbol linker errors. Fix this by only offering to build this driver when CONFIG_SCSI=y. [mkp: JAZZ_ESP is unique in that it does not support being compiled as a module unlike the remaining SPI SCSI HBA drivers] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240214055953.9612-1-rdunlap@infradead.org Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: linux-mips@vger.kernel.org Cc: Arnd Bergmann <arnd@arndb.de> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: James E.J. Bottomley <jejb@linux.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org Cc: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402112222.Gl0udKyU-lkp@intel.com/ Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-02-15scsi: smartpqi: Fix disable_managed_interruptsDon Brace
Correct blk-mq registration issue with module parameter disable_managed_interrupts enabled. When we turn off the default PCI_IRQ_AFFINITY flag, the driver needs to register with blk-mq using blk_mq_map_queues(). The driver is currently calling blk_mq_pci_map_queues() which results in a stack trace and possibly undefined behavior. Stack Trace: [ 7.860089] scsi host2: smartpqi [ 7.871934] WARNING: CPU: 0 PID: 238 at block/blk-mq-pci.c:52 blk_mq_pci_map_queues+0xca/0xd0 [ 7.889231] Modules linked in: sd_mod t10_pi sg uas smartpqi(+) crc32c_intel scsi_transport_sas usb_storage dm_mirror dm_region_hash dm_log dm_mod ipmi_devintf ipmi_msghandler fuse [ 7.924755] CPU: 0 PID: 238 Comm: kworker/0:3 Not tainted 4.18.0-372.88.1.el8_6_smartpqi_test.x86_64 #1 [ 7.944336] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 03/08/2022 [ 7.963026] Workqueue: events work_for_cpu_fn [ 7.978275] RIP: 0010:blk_mq_pci_map_queues+0xca/0xd0 [ 7.978278] Code: 48 89 de 89 c7 e8 f6 0f 4f 00 3b 05 c4 b7 8e 01 72 e1 5b 31 c0 5d 41 5c 41 5d 41 5e 41 5f e9 7d df 73 00 31 c0 e9 76 df 73 00 <0f> 0b eb bc 90 90 0f 1f 44 00 00 41 57 49 89 ff 41 56 41 55 41 54 [ 7.978280] RSP: 0018:ffffa95fc3707d50 EFLAGS: 00010216 [ 7.978283] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000010 [ 7.978284] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff9190c32d4310 [ 7.978286] RBP: 0000000000000000 R08: ffffa95fc3707d38 R09: ffff91929b81ac00 [ 7.978287] R10: 0000000000000001 R11: ffffa95fc3707ac0 R12: 0000000000000000 [ 7.978288] R13: ffff9190c32d4000 R14: 00000000ffffffff R15: ffff9190c4c950a8 [ 7.978290] FS: 0000000000000000(0000) GS:ffff9193efc00000(0000) knlGS:0000000000000000 [ 7.978292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8.172814] CR2: 000055d11166c000 CR3: 00000002dae10002 CR4: 00000000007706f0 [ 8.172816] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8.172817] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8.172818] PKRU: 55555554 [ 8.172819] Call Trace: [ 8.172823] blk_mq_alloc_tag_set+0x12e/0x310 [ 8.264339] scsi_add_host_with_dma.cold.9+0x30/0x245 [ 8.279302] pqi_ctrl_init+0xacf/0xc8e [smartpqi] [ 8.294085] ? pqi_pci_probe+0x480/0x4c8 [smartpqi] [ 8.309015] pqi_pci_probe+0x480/0x4c8 [smartpqi] [ 8.323286] local_pci_probe+0x42/0x80 [ 8.337855] work_for_cpu_fn+0x16/0x20 [ 8.351193] process_one_work+0x1a7/0x360 [ 8.364462] ? create_worker+0x1a0/0x1a0 [ 8.379252] worker_thread+0x1ce/0x390 [ 8.392623] ? create_worker+0x1a0/0x1a0 [ 8.406295] kthread+0x10a/0x120 [ 8.418428] ? set_kthread_struct+0x50/0x50 [ 8.431532] ret_from_fork+0x1f/0x40 [ 8.444137] ---[ end trace 1bf0173d39354506 ]--- Fixes: cf15c3e734e8 ("scsi: smartpqi: Add module param to disable managed ints") Tested-by: Yogesh Chandra Pandey <YogeshChandra.Pandey@microchip.com> Reviewed-by: Scott Benesh <scott.benesh@microchip.com> Reviewed-by: Scott Teel <scott.teel@microchip.com> Reviewed-by: Mahesh Rajashekhara <mahesh.rajashekhara@microchip.com> Reviewed-by: Mike McGowen <mike.mcgowen@microchip.com> Reviewed-by: Kevin Barnett <kevin.barnett@microchip.com> Signed-off-by: Don Brace <don.brace@microchip.com> Link: https://lore.kernel.org/r/20240213162200.1875970-2-don.brace@microchip.com Reviewed-by: Tomas Henzl <thenzl@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-02-15scsi: ufs: Uninitialized variable in ufshcd_devfreq_target()Dan Carpenter
There is one goto where "sched_clk_scaling_suspend_work" is true but "scale_up" is uninitialized. It leads to a Smatch uninitialized variable warning: drivers/ufs/core/ufshcd.c:1589 ufshcd_devfreq_target() error: uninitialized symbol 'scale_up'. Fixes: 1d969731b87f ("scsi: ufs: core: Only suspend clock scaling if scaling down") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/c787d37f-1107-4512-8991-bccf80e74a35@moroto.mountain Reviewed-by: Peter Wang <peter.wang@mediatek.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-02-15scsi: target: pscsi: Fix bio_put() for error caseNaohiro Aota
As of commit 066ff571011d ("block: turn bio_kmalloc into a simple kmalloc wrapper"), a bio allocated by bio_kmalloc() must be freed by bio_uninit() and kfree(). That is not done properly for the error case, hitting WARN and NULL pointer dereference in bio_free(). Fixes: 066ff571011d ("block: turn bio_kmalloc into a simple kmalloc wrapper") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://lore.kernel.org/r/20240214144356.101814-1-naohiro.aota@wdc.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-02-15scsi: core: Consult supported VPD page list prior to fetching pageMartin K. Petersen
Commit c92a6b5d6335 ("scsi: core: Query VPD size before getting full page") removed the logic which checks whether a VPD page is present on the supported pages list before asking for the page itself. That was done because SPC helpfully states "The Supported VPD Pages VPD page list may or may not include all the VPD pages that are able to be returned by the device server". Testing had revealed a few devices that supported some of the 0xBn pages but didn't actually list them in page 0. Julian Sikorski bisected a problem with his drive resetting during discovery to the commit above. As it turns out, this particular drive firmware will crash if we attempt to fetch page 0xB9. Various approaches were attempted to work around this. In the end, reinstating the logic that consults VPD page 0 before fetching any other page was the path of least resistance. A firmware update for the devices which originally compelled us to remove the check has since been released. Link: https://lore.kernel.org/r/20240214221411.2888112-1-martin.petersen@oracle.com Fixes: c92a6b5d6335 ("scsi: core: Query VPD size before getting full page") Cc: stable@vger.kernel.org Cc: Bart Van Assche <bvanassche@acm.org> Reported-by: Julian Sikorski <belegdol@gmail.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Reviewed-by: Lee Duncan <lee.duncan@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>