summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-03-05net: gro: rename skb_gro_header_hard()Eric Dumazet
skb_gro_header_hard() is renamed to skb_gro_may_pull() to match the convention used by common helpers like pskb_may_pull(). This means the condition is inverted: if (skb_gro_header_hard(skb, hlen)) slow_path(); becomes: if (!skb_gro_may_pull(skb, hlen)) slow_path(); Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05Merge branch 'mt7530-dsa-subdriver-improvements-act-iii'Paolo Abeni
says: ==================== MT7530 DSA Subdriver Improvements Act III This is the third patch series with the goal of simplifying the MT7530 DSA subdriver and improving support for MT7530, MT7531, and the switch on the MT7988 SoC. I have done a simple ping test to confirm basic communication on all switch ports on MCM and standalone MT7530, and MT7531 switch with this patch series applied. MT7621 Unielec, MCM MT7530: rgmii-only-gmac0-mt7621-unielec-u7621-06-16m.dtb gmac0-and-gmac1-mt7621-unielec-u7621-06-16m.dtb tftpboot 0x80008000 mips-uzImage.bin; tftpboot 0x83000000 mips-rootfs.cpio.uboot; tftpboot 0x83f00000 $dtb; bootm 0x80008000 0x83000000 0x83f00000 MT7622 Bananapi, MT7531: gmac0-and-gmac1-mt7622-bananapi-bpi-r64.dtb tftpboot 0x40000000 arm64-Image; tftpboot 0x45000000 arm64-rootfs.cpio.uboot; tftpboot 0x4a000000 $dtb; booti 0x40000000 0x45000000 0x4a000000 MT7623 Bananapi, standalone MT7530: rgmii-only-gmac0-mt7623n-bananapi-bpi-r2.dtb gmac0-and-gmac1-mt7623n-bananapi-bpi-r2.dtb tftpboot 0x80008000 arm-zImage; tftpboot 0x83000000 arm-rootfs.cpio.uboot; tftpboot 0x83f00000 $dtb; bootz 0x80008000 0x83000000 0x83f00000 This patch series is the continuation of the patch series linked below. https://lore.kernel.org/r/20230522121532.86610-1-arinc.unal@arinc9.com Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> --- Changes in v3: - Patch 8 - Explain properly the behaviour of setting link down on all ports at setup. - Split the changes for simplifying the link settings operations out to another patch. - Link to v2: https://lore.kernel.org/r/20240216-for-netnext-mt7530-improvements-3-v2-0-094cae3ff23b@arinc9.com Changes in v2: - Patch 8 - Use a single mt7530_rmw() instead of two mt7530_clear() and mt7530_set() commands. - Link to v1: https://lore.kernel.org/r/20240208-for-netnext-mt7530-improvements-3-v1-0-d7c1cfd502ca@arinc9.com ==================== Link: https://lore.kernel.org/r/20240301-for-netnext-mt7530-improvements-3-v3-0-449f4f166454@arinc9.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: simplify link operationsArınç ÜNAL
The "MT7621 Giga Switch Programming Guide v0.3", "MT7531 Reference Manual for Development Board v1.0", and "MT7988A Wi-Fi 7 Generation Router Platform: Datasheet (Open Version) v0.1" documents show that these bits are enabled at reset: PMCR_IFG_XMIT(1) (not part of PMCR_LINK_SETTINGS_MASK) PMCR_MAC_MODE (not part of PMCR_LINK_SETTINGS_MASK) PMCR_TX_EN PMCR_RX_EN PMCR_BACKOFF_EN (not part of PMCR_LINK_SETTINGS_MASK) PMCR_BACKPR_EN (not part of PMCR_LINK_SETTINGS_MASK) PMCR_TX_FC_EN PMCR_RX_FC_EN These bits also don't exist on the MT7530_PMCR_P(6) register of the switch on the MT7988 SoC: PMCR_IFG_XMIT() PMCR_MAC_MODE PMCR_BACKOFF_EN PMCR_BACKPR_EN Remove the setting of the bits not part of PMCR_LINK_SETTINGS_MASK on phylink_mac_config as they're already set. The bit for setting the port on force mode is already done on mt7530_setup() and mt7531_setup_common(). So get rid of PMCR_FORCE_MODE_ID() which helped determine which bit to use for the switch model. Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: sort link settings ops and force link down on all portsArınç ÜNAL
port_enable and port_disable clears the link settings. Move that to mt7530_setup() and mt7531_setup_common() which set up the switches. This way, the link settings are cleared on all ports at setup, and then only once with phylink_mac_link_down() when a link goes down. Enable force mode at setup to apply the force part of the link settings. This ensures that disabled ports will have their link down. Suggested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: put initialising PCS devices code back to original orderArınç ÜNAL
The commit fae463084032 ("net: dsa: mt753x: fix pcs conversion regression") fixes regression caused by cpu_port_config manually calling phylink operations. cpu_port_config was deemed useless and was removed. Therefore, put initialising PCS devices code back to its original order. Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: get rid of mt753x_mac_config()Arınç ÜNAL
There is no need for a separate function to call priv->info->mac_port_config(). Call it from mt753x_phylink_mac_config() instead and remove mt753x_mac_config(). Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: get rid of priv->info->cpu_port_config()Arınç ÜNAL
priv->info->cpu_port_config() is used for MT7531 and the switch on the MT7988 SoC. It sets up the ports described as a CPU port earlier than the phylink code path would do. This function is useless as: - Configuring the MACs can be done from the phylink_mac_config code path instead. - All the link configuration it does on the CPU ports are later undone with the port_enable, phylink_mac_config, and then phylink_mac_link_up code path [1]. priv->p5_interface and priv->p6_interface were being used to prevent configuring the MACs from the phylink_mac_config code path. Remove them now that they hold no purpose. Remove priv->info->cpu_port_config(). On mt753x_phylink_mac_config, switch to if statements to simplify the code. Remove the overwriting of the speed and duplex interfaces for certain interface modes. Phylink already provides the speed and duplex variables with proper values. Phylink already sets the max speed of TRGMII to SPEED_1000. Add SPEED_2500 for PHY_INTERFACE_MODE_2500BASEX to where the speed and EEE bits are set instead. On the switch on the MT7988 SoC, PHY_INTERFACE_MODE_INTERNAL is being used to describe the interface mode of the 10G MAC, which is of port 6. On mt7988_cpu_port_config() PMCR_FORCE_SPEED_1000 was set via the PMCR_CPU_PORT_SETTING() mask. Add SPEED_10000 case to where the speed bits are set to cover this. No need to add it to where the EEE bits are set as the "MT7988A Wi-Fi 7 Generation Router Platform: Datasheet (Open Version) v0.1" document shows that these bits don't exist on the MT7530_PMCR_P(6) register. Remove the definition of PMCR_CPU_PORT_SETTING() now that it holds no purpose. Change mt753x_cpu_port_enable() to void now that there're no error cases left. Link: https://lore.kernel.org/netdev/ZHy2jQLesdYFMQtO@shell.armlinux.org.uk/ [1] Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: get rid of useless error returns on phylink code pathArınç ÜNAL
Remove error returns on the cases where they are already handled with the function the mac_port_get_caps member in mt753x_table points to. mt7531_mac_config() is also called from mt7531_cpu_port_config() outside of phylink but the port and interface modes are already handled there. Change the functions and the mac_port_config function pointer to void now that there're no error returns anymore. Remove mt753x_is_mac_port() that used to help the said error returns. On mt7531_mac_config(), switch to if statements to simplify the code. Remove internal phy cases from mt753x_phylink_mac_config(), there is no need to check the interface mode as that's already handled with the function the mac_port_get_caps member in mt753x_table points to. Acked-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: do not use SW_PHY_RST to reset MT7531 switchArınç ÜNAL
According to the document MT7531 Reference Manual for Development Board v1.0, the SW_PHY_RST bit on the SYS_CTRL register doesn't exist for MT7531. This is likely why forcing link down on all ports is necessary for MT7531. Therefore, do not set SW_PHY_RST on mt7531_setup(). Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: set interrupt register only for MT7530Arınç ÜNAL
Setting this register related to interrupts is only needed for the MT7530 switch. Make an exclusive check to ensure this. Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Acked-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: dsa: mt7530: remove .mac_port_config for MT7988 and make it optionalArınç ÜNAL
For the switch on the MT7988 SoC, the mac_port_config member for ID_MT7988 in mt753x_table is not needed as the interfaces of all MACs are already handled on mt7988_mac_port_get_caps(). Therefore, remove the mac_port_config member from ID_MT7988 in mt753x_table. Before calling priv->info->mac_port_config(), if there's no mac_port_config member in mt753x_table, exit mt753x_mac_config() successfully. Remove calling priv->info->mac_port_config() from the sanity check as the sanity check requires a pointer to a mac_port_config function to be non-NULL. This will fail for MT7988 as mac_port_config won't be a member of its info table. Co-developed-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05Merge branch 'remove-page-frag-implementation-in-vhost_net'Paolo Abeni
Yunsheng Lin says: ==================== remove page frag implementation in vhost_net Currently there are three implementations for page frag: 1. mm/page_alloc.c: net stack seems to be using it in the rx part with 'struct page_frag_cache' and the main API being page_frag_alloc_align(). 2. net/core/sock.c: net stack seems to be using it in the tx part with 'struct page_frag' and the main API being skb_page_frag_refill(). 3. drivers/vhost/net.c: vhost seems to be using it to build xdp frame, and it's implementation seems to be a mix of the above two. This patchset tries to unfiy the page frag implementation a little bit by unifying gfp bit for order 3 page allocation and replacing page frag implementation in vhost.c with the one in page_alloc.c. After this patchset, we are not only able to unify the page frag implementation a little, but also able to have about 0.5% performance boost testing by using the vhost_net_test introduced in the last patch. Before this patchset: Performance counter stats for './vhost_net_test' (10 runs): 305325.78 msec task-clock # 1.738 CPUs utilized ( +- 0.12% ) 1048668 context-switches # 3.435 K/sec ( +- 0.00% ) 11 cpu-migrations # 0.036 /sec ( +- 17.64% ) 33 page-faults # 0.108 /sec ( +- 0.49% ) 244651819491 cycles # 0.801 GHz ( +- 0.43% ) (64) 64714638024 stalled-cycles-frontend # 26.45% frontend cycles idle ( +- 2.19% ) (67) 30774313491 stalled-cycles-backend # 12.58% backend cycles idle ( +- 7.68% ) (70) 201749748680 instructions # 0.82 insn per cycle # 0.32 stalled cycles per insn ( +- 0.41% ) (66.76%) 65494787909 branches # 214.508 M/sec ( +- 0.35% ) (64) 4284111313 branch-misses # 6.54% of all branches ( +- 0.45% ) (66) 175.699 +- 0.189 seconds time elapsed ( +- 0.11% ) After this patchset: Performance counter stats for './vhost_net_test' (10 runs): 303974.38 msec task-clock # 1.739 CPUs utilized ( +- 0.14% ) 1048807 context-switches # 3.450 K/sec ( +- 0.00% ) 14 cpu-migrations # 0.046 /sec ( +- 12.86% ) 33 page-faults # 0.109 /sec ( +- 0.46% ) 251289376347 cycles # 0.827 GHz ( +- 0.32% ) (60) 67885175415 stalled-cycles-frontend # 27.01% frontend cycles idle ( +- 0.48% ) (63) 27809282600 stalled-cycles-backend # 11.07% backend cycles idle ( +- 0.36% ) (71) 195543234672 instructions # 0.78 insn per cycle # 0.35 stalled cycles per insn ( +- 0.29% ) (69.04%) 62423183552 branches # 205.357 M/sec ( +- 0.48% ) (67) 4135666632 branch-misses # 6.63% of all branches ( +- 0.63% ) (67) 174.764 +- 0.214 seconds time elapsed ( +- 0.12% ) Changelog: V6: Add timeout for poll() and simplify some logic as suggested by Jason. V5: Address the comment from jason in vhost_net_test.c and the comment about leaving out the gfp change for page frag in sock.c as suggested by Paolo. V4: Resend based on latest net-next branch. V3: 1. Add __page_frag_alloc_align() which is passed with the align mask the original function expected as suggested by Alexander. 2. Drop patch 3 in v2 suggested by Alexander. 3. Reorder patch 4 & 5 in v2 suggested by Alexander. Note that placing this gfp flags handing for order 3 page in an inline function is not considered, as we may be able to unify the page_frag and page_frag_cache handling. V2: Change 'xor'd' to 'masked off', add vhost tx testing for vhost_net_test. V1: Fix some typo, drop RFC tag and rebase on latest net-next. ==================== Link: https://lore.kernel.org/r/20240228093013.8263-1-linyunsheng@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05tools: virtio: introduce vhost_net_testYunsheng Lin
introduce vhost_net_test for both vhost_net tx and rx basing on virtio_test to test vhost_net changing in the kernel. Steps for vhost_net tx testing: 1. Prepare a out buf. 2. Kick the vhost_net to do tx processing. 3. Do the receiving in the tun side. 4. verify the data received by tun is correct. Steps for vhost_net rx testing: 1. Prepare a in buf. 2. Do the sending in the tun side. 3. Kick the vhost_net to do rx processing. 4. verify the data received by vhost_net is correct. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05vhost/net: remove vhost_net_page_frag_refill()Yunsheng Lin
The page frag in vhost_net_page_frag_refill() uses the 'struct page_frag' from skb_page_frag_refill(), but it's implementation is similar to page_frag_alloc_align() now. This patch removes vhost_net_page_frag_refill() by using 'struct page_frag_cache' instead of 'struct page_frag', and allocating frag using page_frag_alloc_align(). The added benefit is that not only unifying the page frag implementation a little, but also having about 0.5% performance boost testing by using the vhost_net_test introduced in the last patch. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: introduce page_frag_cache_drain()Yunsheng Lin
When draining a page_frag_cache, most user are doing the similar steps, so introduce an API to avoid code duplication. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05page_frag: unify gfp bits for order 3 page allocationYunsheng Lin
Currently there seems to be three page frag implementations which all try to allocate order 3 page, if that fails, it then fail back to allocate order 0 page, and each of them all allow order 3 page allocation to fail under certain condition by using specific gfp bits. The gfp bits for order 3 page allocation are different between different implementation, __GFP_NOMEMALLOC is or'd to forbid access to emergency reserves memory for __page_frag_cache_refill(), but it is not or'd in other implementions, __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in vhost_net_page_frag_refill(), but it is not masked off in __page_frag_cache_refill(). This patch unifies the gfp bits used between different implementions by or'ing __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid possible pressure for mm. Leave the gfp unifying for page frag implementation in sock.c for now as suggested by Paolo Abeni. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05mm/page_alloc: modify page_frag_alloc_align() to accept align as an argumentYunsheng Lin
napi_alloc_frag_align() and netdev_alloc_frag_align() accept align as an argument, and they are thin wrappers around the __napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs doing the alignment checking and align mask conversion, in order to call page_frag_alloc_align() directly. The intention here is to keep the alignment checking and the alignmask conversion in in-line wrapper to avoid those kind of operations during execution time since it can usually be handled during compile time. We are going to use page_frag_alloc_align() in vhost_net.c, it need the same kind of alignment checking and alignmask conversion, so split up page_frag_alloc_align into an inline wrapper doing the above operation, and add __page_frag_alloc_align() which is passed with the align mask the original function expected as suggested by Alexander. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: txgbe: fix to clear interrupt status after handling IRQJiawen Wu
GPIO EOI is not set to clear interrupt status after handling the interrupt. It should be done in irq_chip->irq_ack, but this function is not called in handle_nested_irq(). So executing function txgbe_gpio_irq_ack() manually in txgbe_gpio_irq_handler(). Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://lore.kernel.org/r/20240301092956.18544-2-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05net: txgbe: fix GPIO interrupt blockingJiawen Wu
The register of GPIO interrupt status is masked before MAC IRQ is enabled. This is because of hardware deficiency. So manually clear the interrupt status before using them. Otherwise, GPIO interrupts will never be reported again. There is a workaround for clearing interrupts to set GPIO EOI in txgbe_up_complete(). Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://lore.kernel.org/r/20240301092956.18544-1-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-05slab: remove PARTIAL_NODE slab_stateChengming Zhou
The PARTIAL_NODE slab_state has gone with SLAB removed, so just remove it. Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-03-05string: Convert helpers selftest to KUnitKees Cook
Convert test-string_helpers.c to KUnit so it can be easily run with everything else. Failure reporting doesn't need to be open-coded in most places, for example, forcing a failure in the expected output for upper/lower testing looks like this: [12:18:43] # test_upper_lower: EXPECTATION FAILED at lib/string_helpers_kunit.c:579 [12:18:43] Expected dst == strings_upper[i].out, but [12:18:43] dst == "ABCDEFGH1234567890TEST" [12:18:43] strings_upper[i].out == "ABCDEFGH1234567890TeST" [12:18:43] [FAILED] test_upper_lower Currently passes without problems: $ ./tools/testing/kunit/kunit.py run string_helpers ... [12:23:55] Starting KUnit Kernel (1/1)... [12:23:55] ============================================================ [12:23:55] =============== string_helpers (3 subtests) ================ [12:23:55] [PASSED] test_get_size [12:23:55] [PASSED] test_upper_lower [12:23:55] [PASSED] test_unescape [12:23:55] ================= [PASSED] string_helpers ================== [12:23:55] ============================================================ [12:23:55] Testing complete. Ran 3 tests: passed: 3 [12:23:55] Elapsed time: 6.709s total, 0.001s configuring, 6.591s building, 0.066s running Link: https://lore.kernel.org/r/20240301202732.2688342-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-05string: Convert selftest to KUnitKees Cook
Convert test_string.c to KUnit so it can be easily run with everything else. Additional text context is retained for failure reporting. For example, when forcing a bad match, we can see the loop counters reported for the memset() tests: [09:21:52] # test_memset64: ASSERTION FAILED at lib/string_kunit.c:93 [09:21:52] Expected v == 0xa2a1a1a1a1a1a1a1ULL, but [09:21:52] v == -6799976246779207263 (0xa1a1a1a1a1a1a1a1) [09:21:52] 0xa2a1a1a1a1a1a1a1ULL == -6727918652741279327 (0xa2a1a1a1a1a1a1a1) [09:21:52] i:0 j:0 k:0 [09:21:52] [FAILED] test_memset64 Currently passes without problems: $ ./tools/testing/kunit/kunit.py run string ... [09:37:40] Starting KUnit Kernel (1/1)... [09:37:40] ============================================================ [09:37:40] =================== string (6 subtests) ==================== [09:37:40] [PASSED] test_memset16 [09:37:40] [PASSED] test_memset32 [09:37:40] [PASSED] test_memset64 [09:37:40] [PASSED] test_strchr [09:37:40] [PASSED] test_strnchr [09:37:40] [PASSED] test_strspn [09:37:40] ===================== [PASSED] string ====================== [09:37:40] ============================================================ [09:37:40] Testing complete. Ran 6 tests: passed: 6 [09:37:40] Elapsed time: 6.730s total, 0.001s configuring, 6.562s building, 0.131s running Link: https://lore.kernel.org/r/20240301202732.2688342-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-03-05xfrm: set skb control buffer based on packet offload as wellMike Yu
In packet offload, packets are not encrypted in XFRM stack, so the next network layer which the packets will be forwarded to should depend on where the packet came from (either xfrm4_output or xfrm6_output) rather than the matched SA's family type. Test: verified IPv6-in-IPv4 packets on Android device with IPsec packet offload enabled Signed-off-by: Mike Yu <yumike@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-03-05xfrm: fix xfrm child route lookup for packet offloadMike Yu
In current code, xfrm_bundle_create() always uses the matched SA's family type to look up a xfrm child route for the skb. The route returned by xfrm_dst_lookup() will eventually be used in xfrm_output_resume() (skb_dst(skb)->ops->local_out()). If packet offload is used, the above behavior can lead to calling ip_local_out() for an IPv6 packet or calling ip6_local_out() for an IPv4 packet, which is likely to fail. This change fixes the behavior by checking if the matched SA has packet offload enabled. If not, keep the same behavior; if yes, use the matched SP's family type for the lookup. Test: verified IPv6-in-IPv4 packets on Android device with IPsec packet offload enabled Signed-off-by: Mike Yu <yumike@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-03-05fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversionBart Van Assche
The first kiocb_set_cancel_fn() argument may point at a struct kiocb that is not embedded inside struct aio_kiocb. With the current code, depending on the compiler, the req->ki_ctx read happens either before the IOCB_AIO_RW test or after that test. Move the req->ki_ctx read such that it is guaranteed that the IOCB_AIO_RW test happens first. Reported-by: Eric Biggers <ebiggers@kernel.org> Cc: Benjamin LaHaise <ben@communityfibre.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Fixes: b820de741ae4 ("fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240304235715.3790858-1-bvanassche@acm.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-05drm/i915: Don't explode when the dig port we don't have an AUX CHVille Syrjälä
The icl+ power well code currently assumes that every AUX power well maps to an encoder which is using said power well. That is by no menas guaranteed as we: - only register encoders for ports declared in the VBT - combo PHY HDMI-only encoder no longer get an AUX CH since commit 9856308c94ca ("drm/i915: Only populate aux_ch if really needed") However we have places such as intel_power_domains_sanitize_state() that blindly traverse all the possible power wells. So these bits of code may very well encounbter an aux power well with no associated encoder. In this particular case the BIOS seems to have left one AUX power well enabled even though we're dealing with a HDMI only encoder on a combo PHY. We then proceed to turn off said power well and explode when we can't find a matching encoder. As a short term fix we should be able to just skip the PHY related parts of the power well programming since we know this situation can only happen with combo PHYs. Another option might be to go back to always picking an AUX CH for all encoders. However I'm a bit wary about that since we might in theory end up conflicting with the VBT AUX CH assignment. Also that wouldn't help with encoders not declared in the VBT, should we ever need to poke the corresponding power wells. Longer term we need to figure out what the actual relationship is between the PHY vs. AUX CH vs. AUX power well. Currently this is entirely unclear. Cc: stable@vger.kernel.org Fixes: 9856308c94ca ("drm/i915: Only populate aux_ch if really needed") Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10184 Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240223203216.15210-1-ville.syrjala@linux.intel.com Reviewed-by: Imre Deak <imre.deak@intel.com> (cherry picked from commit 6a8c66bf0e565c34ad0a18f820e0bb17951f7f91) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2024-03-05platform/x86/amd/pmf: Fix missing error code in amd_pmf_init_smart_pc()Harshit Mogalapalli
On the error path, assign -ENOMEM to ret when memory allocation of "dev->prev_data" fails. Fixes: e70961505808 ("platform/x86/amd/pmf: Fixup error handling for amd_pmf_init_smart_pc()") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240226144011.2100804-1-harshit.m.mogalapalli@oracle.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-03-05platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BARHans de Goede
On Goldmont p2sb_bar() only ever gets called for 2 devices, the actual P2SB devfn 13,0 and the SPI controller which is part of the P2SB, devfn 13,2. But the current p2sb code tries to cache BAR0 info for all of devfn 13,0 to 13,7 . This involves calling pci_scan_single_device() for device 13 functions 0-7 and the hw does not seem to like pci_scan_single_device() getting called for some of the other hidden devices. E.g. on an ASUS VivoBook D540NV-GQ065T this leads to continuous ACPI errors leading to high CPU usage. Fix this by only caching BAR0 info and thus only calling pci_scan_single_device() for the P2SB and the SPI controller. Fixes: 5913320eb0b3 ("platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe") Reported-by: Danil Rybakov <danilrybakov249@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218531 Tested-by: Danil Rybakov <danilrybakov249@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20240304134356.305375-2-hdegoede@redhat.com
2024-03-05ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBookAndy Chi
The HP EliteBook using ALC236 codec which using 0x02 to control mute LED and 0x01 to control micmute LED. Therefore, add a quirk to make it works. Signed-off-by: Andy Chi <andy.chi@canonical.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20240304134033.773348-1-andy.chi@canonical.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-03-05Revert "fs/aio: Make io_cancel() generate completions again"Bart Van Assche
Patch "fs/aio: Make io_cancel() generate completions again" is based on the assumption that calling kiocb->ki_cancel() does not complete R/W requests. This is incorrect: the two drivers that call kiocb_set_cancel_fn() callers set a cancellation function that calls usb_ep_dequeue(). According to its documentation, usb_ep_dequeue() calls the completion routine with status -ECONNRESET. Hence this revert. Cc: Benjamin LaHaise <ben@communityfibre.ca> Cc: Eric Biggers <ebiggers@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Reported-by: syzbot+b91eb2ed18f599dd3c31@syzkaller.appspotmail.com Fixes: 54cbc058d86b ("fs/aio: Make io_cancel() generate completions again") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240304182945.3646109-1-bvanassche@acm.org Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-05Merge tag 'drm-intel-fixes-2024-03-01' of ↵Daniel Vetter
https://anongit.freedesktop.org/git/drm/drm-intel into drm-fixes - Fix to extract HDCP information from primary connector - Check for NULL mmu_interval_notifier before removing Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/ZeGOUTfiA0_FNKLg@jlahtine-mobl.ger.corp.intel.com
2024-03-05MAINTAINERS: Update email address for Tvrtko UrsulinTvrtko Ursulin
I will lose access to my @.*intel.com e-mail addresses soon so let me adjust the maintainers entry and update the mailmap too. While at it consolidate a few other of my old emails to point to the main one. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@redhat.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20240228142240.2539358-1-tvrtko.ursulin@linux.intel.com
2024-03-04Merge tag 'mlx5-fixes-2024-03-01' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2024-03-01 This series provides bug fixes to mlx5 driver. Please pull and let me know if there is any problem. * tag 'mlx5-fixes-2024-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: Switch to using _bh variant of of spinlock API in port timestamping NAPI poll context net/mlx5e: Use a memory barrier to enforce PTP WQ xmit submission tracking occurs after populating the metadata_map net/mlx5e: Fix MACsec state loss upon state update in offload path net/mlx5e: Change the warning when ignore_flow_level is not supported net/mlx5: Check capability for fw_reset net/mlx5: Fix fw reporter diagnose output net/mlx5: E-switch, Change flow rule destination checking Revert "net/mlx5e: Check the number of elements before walk TC rhashtable" Revert "net/mlx5: Block entering switchdev mode with ns inconsistency" ==================== Link: https://lore.kernel.org/r/20240302070318.62997-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04Merge branch '10GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-03-01 (ixgbe, i40e, ice) This series contains updates to ixgbe, i40e, and ice drivers. Maciej corrects disable flow for ixgbe, i40e, and ice drivers which could cause non-functional interface with AF_XDP. Michal restores host configuration when changing MSI-X count for VFs on ice driver. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: reconfig host after changing MSI-X on VF ice: reorder disabling IRQ and NAPI in ice_qp_dis i40e: disable NAPI right after disabling irqs when handling xsk_pool ixgbe: {dis, en}able irqs in ixgbe_txrx_ring_{dis, en}able ==================== Link: https://lore.kernel.org/r/20240301192549.2993798-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04Merge branch ↵Jakub Kicinski
'intel-wired-lan-driver-updates-2024-02-28-ixgbe-igc-igb-e1000e-e100' Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-02-28 (ixgbe, igc, igb, e1000e, e100) This series contains updates to ixgbe, igc, igb, e1000e, and e100 drivers. Jon Maxwell makes module parameter values readable in sysfs for ixgbe, igb, and e100. Ernesto Castellotti adds support for 1000BASE-BX on ixgbe. Arnd Bergmann fixes build failure due to dependency issues for igc. Vitaly refactors error check to be more concise and prevent future issues on e1000e. v1: https://lore.kernel.org/netdev/20240229004135.741586-1-anthony.l.nguyen@intel.com/ ==================== Link: https://lore.kernel.org/r/20240301184806.2634508-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04e1000e: Minor flow correction in e1000_shutdown functionVitaly Lifshits
Add curly braces to avoid entering to an if statement where it is not always required in e1000_shutdown function. This improves code readability and might prevent non-deterministic behaviour in the future. Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-5-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04igc: fix LEDS_CLASS dependencyArnd Bergmann
When IGC is built-in but LEDS_CLASS is a loadable module, there is a link failure: x86_64-linux-ld: drivers/net/ethernet/intel/igc/igc_leds.o: in function `igc_led_setup': igc_leds.c:(.text+0x75c): undefined reference to `devm_led_classdev_register_ext' Add another dependency that prevents this combination. Fixes: ea578703b03d ("igc: Add support for LEDs on i225/i226") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-4-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04ixgbe: Add 1000BASE-BX supportErnesto Castellotti
Added support for 1000BASE-BX, i.e. Gigabit Ethernet over single strand of single-mode fiber. The initialization of a 1000BASE-BX SFP is the same as 1000BASE-SX/LX with the only difference that the Bit Rate Nominal Value must be checked to make sure it is a Gigabit Ethernet transceiver, as described by the SFF-8472 specification. This was tested with the FS.com SFP-GE-BX 1310/1490nm 10km transceiver: $ ethtool -m eth4 Identifier : 0x03 (SFP) Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID) Connector : 0x07 (LC) Transceiver codes : 0x00 0x00 0x00 0x40 0x00 0x00 0x00 0x00 0x00 Transceiver type : Ethernet: BASE-BX10 Encoding : 0x01 (8B/10B) BR, Nominal : 1300MBd Rate identifier : 0x00 (unspecified) Length (SMF,km) : 10km Length (SMF) : 10000m Length (50um) : 0m Length (62.5um) : 0m Length (Copper) : 0m Length (OM3) : 0m Laser wavelength : 1310nm Vendor name : FS Vendor OUI : 64:9d:99 Vendor PN : SFP-GE-BX Vendor rev : Option values : 0x20 0x0a Option : RX_LOS implemented Option : TX_FAULT implemented Option : Power level 3 requirement BR margin, max : 0% BR margin, min : 0% Vendor SN : S2202359108 Date code : 220307 Optical diagnostics support : Yes Laser bias current : 17.650 mA Laser output power : 0.2132 mW / -6.71 dBm Receiver signal average optical power : 0.2740 mW / -5.62 dBm Module temperature : 47.30 degrees C / 117.13 degrees F Module voltage : 3.2576 V Alarm/warning flags implemented : Yes Laser bias current high alarm : Off Laser bias current low alarm : Off Laser bias current high warning : Off Laser bias current low warning : Off Laser output power high alarm : Off Laser output power low alarm : Off Laser output power high warning : Off Laser output power low warning : Off Module temperature high alarm : Off Module temperature low alarm : Off Module temperature high warning : Off Module temperature low warning : Off Module voltage high alarm : Off Module voltage low alarm : Off Module voltage high warning : Off Module voltage low warning : Off Laser rx power high alarm : Off Laser rx power low alarm : Off Laser rx power high warning : Off Laser rx power low warning : Off Laser bias current high alarm threshold : 110.000 mA Laser bias current low alarm threshold : 1.000 mA Laser bias current high warning threshold : 100.000 mA Laser bias current low warning threshold : 1.000 mA Laser output power high alarm threshold : 0.7079 mW / -1.50 dBm Laser output power low alarm threshold : 0.0891 mW / -10.50 dBm Laser output power high warning threshold : 0.6310 mW / -2.00 dBm Laser output power low warning threshold : 0.1000 mW / -10.00 dBm Module temperature high alarm threshold : 90.00 degrees C / 194.00 degrees F Module temperature low alarm threshold : -45.00 degrees C / -49.00 degrees F Module temperature high warning threshold : 85.00 degrees C / 185.00 degrees F Module temperature low warning threshold : -40.00 degrees C / -40.00 degrees F Module voltage high alarm threshold : 3.7950 V Module voltage low alarm threshold : 2.8050 V Module voltage high warning threshold : 3.4650 V Module voltage low warning threshold : 3.1350 V Laser rx power high alarm threshold : 0.7079 mW / -1.50 dBm Laser rx power low alarm threshold : 0.0028 mW / -25.53 dBm Laser rx power high warning threshold : 0.6310 mW / -2.00 dBm Laser rx power low warning threshold : 0.0032 mW / -24.95 dBm Signed-off-by: Ernesto Castellotti <ernesto@castellotti.net> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04intel: make module parameters readable in sys filesystemJon Maxwell
Linux users sometimes need an easy way to check current values of module parameters. For example the module may be manually reloaded with different parameters. Make these visible and readable in the /sys filesystem to allow that. But don't make the "debug" module parameter visible as debugging is enabled via ethtool msglvl. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240301184806.2634508-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04tcp: align tcp_sock_write_rx groupEric Dumazet
Stephen Rothwell and kernel test robot reported that some arches (parisc, hexagon) and/or compilers would not like blamed commit. Lets make sure tcp_sock_write_rx group does not start with a hole. While we are at it, correct tcp_sock_write_tx CACHELINE_ASSERT_GROUP_SIZE() since after the blamed commit, we went to 105 bytes. Fixes: 99123622050f ("tcp: remove some holes in struct tcp_sock") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/netdev/20240301121108.5d39e4f9@canb.auug.org.au/ Closes: https://lore.kernel.org/oe-kbuild-all/202403011451.csPYOS3C-lkp@intel.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Link: https://lore.kernel.org/r/20240301171945.2958176-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04net: sparx5: Fix use after free inside sparx5_del_mact_entryHoratiu Vultur
Based on the static analyzis of the code it looks like when an entry from the MAC table was removed, the entry was still used after being freed. More precise the vid of the mac_entry was used after calling devm_kfree on the mac_entry. The fix consists in first using the vid of the mac_entry to delete the entry from the HW and after that to free it. Fixes: b37a1bae742f ("net: sparx5: add mactable support") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240301080608.3053468-1-horatiu.vultur@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04selftests/tc-testing: require an up to date iproute2 for blockcast testsPedro Tammela
Add the dependsOn test check for all the mirred blockcast tests. It will prevent the issue reported by LKFT which happens when an older iproute2 is used to run the current tdc. Tests are skipped if the dependsOn check fails. Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://lore.kernel.org/r/20240229143825.1373550-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04selftests: net: Correct couple of spelling mistakesPrabhav Kumar Vaish
Changes : - "excercise" is corrected to "exercise" in drivers/net/mlxsw/spectrum-2/tc_flower.sh - "mutliple" is corrected to "multiple" in drivers/net/netdevsim/ethtool-fec.sh Signed-off-by: Prabhav Kumar Vaish <pvkumar5749404@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240228120701.422264-1-pvkumar5749404@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-04mailmap: fix Kishon's emailNiklas Cassel
Kishon updated his email in commit e6aa4edd2f5b ("MAINTAINERS: Update Kishon's email address in PCI endpoint subsystem"). However, as he is no longer at TI, his TI email now bounces. Add the same email as he has in MAINTAINERS to a mailmap, so that get_maintainer.pl will not output an email that bounces. (This is neeed as get_maintainer.pl will use "git author" to CC people who have significantly modified the same file as you.) Link: https://lkml.kernel.org/r/20240229134318.1201935-1-cassel@kernel.org Signed-off-by: Niklas Cassel <cassel@kernel.org> Cc: Kishon Vijay Abraham I <kishon@kernel.org> Cc: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04init/Kconfig: lower GCC version check for -Warray-boundsKees Cook
We continue to see false positives from -Warray-bounds even in GCC 10, which is getting reported in a few places[1] still: security/security.c:811:2: warning: `memcpy' offset 32 is out of the bounds [0, 0] [-Warray-bounds] Lower the GCC version check from 11 to 10. Link: https://lkml.kernel.org/r/20240223170824.work.768-kees@kernel.org Reported-by: Lu Yao <yaolu@kylinos.cn> Closes: https://lore.kernel.org/lkml/20240117014541.8887-1-yaolu@kylinos.cn/ Link: https://lore.kernel.org/linux-next/65d84438.620a0220.7d171.81a7@mx.google.com [1] Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Paul Moore <paul@paul-moore.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marc Aurèle La France <tsi@tuyoix.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm, mmap: fix vma_merge() case 7 with vma_ops->closeVlastimil Babka
When debugging issues with a workload using SysV shmem, Michal Hocko has come up with a reproducer that shows how a series of mprotect() operations can result in an elevated shm_nattch and thus leak of the resource. The problem is caused by wrong assumptions in vma_merge() commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test"). The shmem vmas have a vma_ops->close callback that decrements shm_nattch, and we remove the vma without calling it. vma_merge() has thus historically avoided merging vma's with vma_ops->close and commit 714965ca8252 was supposed to keep it that way. It relaxed the checks for vma_ops->close in can_vma_merge_after() assuming that it is never called on a vma that would be a candidate for removal. However, the vma_merge() code does also use the result of this check in the decision to remove a different vma in the merge case 7. A robust solution would be to refactor vma_merge() code in a way that the vma_ops->close check is only done for vma's that are actually going to be removed, and not as part of the preliminary checks. That would both solve the existing bug, and also allow additional merges that the checks currently prevent unnecessarily in some cases. However to fix the existing bug first with a minimized risk, and for easier stable backports, this patch only adds a vma_ops->close check to the buggy case 7 specifically. All other cases of vma removal are covered by the can_vma_merge_before() check that includes the test for vma_ops->close. The reproducer code, adapted from Michal Hocko's code: int main(int argc, char *argv[]) { int segment_id; size_t segment_size = 20 * PAGE_SIZE; char * sh_mem; struct shmid_ds shmid_ds; key_t key = 0x1234; segment_id = shmget(key, segment_size, IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR); sh_mem = (char *)shmat(segment_id, NULL, 0); mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE); mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE); mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE); shmdt(sh_mem); shmctl(segment_id, IPC_STAT, &shmid_ds); printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch); if (shmctl(segment_id, IPC_RMID, 0)) printf("IPCRM failed %d\n", errno); return (shmid_ds.shm_nattch) ? 1 : 0; } Link: https://lkml.kernel.org/r/20240222215930.14637-2-vbabka@suse.cz Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm: userfaultfd: fix unexpected change to src_folio when UFFDIO_MOVE failsQi Zheng
After ptep_clear_flush(), if we find that src_folio is pinned we will fail UFFDIO_MOVE and put src_folio back to src_pte entry, but the change to src_folio->{mapping,index} is not restored in this process. This is not what we expected, so fix it. This can cause the rmap for that page to be invalid, possibly resulting in memory corruption. At least swapout+migration would no longer work, because we might fail to locate the mappings of that folio. Link: https://lkml.kernel.org/r/20240222080815.46291-1-zhengqi.arch@bytedance.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL ↵Vlastimil Babka
allocations Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask. Quoting Sven: 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set. 2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order. 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress. 4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway). 5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction 6. goto 2. indefinite stall. (end quote) The immediate root cause is confusing the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be indicating a lack of order-0 pages, and in step 5 evaluating that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO. To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it. Also use the new helper in: - compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order - in_reclaim_compaction() which will make should_continue_reclaim() return false and we don't over-reclaim unnecessarily - in __alloc_pages_slowpath() to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact and also to skip the early compaction attempt that we do in some cases Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sven van Ashbrook <svenva@chromium.org> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/ Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Curtis Malainey <cujomalainey@chromium.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Takashi Iwai <tiwai@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04io_uring/net: fix overflow check in io_recvmsg_mshot_prep()Dan Carpenter
The "controllen" variable is type size_t (unsigned long). Casting it to int could lead to an integer underflow. The check_add_overflow() function considers the type of the destination which is type int. If we add two positive values and the result cannot fit in an integer then that's counted as an overflow. However, if we cast "controllen" to an int and it turns negative, then negative values *can* fit into an int type so there is no overflow. Good: 100 + (unsigned long)-4 = 96 <-- overflow Bad: 100 + (int)-4 = 96 <-- no overflow I deleted the cast of the sizeof() as well. That's not a bug but the cast is unnecessary. Fixes: 9b0fc3c054ff ("io_uring: fix types in io_recvmsg_multishot_overflow") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: correct the type of variableMuhammad Usama Anjum
The namelen is of type int. It shouldn't be made size_t which is unsigned. The signed number is needed for error checking before use. Fixes: c55978024d12 ("io_uring/net: move receive multishot out of the generic msghdr path") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240301144349.2807544-1-usama.anjum@collabora.com Signed-off-by: Jens Axboe <axboe@kernel.dk>