From 5575058ba95bb016243e54e5ca3371b9884ded56 Mon Sep 17 00:00:00 2001 From: Ping-Ke Shih Date: Thu, 12 Sep 2024 10:16:26 +0800 Subject: wifi: rtw89: coex: add debug message of link counts on 2/5GHz bands for wl_info v7 The counts will be used by MLO, but the code that uses them is still being added, so add a debug message in the todo part for now to avoid warnings reported by clang: coex.c:6323:23: warning: variable 'cnt_2g' set but not used [-Wunused-but-set-variable] 6323 | u8 i, mode, cnt = 0, cnt_2g = 0, cnt_5g = 0, ... | ^ coex.c:6323:35: warning: variable 'cnt_5g' set but not used [-Wunused-but-set-variable] 6323 | u8 i, mode, cnt = 0, cnt_2g = 0, cnt_5g = 0, ... | ^ Signed-off-by: Ping-Ke Shih Signed-off-by: Kalle Valo Link: https://patch.msgid.link/20240912021626.10494-1-pkshih@realtek.com --- drivers/net/wireless/realtek/rtw89/coex.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/wireless/realtek/rtw89/coex.c b/drivers/net/wireless/realtek/rtw89/coex.c index df51b29142aa..8d27374db83c 100644 --- a/drivers/net/wireless/realtek/rtw89/coex.c +++ b/drivers/net/wireless/realtek/rtw89/coex.c @@ -6445,6 +6445,8 @@ static void _update_wl_info_v7(struct rtw89_dev *rtwdev, u8 rid) /* todo DBCC related event */ rtw89_debug(rtwdev, RTW89_DBG_BTC, "[BTC] wl_info phy_now=%d\n", phy_now); + rtw89_debug(rtwdev, RTW89_DBG_BTC, + "[BTC] rlink cnt_2g=%d cnt_5g=%d\n", cnt_2g, cnt_5g); if (wl_rinfo->dbcc_en != rtwdev->dbcc_en) { wl_rinfo->dbcc_chg = 1; -- cgit From 34b69548108480ebb36cb6a067974a88ec745897 Mon Sep 17 00:00:00 2001 From: Felix Fietkau Date: Tue, 17 Sep 2024 13:09:42 +0200 Subject: wifi: mt76: do not increase mcu skb refcount if retry is not supported MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If mcu_skb_prepare_msg is not implemented, incrementing the skb refcount does not work for mcu message retry. In some cases (e.g. on SDIO), shared skbs can trigger a BUG_ON, crashing the system. Fix this by only incrementing the refcount if retry is actually supported. Fixes: 3688c18b65ae ("wifi: mt76: mt7915: retry mcu messages") Closes: https://lore.kernel.org/r/d907b13a-f8be-4cb8-a0bb-560a21278041@notapiano/ Reported-by: Nícolas F. R. A. 
Prado #KernelCI Tested-by: Alper Nebi Yasak Signed-off-by: Felix Fietkau Signed-off-by: Kalle Valo Link: https://patch.msgid.link/20240917110942.22077-1-nbd@nbd.name --- drivers/net/wireless/mediatek/mt76/mcu.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/wireless/mediatek/mt76/mcu.c b/drivers/net/wireless/mediatek/mt76/mcu.c index 98da82b74094..3353012e8542 100644 --- a/drivers/net/wireless/mediatek/mt76/mcu.c +++ b/drivers/net/wireless/mediatek/mt76/mcu.c @@ -84,13 +84,16 @@ int mt76_mcu_skb_send_and_get_msg(struct mt76_dev *dev, struct sk_buff *skb, mutex_lock(&dev->mcu.mutex); if (dev->mcu_ops->mcu_skb_prepare_msg) { + orig_skb = skb; ret = dev->mcu_ops->mcu_skb_prepare_msg(dev, skb, cmd, &seq); if (ret < 0) goto out; } retry: - orig_skb = skb_get(skb); + /* orig skb might be needed for retry, mcu_skb_send_msg consumes it */ + if (orig_skb) + skb_get(orig_skb); ret = dev->mcu_ops->mcu_skb_send_msg(dev, skb, cmd, &seq); if (ret < 0) goto out; @@ -105,7 +108,7 @@ retry: do { skb = mt76_mcu_get_response(dev, expires); if (!skb && !test_bit(MT76_MCU_RESET, &dev->phy.state) && - retry++ < dev->mcu_ops->max_retry) { + orig_skb && retry++ < dev->mcu_ops->max_retry) { dev_err(dev->dev, "Retry message %08x (seq %d)\n", cmd, seq); skb = orig_skb; -- cgit From d4cdc46ca16a5c78b36c5b9b6ad8cac09d6130a0 Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Thu, 12 Sep 2024 01:01:21 +0200 Subject: wifi: iwlegacy: Fix "field-spanning write" warning in il_enqueue_hcmd() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit iwlegacy uses command buffers with a payload size of 320 bytes (default) or 4092 bytes (huge). The struct il_device_cmd type describes the default buffers and there is no separate type describing the huge buffers. The il_enqueue_hcmd() function works with both default and huge buffers, and has a memcpy() to the buffer payload. 
The size of this copy may exceed 320 bytes when using a huge buffer, which now results in a run-time warning: memcpy: detected field-spanning write (size 1014) of single field "&out_cmd->cmd.payload" at drivers/net/wireless/intel/iwlegacy/common.c:3170 (size 320) To fix this: - Define a new struct type for huge buffers, with a correctly sized payload field - When using a huge buffer in il_enqueue_hcmd(), cast the command buffer pointer to that type when looking up the payload field Reported-by: Martin-Éric Racine References: https://bugs.debian.org/1062421 References: https://bugzilla.kernel.org/show_bug.cgi?id=219124 Signed-off-by: Ben Hutchings Fixes: 54d9469bc515 ("fortify: Add run-time WARN for cross-field memcpy()") Tested-by: Martin-Éric Racine Tested-by: Brandon Nielsen Acked-by: Stanislaw Gruszka Signed-off-by: Kalle Valo Link: https://patch.msgid.link/ZuIhQRi/791vlUhE@decadent.org.uk --- drivers/net/wireless/intel/iwlegacy/common.c | 13 ++++++++++++- drivers/net/wireless/intel/iwlegacy/common.h | 12 ++++++++++++ 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/drivers/net/wireless/intel/iwlegacy/common.c b/drivers/net/wireless/intel/iwlegacy/common.c index 9d33a66a49b5..4616293ec0cf 100644 --- a/drivers/net/wireless/intel/iwlegacy/common.c +++ b/drivers/net/wireless/intel/iwlegacy/common.c @@ -3122,6 +3122,7 @@ il_enqueue_hcmd(struct il_priv *il, struct il_host_cmd *cmd) struct il_cmd_meta *out_meta; dma_addr_t phys_addr; unsigned long flags; + u8 *out_payload; u32 idx; u16 fix_size; @@ -3157,6 +3158,16 @@ il_enqueue_hcmd(struct il_priv *il, struct il_host_cmd *cmd) out_cmd = txq->cmd[idx]; out_meta = &txq->meta[idx]; + /* The payload is in the same place in regular and huge + * command buffers, but we need to let the compiler know when + * we're using a larger payload buffer to avoid "field- + * spanning write" warnings at run-time for huge commands. + */ + if (cmd->flags & CMD_SIZE_HUGE) + out_payload = ((struct il_device_cmd_huge *)out_cmd)->cmd.payload; + else + out_payload = out_cmd->cmd.payload; + if (WARN_ON(out_meta->flags & CMD_MAPPED)) { spin_unlock_irqrestore(&il->hcmd_lock, flags); return -ENOSPC; @@ -3170,7 +3181,7 @@ il_enqueue_hcmd(struct il_priv *il, struct il_host_cmd *cmd) out_meta->callback = cmd->callback; out_cmd->hdr.cmd = cmd->id; - memcpy(&out_cmd->cmd.payload, cmd->data, cmd->len); + memcpy(out_payload, cmd->data, cmd->len); /* At this point, the out_cmd now has all of the incoming cmd * information */ diff --git a/drivers/net/wireless/intel/iwlegacy/common.h b/drivers/net/wireless/intel/iwlegacy/common.h index 2147781b5fff..725c2a88ddb7 100644 --- a/drivers/net/wireless/intel/iwlegacy/common.h +++ b/drivers/net/wireless/intel/iwlegacy/common.h @@ -560,6 +560,18 @@ struct il_device_cmd { #define TFD_MAX_PAYLOAD_SIZE (sizeof(struct il_device_cmd)) +/** + * struct il_device_cmd_huge + * + * For use when sending huge commands. 
+ */ +struct il_device_cmd_huge { + struct il_cmd_header hdr; /* uCode API */ + union { + u8 payload[IL_MAX_CMD_SIZE - sizeof(struct il_cmd_header)]; + } __packed cmd; +} __packed; + struct il_host_cmd { const void *data; unsigned long reply_page; -- cgit From cb4c7df596a9048a6025e96e62fe698f15ec1992 Mon Sep 17 00:00:00 2001 From: Sam Edwards Date: Thu, 3 Oct 2024 20:41:30 -0700 Subject: phy: usb: Fix missing elements in BCM4908 USB init array The Broadcom USB PHY driver contains a lookup table (`reg_bits_map_tables`) to resolve register bitmaps unique to certain versions of the USB PHY as found in various Broadcom chip families. A recent commit (see 'fixes' tag) introduced two new elements to each chip family in this table -- except for one: BCM4908. This resulted in the xHCI controller not being initialized correctly, causing a panic on boot. The next patch will update this table to use designated initializers in order to prevent this from happening again. For now, just add back the missing array elements to resolve the regression. Fixes: 4536fe9640b6 ("phy: usb: suppress OC condition for 7439b2") Signed-off-by: Sam Edwards Reviewed-by: Justin Chen Reviewed-by: Florian Fainelli Link: https://lore.kernel.org/r/20241004034131.1363813-2-CFSworks@gmail.com Signed-off-by: Vinod Koul --- drivers/phy/broadcom/phy-brcm-usb-init.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/phy/broadcom/phy-brcm-usb-init.c b/drivers/phy/broadcom/phy-brcm-usb-init.c index 39536b6d96a9..5ebb3a616115 100644 --- a/drivers/phy/broadcom/phy-brcm-usb-init.c +++ b/drivers/phy/broadcom/phy-brcm-usb-init.c @@ -220,6 +220,8 @@ usb_reg_bits_map_table[BRCM_FAMILY_COUNT][USB_CTRL_SELECTOR_COUNT] = { 0, /* USB_CTRL_SETUP_SCB2_EN_MASK */ 0, /* USB_CTRL_SETUP_SS_EHCI64BIT_EN_MASK */ 0, /* USB_CTRL_SETUP_STRAP_IPP_SEL_MASK */ + 0, /* USB_CTRL_SETUP_OC3_DISABLE_PORT0_MASK */ + 0, /* USB_CTRL_SETUP_OC3_DISABLE_PORT1_MASK */ 0, /* USB_CTRL_SETUP_OC3_DISABLE_MASK */ 0, /* USB_CTRL_PLL_CTL_PLL_IDDQ_PWRDN_MASK */ 0, /* USB_CTRL_USB_PM_BDC_SOFT_RESETB_MASK */ -- cgit From 2d0f973b5f1c369671d0c59e103d15f4f6f775c9 Mon Sep 17 00:00:00 2001 From: Bartosz Wawrzyniak Date: Thu, 3 Oct 2024 12:34:02 +0000 Subject: phy: cadence: Sierra: Fix offset of DEQ open eye algorithm control register Fix the value of SIERRA_DEQ_OPENEYE_CTRL_PREG and add a definition for SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG. This fixes the SGMII single link register configuration. 
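(Aside, not part of either patch: the BCM4908 fix above mentions that a follow-up converts the bitmap table to designated initializers. A minimal standalone sketch of why that guards against exactly this kind of missed-slot regression; the enum and values below are invented for illustration only.)

    #include <stdio.h>

    /* A selector enum that later grew a new entry in the middle. */
    enum sel { SEL_SCB2_EN, SEL_OC3_DISABLE_PORT0, SEL_OC3_DISABLE, SEL_COUNT };

    /* Positional table written before SEL_OC3_DISABLE_PORT0 existed: the
     * 0x04 meant for SEL_OC3_DISABLE silently lands on the new selector,
     * and SEL_OC3_DISABLE itself now reads 0. */
    static const unsigned int by_position[SEL_COUNT] = { 0x01, 0x04 };

    /* Designated table: each value stays attached to its selector, so a
     * genuinely missing entry reads as 0 instead of a neighbour's value. */
    static const unsigned int by_designator[SEL_COUNT] = {
            [SEL_SCB2_EN]     = 0x01,
            [SEL_OC3_DISABLE] = 0x04,
    };

    int main(void)
    {
            printf("positional: %#x, designated: %#x\n",
                   by_position[SEL_OC3_DISABLE], by_designator[SEL_OC3_DISABLE]);
            return 0;
    }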
Fixes: 7a5ad9b4b98c ("phy: cadence: Sierra: Update single link PCIe register configuration") Signed-off-by: Bartosz Wawrzyniak Link: https://lore.kernel.org/r/20241003123405.1101157-1-bwawrzyn@cisco.com Signed-off-by: Vinod Koul --- drivers/phy/cadence/phy-cadence-sierra.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/drivers/phy/cadence/phy-cadence-sierra.c b/drivers/phy/cadence/phy-cadence-sierra.c index aeec6eb6be23..dfc4f55d112e 100644 --- a/drivers/phy/cadence/phy-cadence-sierra.c +++ b/drivers/phy/cadence/phy-cadence-sierra.c @@ -174,8 +174,9 @@ #define SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG 0x150 #define SIERRA_DEQ_TAU_CTRL2_PREG 0x151 #define SIERRA_DEQ_TAU_CTRL3_PREG 0x152 -#define SIERRA_DEQ_OPENEYE_CTRL_PREG 0x158 +#define SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG 0x158 #define SIERRA_DEQ_CONCUR_EPIOFFSET_MODE_PREG 0x159 +#define SIERRA_DEQ_OPENEYE_CTRL_PREG 0x15C #define SIERRA_DEQ_PICTRL_PREG 0x161 #define SIERRA_CPICAL_TMRVAL_MODE1_PREG 0x170 #define SIERRA_CPICAL_TMRVAL_MODE0_PREG 0x171 @@ -1733,7 +1734,7 @@ static const struct cdns_reg_pairs ml_pcie_100_no_ssc_ln_regs[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -1797,7 +1798,7 @@ static const struct cdns_reg_pairs ti_ml_pcie_100_no_ssc_ln_regs[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -1874,7 +1875,7 @@ static const struct cdns_reg_pairs ml_pcie_100_int_ssc_ln_regs[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -1941,7 +1942,7 @@ static const struct cdns_reg_pairs ti_ml_pcie_100_int_ssc_ln_regs[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -2012,7 +2013,7 @@ static const struct cdns_reg_pairs ml_pcie_100_ext_ssc_ln_regs[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -2079,7 +2080,7 @@ static const struct cdns_reg_pairs ti_ml_pcie_100_ext_ssc_ln_regs[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -2140,7 +2141,7 @@ static const struct cdns_reg_pairs 
cdns_pcie_ln_regs_no_ssc[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -2215,7 +2216,7 @@ static const struct cdns_reg_pairs cdns_pcie_ln_regs_int_ssc[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, @@ -2284,7 +2285,7 @@ static const struct cdns_reg_pairs cdns_pcie_ln_regs_ext_ssc[] = { {0x3C0F, SIERRA_DEQ_TAU_CTRL1_SLOW_MAINT_PREG}, {0x1C0C, SIERRA_DEQ_TAU_CTRL2_PREG}, {0x0100, SIERRA_DEQ_TAU_CTRL3_PREG}, - {0x5E82, SIERRA_DEQ_OPENEYE_CTRL_PREG}, + {0x5E82, SIERRA_DEQ_TAU_EPIOFFSET_MODE_PREG}, {0x002B, SIERRA_CPI_TRIM_PREG}, {0x0003, SIERRA_EPI_CTRL_PREG}, {0x803F, SIERRA_SDFILT_H2L_A_PREG}, -- cgit From b8c4076db5fd24b3be047e033b1098a5366db2fc Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Thu, 3 Oct 2024 08:09:01 -0700 Subject: xfs: don't allocate COW extents when unsharing a hole It doesn't make sense to allocate a COW extent when unsharing a hole because holes cannot be shared. Fixes: 1f1397b7218d7 ("xfs: don't allocate into the data fork for an unshare request") Signed-off-by: Darrick J. Wong Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig Signed-off-by: Christian Brauner --- fs/xfs/xfs_iomap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 1e11f48814c0..dc0443848485 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -707,7 +707,7 @@ imap_needs_cow( return false; /* when zeroing we don't have to COW holes or unwritten extents */ - if (flags & IOMAP_ZERO) { + if (flags & (IOMAP_UNSHARE | IOMAP_ZERO)) { if (!nimaps || imap->br_startblock == HOLESTARTBLOCK || imap->br_state == XFS_EXT_UNWRITTEN) -- cgit From 6ef6a0e821d3dad6bf8a5d5508762dba9042c84b Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Thu, 3 Oct 2024 08:09:16 -0700 Subject: iomap: share iomap_unshare_iter predicate code with fsdax The predicate code that iomap_unshare_iter uses to decide if it's really needs to unshare a file range mapping should be shared with the fsdax version, because right now they're opencoded and inconsistent. Note that we simplify the predicate logic a bit -- we no longer allow unsharing of inline data mappings, but there aren't any filesystems that allow shared inline data currently. This is a fix in the sense that it should have been ported to fsdax. Fixes: b53fdb215d13 ("iomap: improve shared block detection in iomap_unshare_iter") Signed-off-by: Darrick J. 
Wong Link: https://lore.kernel.org/r/172796813294.1131942.15762084021076932620.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig Signed-off-by: Christian Brauner --- fs/dax.c | 3 +-- fs/iomap/buffered-io.c | 30 ++++++++++++++++-------------- include/linux/iomap.h | 1 + 3 files changed, 18 insertions(+), 16 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index c62acd2812f8..5064eefb1c1e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1268,8 +1268,7 @@ static s64 dax_unshare_iter(struct iomap_iter *iter) s64 ret = 0; void *daddr = NULL, *saddr = NULL; - /* don't bother with blocks that are not shared to start with */ - if (!(iomap->flags & IOMAP_F_SHARED)) + if (!iomap_want_unshare_iter(iter)) return length; id = dax_read_lock(); diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 78ebd265f425..3899169b2cf7 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1309,19 +1309,12 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode, } EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc); -static loff_t iomap_unshare_iter(struct iomap_iter *iter) +bool iomap_want_unshare_iter(const struct iomap_iter *iter) { - struct iomap *iomap = &iter->iomap; - loff_t pos = iter->pos; - loff_t length = iomap_length(iter); - loff_t written = 0; - - /* Don't bother with blocks that are not shared to start with. */ - if (!(iomap->flags & IOMAP_F_SHARED)) - return length; - /* - * Don't bother with delalloc reservations, holes or unwritten extents. + * Don't bother with blocks that are not shared to start with; or + * mappings that cannot be shared, such as inline data, delalloc + * reservations, holes or unwritten extents. * * Note that we use srcmap directly instead of iomap_iter_srcmap as * unsharing requires providing a separate source map, and the presence @@ -1329,9 +1322,18 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter) * IOMAP_F_SHARED which can be set for any data that goes into the COW * fork for XFS. */ - if (iter->srcmap.type == IOMAP_HOLE || - iter->srcmap.type == IOMAP_DELALLOC || - iter->srcmap.type == IOMAP_UNWRITTEN) + return (iter->iomap.flags & IOMAP_F_SHARED) && + iter->srcmap.type == IOMAP_MAPPED; +} + +static loff_t iomap_unshare_iter(struct iomap_iter *iter) +{ + struct iomap *iomap = &iter->iomap; + loff_t pos = iter->pos; + loff_t length = iomap_length(iter); + loff_t written = 0; + + if (!iomap_want_unshare_iter(iter)) return length; do { diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 4ad12a3c8bae..d8a7fc84348c 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -267,6 +267,7 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len); bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio); int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, const struct iomap_ops *ops); +bool iomap_want_unshare_iter(const struct iomap_iter *iter); int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, const struct iomap_ops *ops); int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, -- cgit From 95472274b6fed8f2d30fbdda304e12174b3d4099 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Thu, 3 Oct 2024 08:09:32 -0700 Subject: fsdax: remove zeroing code from dax_unshare_iter Remove the code in dax_unshare_iter that zeroes the destination memory because it's not necessary. 
If srcmap is unwritten, we don't have to do anything because that unwritten extent came from the regular file mapping, and unwritten extents cannot be shared. The same applies to holes. Furthermore, zeroing to unshare a mapping is just plain wrong because unsharing means copy on write, and we should be copying data. This is effectively a revert of commit 13dd4e04625f ("fsdax: unshare: zero destination if srcmap is HOLE or UNWRITTEN") Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong Link: https://lore.kernel.org/r/172796813311.1131942.16033376284752798632.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig Signed-off-by: Christian Brauner --- fs/dax.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 5064eefb1c1e..9fbbdaa784b4 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1276,14 +1276,6 @@ static s64 dax_unshare_iter(struct iomap_iter *iter) if (ret < 0) goto out_unlock; - /* zero the distance if srcmap is HOLE or UNWRITTEN */ - if (srcmap->flags & IOMAP_F_SHARED || srcmap->type == IOMAP_UNWRITTEN) { - memset(daddr, 0, length); - dax_flush(iomap->dax_dev, daddr, length); - ret = length; - goto out_unlock; - } - ret = dax_iomap_direct_access(srcmap, pos, length, &saddr, NULL); if (ret < 0) goto out_unlock; -- cgit From 50793801fc7f6d08def48754fb0f0706b0cfc394 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Thu, 3 Oct 2024 08:09:48 -0700 Subject: fsdax: dax_unshare_iter needs to copy entire blocks The code that copies data from srcmap to iomap in dax_unshare_iter is very very broken, which bfoster's recent fsx changes have exposed. If the pos and len passed to dax_file_unshare are not aligned to an fsblock boundary, the iter pos and length in the _iter function will reflect this unalignment. dax_iomap_direct_access always returns a pointer to the start of the kmapped fsdax page, even if its pos argument is in the middle of that page. This is catastrophic for data integrity when iter->pos is not aligned to a page, because daddr/saddr do not point to the same byte in the file as iter->pos. Hence we corrupt user data by copying it to the wrong place. If iter->pos + iomap_length() in the _iter function not aligned to a page, then we fail to copy a full block, and only partially populate the destination block. This is catastrophic for data confidentiality because we expose stale pmem contents. Fix both of these issues by aligning copy_pos/copy_len to a page boundary (remember, this is fsdax so 1 fsblock == 1 base page) so that we always copy full blocks. We're not done yet -- there's no call to invalidate_inode_pages2_range, so programs that have the file range mmap'd will continue accessing the old memory mapping after the file metadata updates have completed. Be careful with the return value -- if the unshare succeeds, we still need to return the number of bytes that the iomap iter thinks we're operating on. Cc: ruansy.fnst@fujitsu.com Fixes: d984648e428b ("fsdax,xfs: port unshare to fsdax") Signed-off-by: Darrick J. 
Wong Link: https://lore.kernel.org/r/172796813328.1131942.16777025316348797355.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig Signed-off-by: Christian Brauner --- fs/dax.c | 34 +++++++++++++++++++++++++++------- 1 file changed, 27 insertions(+), 7 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 9fbbdaa784b4..21b47402b3dc 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1262,26 +1262,46 @@ static s64 dax_unshare_iter(struct iomap_iter *iter) { struct iomap *iomap = &iter->iomap; const struct iomap *srcmap = iomap_iter_srcmap(iter); - loff_t pos = iter->pos; - loff_t length = iomap_length(iter); + loff_t copy_pos = iter->pos; + u64 copy_len = iomap_length(iter); + u32 mod; int id = 0; s64 ret = 0; void *daddr = NULL, *saddr = NULL; if (!iomap_want_unshare_iter(iter)) - return length; + return iomap_length(iter); + + /* + * Extend the file range to be aligned to fsblock/pagesize, because + * we need to copy entire blocks, not just the byte range specified. + * Invalidate the mapping because we're about to CoW. + */ + mod = offset_in_page(copy_pos); + if (mod) { + copy_len += mod; + copy_pos -= mod; + } + + mod = offset_in_page(copy_pos + copy_len); + if (mod) + copy_len += PAGE_SIZE - mod; + + invalidate_inode_pages2_range(iter->inode->i_mapping, + copy_pos >> PAGE_SHIFT, + (copy_pos + copy_len - 1) >> PAGE_SHIFT); id = dax_read_lock(); - ret = dax_iomap_direct_access(iomap, pos, length, &daddr, NULL); + ret = dax_iomap_direct_access(iomap, copy_pos, copy_len, &daddr, NULL); if (ret < 0) goto out_unlock; - ret = dax_iomap_direct_access(srcmap, pos, length, &saddr, NULL); + ret = dax_iomap_direct_access(srcmap, copy_pos, copy_len, &saddr, NULL); if (ret < 0) goto out_unlock; - if (copy_mc_to_kernel(daddr, saddr, length) == 0) - ret = length; + if (copy_mc_to_kernel(daddr, saddr, copy_len) == 0) + ret = iomap_length(iter); else ret = -EIO; -- cgit From 117932eea99b729ee5d12783601a4f7f5fd58a23 Mon Sep 17 00:00:00 2001 From: Chen Ridong Date: Tue, 8 Oct 2024 11:24:56 +0000 Subject: cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction A hung_task problem shown below was found: INFO: task kworker/0:0:8 blocked for more than 327 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Workqueue: events cgroup_bpf_release Call Trace: __schedule+0x5a2/0x2050 ? find_held_lock+0x33/0x100 ? wq_worker_sleeping+0x9e/0xe0 schedule+0x9f/0x180 schedule_preempt_disabled+0x25/0x50 __mutex_lock+0x512/0x740 ? cgroup_bpf_release+0x1e/0x4d0 ? cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? cgroup_bpf_release+0x1e/0x4d0 ? mutex_lock_nested+0x2b/0x40 ? __pfx_delay_tsc+0x10/0x10 mutex_lock_nested+0x2b/0x40 cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 ? process_scheduled_works+0x161/0x8a0 process_scheduled_works+0x23a/0x8a0 worker_thread+0x231/0x5b0 ? __pfx_worker_thread+0x10/0x10 kthread+0x14d/0x1c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x59/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 This issue can be reproduced by the following pressure test: 1. A large number of cpuset cgroups are deleted. 2. Set cpu on and off repeatedly. 3. Set watchdog_thresh repeatedly. The scripts can be obtained at LINK mentioned above the signature. The reason for this issue is cgroup_mutex and cpu_hotplug_lock are acquired in different tasks, which may lead to deadlock. It can lead to a deadlock through the following steps: 1. 
A large number of cpusets are deleted asynchronously, which puts a large number of cgroup_bpf_release works into system_wq. The max_active of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are cgroup_bpf_release works, and many cgroup_bpf_release works will be put into inactive queue. As illustrated in the diagram, there are 256 (in the active queue) + n (in the inactive queue) works. 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put smp_call_on_cpu work into system_wq. However step 1 has already filled system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has to wait until the works that were put into the inactive queue earlier have executed (n cgroup_bpf_release), so it will be blocked for a while. 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. 4. Cpusets that were deleted at step 1 put cgroup_release works into cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. When cgroup_mutex is acquired by work at css_killed_work_fn, it will call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked for step 3. 5. At this moment, there are 256 works in active queue that are cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as a result, all of them are blocked. Consequently, sscs.work can not be executed. Ultimately, this situation leads to four processes being blocked, forming a deadlock. system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) ... 2000+ cgroups deleted asyn 256 actives + n inactives __lockup_detector_reconfigure P(cpu_hotplug_lock.read) put sscs.work into system_wq 256 + n + 1(sscs.work) sscs.work wait to be executed waiting sscs.work finish percpu_down_write P(cpu_hotplug_lock.write) ...blocking... css_killed_work_fn P(cgroup_mutex) cpuset_css_offline P(cpu_hotplug_lock.read) ...blocking... 256 cgroup_bpf_release mutex_lock(&cgroup_mutex); ..blocking... To fix the problem, place cgroup_bpf_release works on a dedicated workqueue which can break the loop and solve the problem. System wqs are for misc things which shouldn't create a large number of concurrent work items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work items, it should use its own dedicated workqueue. Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Cc: stable@vger.kernel.org # v5.3+ Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t Tested-by: Vishal Chourasia Signed-off-by: Chen Ridong Signed-off-by: Tejun Heo --- kernel/bpf/cgroup.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index e7113d700b87..025d7e2214ae 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -24,6 +24,23 @@ DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE); EXPORT_SYMBOL(cgroup_bpf_enabled_key); +/* + * cgroup bpf destruction makes heavy use of work items and there can be a lot + * of concurrent destructions. Use a separate workqueue so that cgroup bpf + * destruction work items don't end up filling up max_active of system_wq + * which may lead to deadlock. 
+ */ +static struct workqueue_struct *cgroup_bpf_destroy_wq; + +static int __init cgroup_bpf_wq_init(void) +{ + cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1); + if (!cgroup_bpf_destroy_wq) + panic("Failed to alloc workqueue for cgroup bpf destroy.\n"); + return 0; +} +core_initcall(cgroup_bpf_wq_init); + /* __always_inline is necessary to prevent indirect call through run_prog * function pointer. */ @@ -334,7 +351,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref) struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt); INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release); - queue_work(system_wq, &cgrp->bpf.release_work); + queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work); } /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through -- cgit From 07c90acb071b9954e1fecb1e4f4f13d12c544b34 Mon Sep 17 00:00:00 2001 From: Ville Syrjälä Date: Tue, 1 Oct 2024 23:07:45 +0300 Subject: wifi: iwlegacy: Clear stale interrupts before resuming device MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit iwl4965 fails upon resume from hibernation on my laptop. The reason seems to be a stale interrupt which isn't being cleared out before interrupts are enabled. We end up with a race beween the resume trying to bring things back up, and the restart work (queued form the interrupt handler) trying to bring things down. Eventually the whole thing blows up. Fix the problem by clearing out any stale interrupts before interrupts get enabled during resume. Here's a debug log of the indicent: [ 12.042589] ieee80211 phy0: il_isr ISR inta 0x00000080, enabled 0xaa00008b, fh 0x00000000 [ 12.042625] ieee80211 phy0: il4965_irq_tasklet inta 0x00000080, enabled 0x00000000, fh 0x00000000 [ 12.042651] iwl4965 0000:10:00.0: RF_KILL bit toggled to enable radio. [ 12.042653] iwl4965 0000:10:00.0: On demand firmware reload [ 12.042690] ieee80211 phy0: il4965_irq_tasklet End inta 0x00000000, enabled 0xaa00008b, fh 0x00000000, flags 0x00000282 [ 12.052207] ieee80211 phy0: il4965_mac_start enter [ 12.052212] ieee80211 phy0: il_prep_station Add STA to driver ID 31: ff:ff:ff:ff:ff:ff [ 12.052244] ieee80211 phy0: il4965_set_hw_ready hardware ready [ 12.052324] ieee80211 phy0: il_apm_init Init card's basic functions [ 12.052348] ieee80211 phy0: il_apm_init L1 Enabled; Disabling L0S [ 12.055727] ieee80211 phy0: il4965_load_bsm Begin load bsm [ 12.056140] ieee80211 phy0: il4965_verify_bsm Begin verify bsm [ 12.058642] ieee80211 phy0: il4965_verify_bsm BSM bootstrap uCode image OK [ 12.058721] ieee80211 phy0: il4965_load_bsm BSM write complete, poll 1 iterations [ 12.058734] ieee80211 phy0: __il4965_up iwl4965 is coming up [ 12.058737] ieee80211 phy0: il4965_mac_start Start UP work done. [ 12.058757] ieee80211 phy0: __il4965_down iwl4965 is going down [ 12.058761] ieee80211 phy0: il_scan_cancel_timeout Scan cancel timeout [ 12.058762] ieee80211 phy0: il_do_scan_abort Not performing scan to abort [ 12.058765] ieee80211 phy0: il_clear_ucode_stations Clearing ucode stations in driver [ 12.058767] ieee80211 phy0: il_clear_ucode_stations No active stations found to be cleared [ 12.058819] ieee80211 phy0: _il_apm_stop Stop card, put in low power state [ 12.058827] ieee80211 phy0: _il_apm_stop_master stop master [ 12.058864] ieee80211 phy0: il4965_clear_free_frames 0 frames on pre-allocated heap on clear. [ 12.058869] ieee80211 phy0: Hardware restart was requested [ 16.132299] iwl4965 0000:10:00.0: START_ALIVE timeout after 4000ms. 
[ 16.132303] ------------[ cut here ]------------ [ 16.132304] Hardware became unavailable upon resume. This could be a software issue prior to suspend or a hardware issue. [ 16.132338] WARNING: CPU: 0 PID: 181 at net/mac80211/util.c:1826 ieee80211_reconfig+0x8f/0x14b0 [mac80211] [ 16.132390] Modules linked in: ctr ccm sch_fq_codel xt_tcpudp xt_multiport xt_state iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 ip_tables x_tables binfmt_misc joydev mousedev btusb btrtl btintel btbcm bluetooth ecdh_generic ecc iTCO_wdt i2c_dev iwl4965 iwlegacy coretemp snd_hda_codec_analog pcspkr psmouse mac80211 snd_hda_codec_generic libarc4 sdhci_pci cqhci sha256_generic sdhci libsha256 firewire_ohci snd_hda_intel snd_intel_dspcfg mmc_core snd_hda_codec snd_hwdep firewire_core led_class iosf_mbi snd_hda_core uhci_hcd lpc_ich crc_itu_t cfg80211 ehci_pci ehci_hcd snd_pcm usbcore mfd_core rfkill snd_timer snd usb_common soundcore video parport_pc parport intel_agp wmi intel_gtt backlight e1000e agpgart evdev [ 16.132456] CPU: 0 UID: 0 PID: 181 Comm: kworker/u8:6 Not tainted 6.11.0-cl+ #143 [ 16.132460] Hardware name: Hewlett-Packard HP Compaq 6910p/30BE, BIOS 68MCU Ver. F.19 07/06/2010 [ 16.132463] Workqueue: async async_run_entry_fn [ 16.132469] RIP: 0010:ieee80211_reconfig+0x8f/0x14b0 [mac80211] [ 16.132501] Code: da 02 00 00 c6 83 ad 05 00 00 00 48 89 df e8 98 1b fc ff 85 c0 41 89 c7 0f 84 e9 02 00 00 48 c7 c7 a0 e6 48 a0 e8 d1 77 c4 e0 <0f> 0b eb 2d 84 c0 0f 85 8b 01 00 00 c6 87 ad 05 00 00 00 e8 69 1b [ 16.132504] RSP: 0018:ffffc9000029fcf0 EFLAGS: 00010282 [ 16.132507] RAX: 0000000000000000 RBX: ffff8880072008e0 RCX: 0000000000000001 [ 16.132509] RDX: ffffffff81f21a18 RSI: 0000000000000086 RDI: 0000000000000001 [ 16.132510] RBP: ffff8880072003c0 R08: 0000000000000000 R09: 0000000000000003 [ 16.132512] R10: 0000000000000000 R11: ffff88807e5b0000 R12: 0000000000000001 [ 16.132514] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000ffffff92 [ 16.132515] FS: 0000000000000000(0000) GS:ffff88807c200000(0000) knlGS:0000000000000000 [ 16.132517] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 16.132519] CR2: 000055dd43786c08 CR3: 000000000978f000 CR4: 00000000000006f0 [ 16.132521] Call Trace: [ 16.132525] [ 16.132526] ? __warn+0x77/0x120 [ 16.132532] ? ieee80211_reconfig+0x8f/0x14b0 [mac80211] [ 16.132564] ? report_bug+0x15c/0x190 [ 16.132568] ? handle_bug+0x36/0x70 [ 16.132571] ? exc_invalid_op+0x13/0x60 [ 16.132573] ? asm_exc_invalid_op+0x16/0x20 [ 16.132579] ? ieee80211_reconfig+0x8f/0x14b0 [mac80211] [ 16.132611] ? snd_hdac_bus_init_cmd_io+0x24/0x200 [snd_hda_core] [ 16.132617] ? pick_eevdf+0x133/0x1c0 [ 16.132622] ? check_preempt_wakeup_fair+0x70/0x90 [ 16.132626] ? wakeup_preempt+0x4a/0x60 [ 16.132628] ? ttwu_do_activate.isra.0+0x5a/0x190 [ 16.132632] wiphy_resume+0x79/0x1a0 [cfg80211] [ 16.132675] ? wiphy_suspend+0x2a0/0x2a0 [cfg80211] [ 16.132697] dpm_run_callback+0x75/0x1b0 [ 16.132703] device_resume+0x97/0x200 [ 16.132707] async_resume+0x14/0x20 [ 16.132711] async_run_entry_fn+0x1b/0xa0 [ 16.132714] process_one_work+0x13d/0x350 [ 16.132718] worker_thread+0x2be/0x3d0 [ 16.132722] ? cancel_delayed_work_sync+0x70/0x70 [ 16.132725] kthread+0xc0/0xf0 [ 16.132729] ? kthread_park+0x80/0x80 [ 16.132732] ret_from_fork+0x28/0x40 [ 16.132735] ? 
kthread_park+0x80/0x80 [ 16.132738] ret_from_fork_asm+0x11/0x20 [ 16.132741] [ 16.132742] ---[ end trace 0000000000000000 ]--- [ 16.132930] ------------[ cut here ]------------ [ 16.132932] WARNING: CPU: 0 PID: 181 at net/mac80211/driver-ops.c:41 drv_stop+0xe7/0xf0 [mac80211] [ 16.132957] Modules linked in: ctr ccm sch_fq_codel xt_tcpudp xt_multiport xt_state iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 ip_tables x_tables binfmt_misc joydev mousedev btusb btrtl btintel btbcm bluetooth ecdh_generic ecc iTCO_wdt i2c_dev iwl4965 iwlegacy coretemp snd_hda_codec_analog pcspkr psmouse mac80211 snd_hda_codec_generic libarc4 sdhci_pci cqhci sha256_generic sdhci libsha256 firewire_ohci snd_hda_intel snd_intel_dspcfg mmc_core snd_hda_codec snd_hwdep firewire_core led_class iosf_mbi snd_hda_core uhci_hcd lpc_ich crc_itu_t cfg80211 ehci_pci ehci_hcd snd_pcm usbcore mfd_core rfkill snd_timer snd usb_common soundcore video parport_pc parport intel_agp wmi intel_gtt backlight e1000e agpgart evdev [ 16.133014] CPU: 0 UID: 0 PID: 181 Comm: kworker/u8:6 Tainted: G W 6.11.0-cl+ #143 [ 16.133018] Tainted: [W]=WARN [ 16.133019] Hardware name: Hewlett-Packard HP Compaq 6910p/30BE, BIOS 68MCU Ver. F.19 07/06/2010 [ 16.133021] Workqueue: async async_run_entry_fn [ 16.133025] RIP: 0010:drv_stop+0xe7/0xf0 [mac80211] [ 16.133048] Code: 48 85 c0 74 0e 48 8b 78 08 89 ea 48 89 de e8 e0 87 04 00 65 ff 0d d1 de c4 5f 0f 85 42 ff ff ff e8 be 52 c2 e0 e9 38 ff ff ff <0f> 0b 5b 5d c3 0f 1f 40 00 41 54 49 89 fc 55 53 48 89 f3 2e 2e 2e [ 16.133050] RSP: 0018:ffffc9000029fc50 EFLAGS: 00010246 [ 16.133053] RAX: 0000000000000000 RBX: ffff8880072008e0 RCX: ffff88800377f6c0 [ 16.133054] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8880072008e0 [ 16.133056] RBP: 0000000000000000 R08: ffffffff81f238d8 R09: 0000000000000000 [ 16.133058] R10: ffff8880080520f0 R11: 0000000000000000 R12: ffff888008051c60 [ 16.133060] R13: ffff8880072008e0 R14: 0000000000000000 R15: ffff8880072011d8 [ 16.133061] FS: 0000000000000000(0000) GS:ffff88807c200000(0000) knlGS:0000000000000000 [ 16.133063] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 16.133065] CR2: 000055dd43786c08 CR3: 000000000978f000 CR4: 00000000000006f0 [ 16.133067] Call Trace: [ 16.133069] [ 16.133070] ? __warn+0x77/0x120 [ 16.133075] ? drv_stop+0xe7/0xf0 [mac80211] [ 16.133098] ? report_bug+0x15c/0x190 [ 16.133100] ? handle_bug+0x36/0x70 [ 16.133103] ? exc_invalid_op+0x13/0x60 [ 16.133105] ? asm_exc_invalid_op+0x16/0x20 [ 16.133109] ? drv_stop+0xe7/0xf0 [mac80211] [ 16.133132] ieee80211_do_stop+0x55a/0x810 [mac80211] [ 16.133161] ? fq_codel_reset+0xa5/0xc0 [sch_fq_codel] [ 16.133164] ieee80211_stop+0x4f/0x180 [mac80211] [ 16.133192] __dev_close_many+0xa2/0x120 [ 16.133195] dev_close_many+0x90/0x150 [ 16.133198] dev_close+0x5d/0x80 [ 16.133200] cfg80211_shutdown_all_interfaces+0x40/0xe0 [cfg80211] [ 16.133223] wiphy_resume+0xb2/0x1a0 [cfg80211] [ 16.133247] ? wiphy_suspend+0x2a0/0x2a0 [cfg80211] [ 16.133269] dpm_run_callback+0x75/0x1b0 [ 16.133273] device_resume+0x97/0x200 [ 16.133277] async_resume+0x14/0x20 [ 16.133280] async_run_entry_fn+0x1b/0xa0 [ 16.133283] process_one_work+0x13d/0x350 [ 16.133287] worker_thread+0x2be/0x3d0 [ 16.133290] ? cancel_delayed_work_sync+0x70/0x70 [ 16.133294] kthread+0xc0/0xf0 [ 16.133296] ? kthread_park+0x80/0x80 [ 16.133299] ret_from_fork+0x28/0x40 [ 16.133302] ? 
kthread_park+0x80/0x80 [ 16.133304] ret_from_fork_asm+0x11/0x20 [ 16.133307] [ 16.133308] ---[ end trace 0000000000000000 ]--- [ 16.133335] ieee80211 phy0: PM: dpm_run_callback(): wiphy_resume [cfg80211] returns -110 [ 16.133360] ieee80211 phy0: PM: failed to restore async: error -110 Cc: stable@vger.kernel.org Cc: Stanislaw Gruszka Cc: Kalle Valo Cc: linux-wireless@vger.kernel.org Signed-off-by: Ville Syrjälä Acked-by: Stanislaw Gruszka Signed-off-by: Kalle Valo Link: https://patch.msgid.link/20241001200745.8276-1-ville.syrjala@linux.intel.com --- drivers/net/wireless/intel/iwlegacy/common.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/wireless/intel/iwlegacy/common.c b/drivers/net/wireless/intel/iwlegacy/common.c index 4616293ec0cf..958dd4f9bc69 100644 --- a/drivers/net/wireless/intel/iwlegacy/common.c +++ b/drivers/net/wireless/intel/iwlegacy/common.c @@ -4973,6 +4973,8 @@ il_pci_resume(struct device *device) */ pci_write_config_byte(pdev, PCI_CFG_RETRY_TIMEOUT, 0x00); + _il_wr(il, CSR_INT, 0xffffffff); + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); il_enable_interrupts(il); if (!(_il_rd(il, CSR_GP_CNTRL) & CSR_GP_CNTRL_REG_FLAG_HW_RF_KILL_SW)) -- cgit From b3e046c31441d182b954fc2f57b2dc38c71ad4bc Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Tue, 24 Sep 2024 14:08:57 +0200 Subject: mac80211: MAC80211_MESSAGE_TRACING should depend on TRACING When tracing is disabled, there is no point in asking the user about enabling tracing of all mac80211 debug messages. Fixes: 3fae0273168026ed ("mac80211: trace debug messages") Signed-off-by: Geert Uytterhoeven Link: https://patch.msgid.link/85bbe38ce0df13350f45714e2dc288cc70947a19.1727179690.git.geert@linux-m68k.org Signed-off-by: Johannes Berg --- net/mac80211/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/mac80211/Kconfig b/net/mac80211/Kconfig index 13438cc0a6b1..cf0f7780fb10 100644 --- a/net/mac80211/Kconfig +++ b/net/mac80211/Kconfig @@ -96,7 +96,7 @@ config MAC80211_DEBUGFS config MAC80211_MESSAGE_TRACING bool "Trace all mac80211 debug messages" - depends on MAC80211 + depends on MAC80211 && TRACING help Select this option to have mac80211 register the mac80211_msg trace subsystem with tracepoints to -- cgit From 8dd0498983eef524a8d104eb8abb32ec4c595bec Mon Sep 17 00:00:00 2001 From: Ben Greear Date: Mon, 23 Sep 2024 18:13:25 -0700 Subject: wifi: mac80211: Fix setting txpower with emulate_chanctx Propagate hw conf into the driver when txpower changes and driver is emulating channel contexts. 
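(Illustrative sketch, not part of the patch: with emulate_chanctx, ieee80211_hw_conf_chan() ends up in the driver's ->config() callback with IEEE80211_CONF_CHANGE_POWER set when the power level actually changed, which is where the new value becomes visible to the hardware driver. The driver name and helper below are hypothetical.)

    static int mydrv_config(struct ieee80211_hw *hw, u32 changed)
    {
            struct mydrv_priv *priv = hw->priv;    /* hypothetical private data */

            if (changed & IEEE80211_CONF_CHANGE_POWER) {
                    /* hw->conf.power_level now carries the value the user set
                     * through the ieee80211_set_tx_power() path above. */
                    mydrv_write_txpower(priv, hw->conf.power_level);    /* hypothetical */
            }

            return 0;
    }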
Signed-off-by: Ben Greear Link: https://patch.msgid.link/20240924011325.1509103-1-greearb@candelatech.com Signed-off-by: Johannes Berg --- net/mac80211/cfg.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c index 847304a3a29a..7cefe763b772 100644 --- a/net/mac80211/cfg.c +++ b/net/mac80211/cfg.c @@ -3046,6 +3046,7 @@ static int ieee80211_set_tx_power(struct wiphy *wiphy, enum nl80211_tx_power_setting txp_type = type; bool update_txp_type = false; bool has_monitor = false; + int old_power = local->user_power_level; lockdep_assert_wiphy(local->hw.wiphy); @@ -3128,6 +3129,10 @@ static int ieee80211_set_tx_power(struct wiphy *wiphy, } } + if (local->emulate_chanctx && + (old_power != local->user_power_level)) + ieee80211_hw_conf_chan(local); + return 0; } -- cgit From e1a9ae3a73810c00e492485fdbae09f0dccb057e Mon Sep 17 00:00:00 2001 From: Chenming Huang Date: Mon, 23 Sep 2024 07:46:44 +0530 Subject: wifi: cfg80211: Do not create BSS entries for unsupported channels Currently, in cfg80211_parse_ml_elem_sta_data(), when RNR element indicates a BSS that operates in a channel that current regulatory domain doesn't support, a NULL value is returned by ieee80211_get_channel_khz() and assigned to this BSS entry's channel field. Later in cfg80211_inform_single_bss_data(), the reported BSS entry's channel will be wrongly overridden by transmitted BSS's. This could result in connection failure that when wpa_supplicant tries to select this reported BSS entry while it actually resides in an unsupported channel. Since this channel is not supported, it is reasonable to skip such entries instead of reporting wrong information. Signed-off-by: Chenming Huang Link: https://patch.msgid.link/20240923021644.12885-1-quic_chenhuan@quicinc.com Signed-off-by: Johannes Berg --- net/wireless/scan.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/wireless/scan.c b/net/wireless/scan.c index 59a90bf3c0d6..d0aed41ded2f 100644 --- a/net/wireless/scan.c +++ b/net/wireless/scan.c @@ -3050,6 +3050,10 @@ cfg80211_parse_ml_elem_sta_data(struct wiphy *wiphy, freq = ieee80211_channel_to_freq_khz(ap_info->channel, band); data.channel = ieee80211_get_channel_khz(wiphy, freq); + /* Skip if RNR element specifies an unsupported channel */ + if (!data.channel) + continue; + /* Skip if BSS entry generated from MBSSID or DIRECT source * frame data available already. */ -- cgit From 68d0021fe7231eec0fb84cd110cf62a6e782b72d Mon Sep 17 00:00:00 2001 From: Remi Pommarel Date: Tue, 24 Sep 2024 21:28:04 +0200 Subject: wifi: cfg80211: Add wiphy_delayed_work_pending() Add wiphy_delayed_work_pending() to check if any delayed work timer is pending, that can be used to be sure that wiphy_delayed_work_queue() won't postpone an already pending delayed work. Signed-off-by: Remi Pommarel Link: https://patch.msgid.link/20240924192805.13859-2-repk@triplefau.lt [fix return value kernel-doc] Signed-off-by: Johannes Berg --- include/net/cfg80211.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ net/wireless/core.c | 7 +++++++ 2 files changed, 51 insertions(+) diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h index 69ec1eb41a09..941dc62f3027 100644 --- a/include/net/cfg80211.h +++ b/include/net/cfg80211.h @@ -6129,6 +6129,50 @@ void wiphy_delayed_work_cancel(struct wiphy *wiphy, void wiphy_delayed_work_flush(struct wiphy *wiphy, struct wiphy_delayed_work *dwork); +/** + * wiphy_delayed_work_pending - Find out whether a wiphy delayable + * work item is currently pending. 
+ * + * @wiphy: the wiphy, for debug purposes + * @dwork: the delayed work in question + * + * Return: true if timer is pending, false otherwise + * + * How wiphy_delayed_work_queue() works is by setting a timer which + * when it expires calls wiphy_work_queue() to queue the wiphy work. + * Because wiphy_delayed_work_queue() uses mod_timer(), if it is + * called twice and the second call happens before the first call + * deadline, the work will rescheduled for the second deadline and + * won't run before that. + * + * wiphy_delayed_work_pending() can be used to detect if calling + * wiphy_work_delayed_work_queue() would start a new work schedule + * or delayed a previous one. As seen below it cannot be used to + * detect precisely if the work has finished to execute nor if it + * is currently executing. + * + * CPU0 CPU1 + * wiphy_delayed_work_queue(wk) + * mod_timer(wk->timer) + * wiphy_delayed_work_pending(wk) -> true + * + * [...] + * expire_timers(wk->timer) + * detach_timer(wk->timer) + * wiphy_delayed_work_pending(wk) -> false + * wk->timer->function() | + * wiphy_work_queue(wk) | delayed work pending + * list_add_tail() | returns false but + * queue_work(cfg80211_wiphy_work) | wk->func() has not + * | been run yet + * [...] | + * cfg80211_wiphy_work() | + * wk->func() V + * + */ +bool wiphy_delayed_work_pending(struct wiphy *wiphy, + struct wiphy_delayed_work *dwork); + /** * enum ieee80211_ap_reg_power - regulatory power for an Access Point * diff --git a/net/wireless/core.c b/net/wireless/core.c index 661adfc77644..8331064de9dd 100644 --- a/net/wireless/core.c +++ b/net/wireless/core.c @@ -1704,6 +1704,13 @@ void wiphy_delayed_work_flush(struct wiphy *wiphy, } EXPORT_SYMBOL_GPL(wiphy_delayed_work_flush); +bool wiphy_delayed_work_pending(struct wiphy *wiphy, + struct wiphy_delayed_work *dwork) +{ + return timer_pending(&dwork->timer); +} +EXPORT_SYMBOL_GPL(wiphy_delayed_work_pending); + static int __init cfg80211_init(void) { int err; -- cgit From 4cc6f3e5e5765abad9c091989970d67d8c1d2204 Mon Sep 17 00:00:00 2001 From: Remi Pommarel Date: Tue, 24 Sep 2024 21:28:05 +0200 Subject: wifi: mac80211: Convert color collision detection to wiphy work Call to ieee80211_color_collision_detection_work() needs wiphy lock to be held (see lockdep assert in cfg80211_bss_color_notify()). 
Not locking wiphy causes the following lockdep error: WARNING: CPU: 2 PID: 42 at net/wireless/nl80211.c:19505 cfg80211_bss_color_notify+0x1a4/0x25c Modules linked in: CPU: 2 PID: 42 Comm: kworker/u8:3 Tainted: G W 6.4.0-02327-g36c6cb260481 #1048 Hardware name: Workqueue: phy1 ieee80211_color_collision_detection_work pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : cfg80211_bss_color_notify+0x1a4/0x25c lr : cfg80211_bss_color_notify+0x1a0/0x25c sp : ffff000002947d00 x29: ffff000002947d00 x28: ffff800008e1a000 x27: ffff000002bd4705 x26: ffff00000d034000 x25: ffff80000903cf40 x24: 0000000000000000 x23: ffff00000cb70720 x22: 0000000000800000 x21: ffff800008dfb008 x20: 000000000000008d x19: ffff00000d035fa8 x18: 0000000000000010 x17: 0000000000000001 x16: 000003564b1ce96a x15: 000d69696d057970 x14: 000000000000003b x13: 0000000000000001 x12: 0000000000040000 x11: 0000000000000001 x10: ffff80000978f9c0 x9 : ffff0000028d3174 x8 : ffff800008e30000 x7 : 0000000000000000 x6 : 0000000000000028 x5 : 000000000002f498 x4 : ffff00000d034a80 x3 : 0000000000800000 x2 : ffff800016143000 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: cfg80211_bss_color_notify+0x1a4/0x25c ieee80211_color_collision_detection_work+0x20/0x118 process_one_work+0x294/0x554 worker_thread+0x70/0x440 kthread+0xf4/0xf8 ret_from_fork+0x10/0x20 irq event stamp: 77372 hardirqs last enabled at (77371): [] _raw_spin_unlock_irq+0x2c/0x4c hardirqs last disabled at (77372): [] el1_dbg+0x20/0x48 softirqs last enabled at (77350): [] batadv_send_outstanding_bcast_packet+0xb8/0x120 softirqs last disabled at (77348): [] batadv_send_outstanding_bcast_packet+0x80/0x120 The wiphy lock cannot be taken directly from color collision detection delayed work (ieee80211_color_collision_detection_work()) because this work is cancel_delayed_work_sync() under this wiphy lock causing a potential deadlock( see [0] for details). To fix that ieee80211_color_collision_detection_work() could be converted to a wiphy work and cancel_delayed_work_sync() can be simply replaced by wiphy_delayed_work_cancel() serving the same purpose under wiphy lock. This could potentially fix [1]. 
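(Aside: condensed from the diff below, the resulting call pattern uses only the wiphy-bound delayed-work helpers cfg80211 already provides; the work callback then runs with the wiphy mutex held, which satisfies the lockdep assert mentioned above.)

    /* init once, from ieee80211_link_init() */
    wiphy_delayed_work_init(&link->color_collision_detect_work,
                            ieee80211_color_collision_detection_work);

    /* rate-limited (re)queueing from ieee80211_obss_color_collision_notify() */
    if (!wiphy_delayed_work_pending(local->hw.wiphy,
                                    &link->color_collision_detect_work))
            wiphy_delayed_work_queue(local->hw.wiphy,
                                     &link->color_collision_detect_work,
                                     msecs_to_jiffies(500));

    /* teardown in ieee80211_link_stop(), already under the wiphy lock */
    wiphy_delayed_work_cancel(local->hw.wiphy,
                              &link->color_collision_detect_work);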
[0]: https://lore.kernel.org/linux-wireless/D4A40Q44OAY2.W3SIF6UEPBUN@freebox.fr/ [1]: https://lore.kernel.org/lkml/000000000000612f290618eee3e5@google.com/ Reported-by: Nicolas Escande Signed-off-by: Remi Pommarel Link: https://patch.msgid.link/20240924192805.13859-3-repk@triplefau.lt Signed-off-by: Johannes Berg --- net/mac80211/cfg.c | 17 +++++++++-------- net/mac80211/ieee80211_i.h | 5 +++-- net/mac80211/link.c | 7 ++++--- 3 files changed, 16 insertions(+), 13 deletions(-) diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c index 7cefe763b772..13294cc20a9b 100644 --- a/net/mac80211/cfg.c +++ b/net/mac80211/cfg.c @@ -4831,12 +4831,12 @@ void ieee80211_color_change_finalize_work(struct wiphy *wiphy, ieee80211_color_change_finalize(link); } -void ieee80211_color_collision_detection_work(struct work_struct *work) +void ieee80211_color_collision_detection_work(struct wiphy *wiphy, + struct wiphy_work *work) { - struct delayed_work *delayed_work = to_delayed_work(work); struct ieee80211_link_data *link = - container_of(delayed_work, struct ieee80211_link_data, - color_collision_detect_work); + container_of(work, struct ieee80211_link_data, + color_collision_detect_work.work); struct ieee80211_sub_if_data *sdata = link->sdata; cfg80211_obss_color_collision_notify(sdata->dev, link->color_bitmap, @@ -4889,7 +4889,8 @@ ieee80211_obss_color_collision_notify(struct ieee80211_vif *vif, return; } - if (delayed_work_pending(&link->color_collision_detect_work)) { + if (wiphy_delayed_work_pending(sdata->local->hw.wiphy, + &link->color_collision_detect_work)) { rcu_read_unlock(); return; } @@ -4898,9 +4899,9 @@ ieee80211_obss_color_collision_notify(struct ieee80211_vif *vif, /* queue the color collision detection event every 500 ms in order to * avoid sending too much netlink messages to userspace. 
*/ - ieee80211_queue_delayed_work(&sdata->local->hw, - &link->color_collision_detect_work, - msecs_to_jiffies(500)); + wiphy_delayed_work_queue(sdata->local->hw.wiphy, + &link->color_collision_detect_work, + msecs_to_jiffies(500)); rcu_read_unlock(); } diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h index 4f0390918b60..04fec7e516cf 100644 --- a/net/mac80211/ieee80211_i.h +++ b/net/mac80211/ieee80211_i.h @@ -1053,7 +1053,7 @@ struct ieee80211_link_data { } csa; struct wiphy_work color_change_finalize_work; - struct delayed_work color_collision_detect_work; + struct wiphy_delayed_work color_collision_detect_work; u64 color_bitmap; /* context reservation -- protected with wiphy mutex */ @@ -2005,7 +2005,8 @@ int ieee80211_channel_switch(struct wiphy *wiphy, struct net_device *dev, /* color change handling */ void ieee80211_color_change_finalize_work(struct wiphy *wiphy, struct wiphy_work *work); -void ieee80211_color_collision_detection_work(struct work_struct *work); +void ieee80211_color_collision_detection_work(struct wiphy *wiphy, + struct wiphy_work *work); /* interface handling */ #define MAC80211_SUPPORTED_FEATURES_TX (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | \ diff --git a/net/mac80211/link.c b/net/mac80211/link.c index 0bbac64d5fa0..46092fbcde90 100644 --- a/net/mac80211/link.c +++ b/net/mac80211/link.c @@ -41,8 +41,8 @@ void ieee80211_link_init(struct ieee80211_sub_if_data *sdata, ieee80211_csa_finalize_work); wiphy_work_init(&link->color_change_finalize_work, ieee80211_color_change_finalize_work); - INIT_DELAYED_WORK(&link->color_collision_detect_work, - ieee80211_color_collision_detection_work); + wiphy_delayed_work_init(&link->color_collision_detect_work, + ieee80211_color_collision_detection_work); INIT_LIST_HEAD(&link->assigned_chanctx_list); INIT_LIST_HEAD(&link->reserved_chanctx_list); wiphy_delayed_work_init(&link->dfs_cac_timer_work, @@ -72,7 +72,8 @@ void ieee80211_link_stop(struct ieee80211_link_data *link) if (link->sdata->vif.type == NL80211_IFTYPE_STATION) ieee80211_mgd_stop_link(link); - cancel_delayed_work_sync(&link->color_collision_detect_work); + wiphy_delayed_work_cancel(link->sdata->local->hw.wiphy, + &link->color_collision_detect_work); wiphy_work_cancel(link->sdata->local->hw.wiphy, &link->color_change_finalize_work); wiphy_work_cancel(link->sdata->local->hw.wiphy, -- cgit From 393b6bc174b0dd21bb2a36c13b36e62fc3474a23 Mon Sep 17 00:00:00 2001 From: Felix Fietkau Date: Wed, 2 Oct 2024 11:56:30 +0200 Subject: wifi: mac80211: do not pass a stopped vif to the driver in .get_txpower Avoid potentially crashing in the driver because of uninitialized private data Fixes: 5b3dc42b1b0d ("mac80211: add support for driver tx power reporting") Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau Link: https://patch.msgid.link/20241002095630.22431-1-nbd@nbd.name Signed-off-by: Johannes Berg --- net/mac80211/cfg.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c index 13294cc20a9b..6dfc61a9acd4 100644 --- a/net/mac80211/cfg.c +++ b/net/mac80211/cfg.c @@ -3143,7 +3143,8 @@ static int ieee80211_get_tx_power(struct wiphy *wiphy, struct ieee80211_local *local = wiphy_priv(wiphy); struct ieee80211_sub_if_data *sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); - if (local->ops->get_txpower) + if (local->ops->get_txpower && + (sdata->flags & IEEE80211_SDATA_IN_DRIVER)) return drv_get_txpower(local, sdata, dbm); if (local->emulate_chanctx) -- cgit From 57be3d3562ca4aa62b8047bc681028cc402af8ce Mon Sep 17 00:00:00 
2001 From: "Gustavo A. R. Silva" Date: Fri, 4 Oct 2024 14:14:44 -0600 Subject: wifi: radiotap: Avoid -Wflex-array-member-not-at-end warnings -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. So, in order to avoid ending up with a flexible-array member in the middle of multiple other structs, we use the `__struct_group()` helper to create a new tagged `struct ieee80211_radiotap_header_fixed`. This structure groups together all the members of the flexible `struct ieee80211_radiotap_header` except the flexible array. As a result, the array is effectively separated from the rest of the members without modifying the memory layout of the flexible structure. We then change the type of the middle struct members currently causing trouble from `struct ieee80211_radiotap_header` to `struct ieee80211_radiotap_header_fixed`. We also want to ensure that in case new members need to be added to the flexible structure, they are always included within the newly created tagged struct. For this, we use `static_assert()`. This ensures that the memory layout for both the flexible structure and the new tagged struct is the same after any changes. This approach avoids having to implement `struct ieee80211_radiotap_header_fixed` as a completely separate structure, thus preventing having to maintain two independent but basically identical structures, closing the door to potential bugs in the future. So, with these changes, fix the following warnings: drivers/net/wireless/ath/wil6210/txrx.c:309:50: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/intel/ipw2x00/ipw2100.c:2521:50: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/intel/ipw2x00/ipw2200.h:1146:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/intel/ipw2x00/libipw.h:595:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/marvell/libertas/radiotap.h:34:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/marvell/libertas/radiotap.h:5:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/microchip/wilc1000/mon.c:10:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/microchip/wilc1000/mon.c:15:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/virtual/mac80211_hwsim.c:758:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/wireless/virtual/mac80211_hwsim.c:767:42: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. 
Silva Link: https://patch.msgid.link/ZwBMtBZKcrzwU7l4@kspp Signed-off-by: Johannes Berg --- drivers/net/wireless/ath/wil6210/txrx.c | 2 +- drivers/net/wireless/intel/ipw2x00/ipw2100.c | 2 +- drivers/net/wireless/intel/ipw2x00/ipw2200.h | 2 +- drivers/net/wireless/marvell/libertas/radiotap.h | 4 +-- drivers/net/wireless/microchip/wilc1000/mon.c | 4 +-- drivers/net/wireless/virtual/mac80211_hwsim.c | 4 +-- include/net/ieee80211_radiotap.h | 43 +++++++++++++----------- 7 files changed, 33 insertions(+), 28 deletions(-) diff --git a/drivers/net/wireless/ath/wil6210/txrx.c b/drivers/net/wireless/ath/wil6210/txrx.c index f29ac6de7139..19702b6f09c3 100644 --- a/drivers/net/wireless/ath/wil6210/txrx.c +++ b/drivers/net/wireless/ath/wil6210/txrx.c @@ -306,7 +306,7 @@ static void wil_rx_add_radiotap_header(struct wil6210_priv *wil, struct sk_buff *skb) { struct wil6210_rtap { - struct ieee80211_radiotap_header rthdr; + struct ieee80211_radiotap_header_fixed rthdr; /* fields should be in the order of bits in rthdr.it_present */ /* flags */ u8 flags; diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2100.c b/drivers/net/wireless/intel/ipw2x00/ipw2100.c index b6636002c7d2..fe75941c584d 100644 --- a/drivers/net/wireless/intel/ipw2x00/ipw2100.c +++ b/drivers/net/wireless/intel/ipw2x00/ipw2100.c @@ -2518,7 +2518,7 @@ static void isr_rx_monitor(struct ipw2100_priv *priv, int i, * to build this manually element by element, we can write it much * more efficiently than we can parse it. ORDER MATTERS HERE */ struct ipw_rt_hdr { - struct ieee80211_radiotap_header rt_hdr; + struct ieee80211_radiotap_header_fixed rt_hdr; s8 rt_dbmsignal; /* signal in dbM, kluged to signed */ } *ipw_rt; diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2200.h b/drivers/net/wireless/intel/ipw2x00/ipw2200.h index 8ebf09121e17..226286cb7eb8 100644 --- a/drivers/net/wireless/intel/ipw2x00/ipw2200.h +++ b/drivers/net/wireless/intel/ipw2x00/ipw2200.h @@ -1143,7 +1143,7 @@ struct ipw_prom_priv { * structure is provided regardless of any bits unset. 
*/ struct ipw_rt_hdr { - struct ieee80211_radiotap_header rt_hdr; + struct ieee80211_radiotap_header_fixed rt_hdr; u64 rt_tsf; /* TSF */ /* XXX */ u8 rt_flags; /* radiotap packet flags */ u8 rt_rate; /* rate in 500kb/s */ diff --git a/drivers/net/wireless/marvell/libertas/radiotap.h b/drivers/net/wireless/marvell/libertas/radiotap.h index 1ed5608d353f..d543bfe739dc 100644 --- a/drivers/net/wireless/marvell/libertas/radiotap.h +++ b/drivers/net/wireless/marvell/libertas/radiotap.h @@ -2,7 +2,7 @@ #include struct tx_radiotap_hdr { - struct ieee80211_radiotap_header hdr; + struct ieee80211_radiotap_header_fixed hdr; u8 rate; u8 txpower; u8 rts_retries; @@ -31,7 +31,7 @@ struct tx_radiotap_hdr { #define IEEE80211_FC_DSTODS 0x0300 struct rx_radiotap_hdr { - struct ieee80211_radiotap_header hdr; + struct ieee80211_radiotap_header_fixed hdr; u8 flags; u8 rate; u8 antsignal; diff --git a/drivers/net/wireless/microchip/wilc1000/mon.c b/drivers/net/wireless/microchip/wilc1000/mon.c index 03b7229a0ff5..c3d27aaec297 100644 --- a/drivers/net/wireless/microchip/wilc1000/mon.c +++ b/drivers/net/wireless/microchip/wilc1000/mon.c @@ -7,12 +7,12 @@ #include "cfg80211.h" struct wilc_wfi_radiotap_hdr { - struct ieee80211_radiotap_header hdr; + struct ieee80211_radiotap_header_fixed hdr; u8 rate; } __packed; struct wilc_wfi_radiotap_cb_hdr { - struct ieee80211_radiotap_header hdr; + struct ieee80211_radiotap_header_fixed hdr; u8 rate; u8 dump; u16 tx_flags; diff --git a/drivers/net/wireless/virtual/mac80211_hwsim.c b/drivers/net/wireless/virtual/mac80211_hwsim.c index f0e528abb1b4..3f424f14de4e 100644 --- a/drivers/net/wireless/virtual/mac80211_hwsim.c +++ b/drivers/net/wireless/virtual/mac80211_hwsim.c @@ -763,7 +763,7 @@ static const struct rhashtable_params hwsim_rht_params = { }; struct hwsim_radiotap_hdr { - struct ieee80211_radiotap_header hdr; + struct ieee80211_radiotap_header_fixed hdr; __le64 rt_tsft; u8 rt_flags; u8 rt_rate; @@ -772,7 +772,7 @@ struct hwsim_radiotap_hdr { } __packed; struct hwsim_radiotap_ack_hdr { - struct ieee80211_radiotap_header hdr; + struct ieee80211_radiotap_header_fixed hdr; u8 rt_flags; u8 pad; __le16 rt_channel; diff --git a/include/net/ieee80211_radiotap.h b/include/net/ieee80211_radiotap.h index 91762faecc13..1458d3695005 100644 --- a/include/net/ieee80211_radiotap.h +++ b/include/net/ieee80211_radiotap.h @@ -24,25 +24,27 @@ * struct ieee80211_radiotap_header - base radiotap header */ struct ieee80211_radiotap_header { - /** - * @it_version: radiotap version, always 0 - */ - uint8_t it_version; - - /** - * @it_pad: padding (or alignment) - */ - uint8_t it_pad; - - /** - * @it_len: overall radiotap header length - */ - __le16 it_len; - - /** - * @it_present: (first) present word - */ - __le32 it_present; + __struct_group(ieee80211_radiotap_header_fixed, hdr, __packed, + /** + * @it_version: radiotap version, always 0 + */ + uint8_t it_version; + + /** + * @it_pad: padding (or alignment) + */ + uint8_t it_pad; + + /** + * @it_len: overall radiotap header length + */ + __le16 it_len; + + /** + * @it_present: (first) present word + */ + __le32 it_present; + ); /** * @it_optional: all remaining presence bitmaps @@ -50,6 +52,9 @@ struct ieee80211_radiotap_header { __le32 it_optional[]; } __packed; +static_assert(offsetof(struct ieee80211_radiotap_header, it_optional) == sizeof(struct ieee80211_radiotap_header_fixed), + "struct member likely outside of __struct_group()"); + /* version is always 0 */ #define PKTHDR_RADIOTAP_VERSION 0 -- cgit From 
52009b419355195912a628d0a9847922e90c348c Mon Sep 17 00:00:00 2001 From: Felix Fietkau Date: Sun, 6 Oct 2024 17:36:30 +0200 Subject: wifi: mac80211: skip non-uploaded keys in ieee80211_iter_keys Sync iterator conditions with ieee80211_iter_keys_rcu. Fixes: 830af02f24fb ("mac80211: allow driver to iterate keys") Signed-off-by: Felix Fietkau Link: https://patch.msgid.link/20241006153630.87885-1-nbd@nbd.name Signed-off-by: Johannes Berg --- net/mac80211/key.c | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/net/mac80211/key.c b/net/mac80211/key.c index eecdd2265eaa..e45b5f56c405 100644 --- a/net/mac80211/key.c +++ b/net/mac80211/key.c @@ -987,6 +987,26 @@ void ieee80211_reenable_keys(struct ieee80211_sub_if_data *sdata) } } +static void +ieee80211_key_iter(struct ieee80211_hw *hw, + struct ieee80211_vif *vif, + struct ieee80211_key *key, + void (*iter)(struct ieee80211_hw *hw, + struct ieee80211_vif *vif, + struct ieee80211_sta *sta, + struct ieee80211_key_conf *key, + void *data), + void *iter_data) +{ + /* skip keys of station in removal process */ + if (key->sta && key->sta->removed) + return; + if (!(key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE)) + return; + iter(hw, vif, key->sta ? &key->sta->sta : NULL, + &key->conf, iter_data); +} + void ieee80211_iter_keys(struct ieee80211_hw *hw, struct ieee80211_vif *vif, void (*iter)(struct ieee80211_hw *hw, @@ -1005,16 +1025,13 @@ void ieee80211_iter_keys(struct ieee80211_hw *hw, if (vif) { sdata = vif_to_sdata(vif); list_for_each_entry_safe(key, tmp, &sdata->key_list, list) - iter(hw, &sdata->vif, - key->sta ? &key->sta->sta : NULL, - &key->conf, iter_data); + ieee80211_key_iter(hw, vif, key, iter, iter_data); } else { list_for_each_entry(sdata, &local->interfaces, list) list_for_each_entry_safe(key, tmp, &sdata->key_list, list) - iter(hw, &sdata->vif, - key->sta ? &key->sta->sta : NULL, - &key->conf, iter_data); + ieee80211_key_iter(hw, &sdata->vif, key, + iter, iter_data); } } EXPORT_SYMBOL(ieee80211_iter_keys); @@ -1031,17 +1048,8 @@ _ieee80211_iter_keys_rcu(struct ieee80211_hw *hw, { struct ieee80211_key *key; - list_for_each_entry_rcu(key, &sdata->key_list, list) { - /* skip keys of station in removal process */ - if (key->sta && key->sta->removed) - continue; - if (!(key->flags & KEY_FLAG_UPLOADED_TO_HARDWARE)) - continue; - - iter(hw, &sdata->vif, - key->sta ? &key->sta->sta : NULL, - &key->conf, iter_data); - } + list_for_each_entry_rcu(key, &sdata->key_list, list) + ieee80211_key_iter(hw, &sdata->vif, key, iter, iter_data); } void ieee80211_iter_keys_rcu(struct ieee80211_hw *hw, -- cgit From b5a468199b995bd8ee3c26f169a416a181210c9e Mon Sep 17 00:00:00 2001 From: Alain Volmat Date: Wed, 9 Oct 2024 18:15:52 +0200 Subject: spi: stm32: fix missing device mode capability in stm32mp25 The STM32MP25 SOC has capability to behave in device mode however missing .has_device_mode within its stm32mp25_spi_cfg structure leads to not being able to enable the device mode. 
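For context, a simplified sketch of how the per-SoC capability gate behaves. This is illustrative only, not the exact spi-stm32 code: the helper name and the -EPERM return are assumptions, while struct stm32_spi_cfg and its has_device_mode/has_fifo fields come from the patch itself.

  static const struct stm32_spi_cfg stm32mp25_cfg_sketch = {
          .has_fifo        = true,
          .has_device_mode = true,        /* the flag this patch adds for STM32MP25 */
  };

  /* Hypothetical helper: device (slave) mode is only honoured when the
   * matched per-compatible config advertises it, so a missing flag
   * silently disables the feature even on capable silicon. */
  static int stm32_spi_device_mode_allowed(struct device_node *np,
                                           const struct stm32_spi_cfg *cfg)
  {
          bool device_mode = of_property_read_bool(np, "spi-slave");

          if (device_mode && !cfg->has_device_mode)
                  return -EPERM;          /* capable SoC, but its cfg does not say so */

          return 0;
  }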
Signed-off-by: Alain Volmat Link: https://patch.msgid.link/20241009-spi-mp25-device-fix-v1-1-8e5ca7db7838@foss.st.com Signed-off-by: Mark Brown --- drivers/spi/spi-stm32.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/spi/spi-stm32.c b/drivers/spi/spi-stm32.c index 4c4ff074e3f6..fc72a89fb3a7 100644 --- a/drivers/spi/spi-stm32.c +++ b/drivers/spi/spi-stm32.c @@ -2044,6 +2044,7 @@ static const struct stm32_spi_cfg stm32mp25_spi_cfg = { .baud_rate_div_max = STM32H7_SPI_MBR_DIV_MAX, .has_fifo = true, .prevent_dma_burst = true, + .has_device_mode = true, }; static const struct of_device_id stm32_spi_of_match[] = { -- cgit From 1e48fd0574ee697e87f9c9bbd64d9a121d271f7a Mon Sep 17 00:00:00 2001 From: Justin Chen Date: Thu, 10 Oct 2024 11:53:44 -0700 Subject: phy: usb: disable COMMONONN for dual mode The COMMONONN bit suspends the phy when the port is put into a suspend state. However when the phy is shared between host and device in dual mode, this no longer works cleanly as there is no synchronization between the two. Fixes: 5095d045a962 ("phy: usb: Turn off phy when port is in suspend") Signed-off-by: Justin Chen Acked-by: Florian Fainelli Link: https://lore.kernel.org/r/20241010185344.859865-1-justin.chen@broadcom.com Signed-off-by: Vinod Koul --- drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c b/drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c index 4c10cafded4e..950b7ae1d1a8 100644 --- a/drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c +++ b/drivers/phy/broadcom/phy-brcm-usb-init-synopsys.c @@ -153,7 +153,9 @@ static void xhci_soft_reset(struct brcm_usb_init_params *params, } else { USB_CTRL_SET(ctrl, USB_PM, XHC_SOFT_RESETB); /* Required for COMMONONN to be set */ - USB_XHCI_GBL_UNSET(xhci_gbl, GUSB2PHYCFG, U2_FREECLK_EXISTS); + if (params->supported_port_modes != USB_CTLR_MODE_DRD) + USB_XHCI_GBL_UNSET(xhci_gbl, GUSB2PHYCFG, + U2_FREECLK_EXISTS); } } @@ -328,8 +330,12 @@ static void usb_init_common_7216(struct brcm_usb_init_params *params) /* 1 millisecond - for USB clocks to settle down */ usleep_range(1000, 2000); - /* Disable PHY when port is suspended */ - USB_CTRL_SET(ctrl, P0_U2PHY_CFG1, COMMONONN); + /* + * Disable PHY when port is suspended + * Does not work in DRD mode + */ + if (params->supported_port_modes != USB_CTLR_MODE_DRD) + USB_CTRL_SET(ctrl, P0_U2PHY_CFG1, COMMONONN); usb_wake_enable_7216(params, false); usb_init_common(params); -- cgit From e9e1b20fae7de06ba36dd3f8dba858157bad233d Mon Sep 17 00:00:00 2001 From: Mika Westerberg Date: Wed, 25 Sep 2024 12:59:20 +0300 Subject: thunderbolt: Fix KASAN reported stack out-of-bounds read in tb_retimer_scan() KASAN reported following issue: BUG: KASAN: stack-out-of-bounds in tb_retimer_scan+0xffe/0x1550 [thunderbolt] Read of size 4 at addr ffff88810111fc1c by task kworker/u56:0/11 CPU: 0 UID: 0 PID: 11 Comm: kworker/u56:0 Tainted: G U 6.11.0+ #1387 Tainted: [U]=USER Workqueue: thunderbolt0 tb_handle_hotplug [thunderbolt] Call Trace: dump_stack_lvl+0x6c/0x90 print_report+0xd1/0x630 kasan_report+0xdb/0x110 __asan_report_load4_noabort+0x14/0x20 tb_retimer_scan+0xffe/0x1550 [thunderbolt] tb_scan_port+0xa6f/0x2060 [thunderbolt] tb_handle_hotplug+0x17b1/0x3080 [thunderbolt] process_one_work+0x626/0x1100 worker_thread+0x6c8/0xfa0 kthread+0x2c8/0x3a0 ret_from_fork+0x3a/0x80 ret_from_fork_asm+0x1a/0x30 This happens because the loop variable still gets incremented by one so max becomes 3 instead of 
2, and this makes the second loop read past the the array declared on the stack. Fix this by assigning to max directly in the loop body. Fixes: ff6ab055e070 ("thunderbolt: Add receiver lane margining support for retimers") CC: stable@vger.kernel.org Signed-off-by: Mika Westerberg --- drivers/thunderbolt/retimer.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/thunderbolt/retimer.c b/drivers/thunderbolt/retimer.c index 721319329afa..7db9869a9f3f 100644 --- a/drivers/thunderbolt/retimer.c +++ b/drivers/thunderbolt/retimer.c @@ -516,7 +516,7 @@ int tb_retimer_scan(struct tb_port *port, bool add) */ tb_retimer_set_inbound_sbtx(port); - for (i = 1; i <= TB_MAX_RETIMER_INDEX; i++) { + for (max = 1, i = 1; i <= TB_MAX_RETIMER_INDEX; i++) { /* * Last retimer is true only for the last on-board * retimer (the one connected directly to the Type-C @@ -527,9 +527,10 @@ int tb_retimer_scan(struct tb_port *port, bool add) last_idx = i; else if (ret < 0) break; + + max = i; } - max = i; ret = 0; /* Add retimers if they do not exist already */ -- cgit From 6e9c5c8ef2820d18492d07172ac52f23ea8a54d9 Mon Sep 17 00:00:00 2001 From: Wolfram Sang Date: Mon, 7 Oct 2024 13:02:01 +0200 Subject: dmaengine: sh: rz-dmac: handle configs where one address is zero Configs like the ones coming from the MMC subsystem will have either 'src' or 'dst' zeroed, resulting in an unknown bus width. This will bail out on the RZ DMA driver because of the sanity check for a valid bus width. Reorder the code, so that the check will only be applied when the corresponding address is non-zero. Fixes: 5000d37042a6 ("dmaengine: sh: Add DMAC driver for RZ/G2L SoC") Signed-off-by: Wolfram Sang Reviewed-by: Biju Das Reviewed-by: Geert Uytterhoeven Tested-by: Biju Das Tested-by: Claudiu Beznea Link: https://lore.kernel.org/r/20241007110200.43166-6-wsa+renesas@sang-engineering.com Signed-off-by: Vinod Koul --- drivers/dma/sh/rz-dmac.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/dma/sh/rz-dmac.c b/drivers/dma/sh/rz-dmac.c index 65a27c5a7bce..811389fc9cb8 100644 --- a/drivers/dma/sh/rz-dmac.c +++ b/drivers/dma/sh/rz-dmac.c @@ -601,22 +601,25 @@ static int rz_dmac_config(struct dma_chan *chan, struct rz_dmac_chan *channel = to_rz_dmac_chan(chan); u32 val; - channel->src_per_address = config->src_addr; channel->dst_per_address = config->dst_addr; - - val = rz_dmac_ds_to_val_mapping(config->dst_addr_width); - if (val == CHCFG_DS_INVALID) - return -EINVAL; - channel->chcfg &= ~CHCFG_FILL_DDS_MASK; - channel->chcfg |= FIELD_PREP(CHCFG_FILL_DDS_MASK, val); + if (channel->dst_per_address) { + val = rz_dmac_ds_to_val_mapping(config->dst_addr_width); + if (val == CHCFG_DS_INVALID) + return -EINVAL; - val = rz_dmac_ds_to_val_mapping(config->src_addr_width); - if (val == CHCFG_DS_INVALID) - return -EINVAL; + channel->chcfg |= FIELD_PREP(CHCFG_FILL_DDS_MASK, val); + } + channel->src_per_address = config->src_addr; channel->chcfg &= ~CHCFG_FILL_SDS_MASK; - channel->chcfg |= FIELD_PREP(CHCFG_FILL_SDS_MASK, val); + if (channel->src_per_address) { + val = rz_dmac_ds_to_val_mapping(config->src_addr_width); + if (val == CHCFG_DS_INVALID) + return -EINVAL; + + channel->chcfg |= FIELD_PREP(CHCFG_FILL_SDS_MASK, val); + } return 0; } -- cgit From d35f40642904b017d1301340734b91aef69d1c0c Mon Sep 17 00:00:00 2001 From: Jai Luthra Date: Mon, 30 Sep 2024 13:02:54 -0400 Subject: dmaengine: ti: k3-udma: Set EOP for all TRs in cyclic BCDMA transfer When receiving data in cyclic mode from 
PDMA peripherals, where reload count is set to infinite, any TR in the set can potentially be the last one of the overall transfer. In such cases, the EOP flag needs to be set in each TR and PDMA's Static TR "Z" parameter should be set, matching the size of the TR. This is required for the teardown to function properly and cleanup the internal state memory. This only affects platforms using BCDMA and not those using UDMA-P, which could set EOP flag in the teardown TR automatically. Similarly when transmitting data in cyclic mode to PDMA peripherals, the EOP flag needs to be set to get the teardown completion signal correctly. Fixes: 017794739702 ("dmaengine: ti: k3-udma: Initial support for K3 BCDMA") Tested-by: Francesco Dolcini # Toradex Verdin AM62 Signed-off-by: Jai Luthra Signed-off-by: Jai Luthra Acked-by: Peter Ujfalusi Link: https://lore.kernel.org/r/20240930-z_cnt-v2-1-9d38aba149a2@linux.dev Signed-off-by: Vinod Koul --- drivers/dma/ti/k3-udma.c | 62 ++++++++++++++++++++++++++++++++++++------------ 1 file changed, 47 insertions(+), 15 deletions(-) diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c index 406ee199c2ac..b3f27b3f9209 100644 --- a/drivers/dma/ti/k3-udma.c +++ b/drivers/dma/ti/k3-udma.c @@ -3185,27 +3185,40 @@ static int udma_configure_statictr(struct udma_chan *uc, struct udma_desc *d, d->static_tr.elcnt = elcnt; - /* - * PDMA must to close the packet when the channel is in packet mode. - * For TR mode when the channel is not cyclic we also need PDMA to close - * the packet otherwise the transfer will stall because PDMA holds on - * the data it has received from the peripheral. - */ if (uc->config.pkt_mode || !uc->cyclic) { + /* + * PDMA must close the packet when the channel is in packet mode. + * For TR mode when the channel is not cyclic we also need PDMA + * to close the packet otherwise the transfer will stall because + * PDMA holds on the data it has received from the peripheral. + */ unsigned int div = dev_width * elcnt; if (uc->cyclic) d->static_tr.bstcnt = d->residue / d->sglen / div; else d->static_tr.bstcnt = d->residue / div; + } else if (uc->ud->match_data->type == DMA_TYPE_BCDMA && + uc->config.dir == DMA_DEV_TO_MEM && + uc->cyclic) { + /* + * For cyclic mode with BCDMA we have to set EOP in each TR to + * prevent short packet errors seen on channel teardown. So the + * PDMA must close the packet after every TR transfer by setting + * burst count equal to the number of bytes transferred. 
+ */ + struct cppi5_tr_type1_t *tr_req = d->hwdesc[0].tr_req_base; - if (uc->config.dir == DMA_DEV_TO_MEM && - d->static_tr.bstcnt > uc->ud->match_data->statictr_z_mask) - return -EINVAL; + d->static_tr.bstcnt = + (tr_req->icnt0 * tr_req->icnt1) / dev_width; } else { d->static_tr.bstcnt = 0; } + if (uc->config.dir == DMA_DEV_TO_MEM && + d->static_tr.bstcnt > uc->ud->match_data->statictr_z_mask) + return -EINVAL; + return 0; } @@ -3450,8 +3463,9 @@ udma_prep_slave_sg(struct dma_chan *chan, struct scatterlist *sgl, /* static TR for remote PDMA */ if (udma_configure_statictr(uc, d, dev_width, burst)) { dev_err(uc->ud->dev, - "%s: StaticTR Z is limited to maximum 4095 (%u)\n", - __func__, d->static_tr.bstcnt); + "%s: StaticTR Z is limited to maximum %u (%u)\n", + __func__, uc->ud->match_data->statictr_z_mask, + d->static_tr.bstcnt); udma_free_hwdesc(uc, d); kfree(d); @@ -3476,6 +3490,7 @@ udma_prep_dma_cyclic_tr(struct udma_chan *uc, dma_addr_t buf_addr, u16 tr0_cnt0, tr0_cnt1, tr1_cnt0; unsigned int i; int num_tr; + u32 period_csf = 0; num_tr = udma_get_tr_counters(period_len, __ffs(buf_addr), &tr0_cnt0, &tr0_cnt1, &tr1_cnt0); @@ -3498,6 +3513,20 @@ udma_prep_dma_cyclic_tr(struct udma_chan *uc, dma_addr_t buf_addr, period_addr = buf_addr | ((u64)uc->config.asel << K3_ADDRESS_ASEL_SHIFT); + /* + * For BCDMA <-> PDMA transfers, the EOP flag needs to be set on the + * last TR of a descriptor, to mark the packet as complete. + * This is required for getting the teardown completion message in case + * of TX, and to avoid short-packet error in case of RX. + * + * As we are in cyclic mode, we do not know which period might be the + * last one, so set the flag for each period. + */ + if (uc->config.ep_type == PSIL_EP_PDMA_XY && + uc->ud->match_data->type == DMA_TYPE_BCDMA) { + period_csf = CPPI5_TR_CSF_EOP; + } + for (i = 0; i < periods; i++) { int tr_idx = i * num_tr; @@ -3525,8 +3554,10 @@ udma_prep_dma_cyclic_tr(struct udma_chan *uc, dma_addr_t buf_addr, } if (!(flags & DMA_PREP_INTERRUPT)) - cppi5_tr_csf_set(&tr_req[tr_idx].flags, - CPPI5_TR_CSF_SUPR_EVT); + period_csf |= CPPI5_TR_CSF_SUPR_EVT; + + if (period_csf) + cppi5_tr_csf_set(&tr_req[tr_idx].flags, period_csf); period_addr += period_len; } @@ -3655,8 +3686,9 @@ udma_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t buf_addr, size_t buf_len, /* static TR for remote PDMA */ if (udma_configure_statictr(uc, d, dev_width, burst)) { dev_err(uc->ud->dev, - "%s: StaticTR Z is limited to maximum 4095 (%u)\n", - __func__, d->static_tr.bstcnt); + "%s: StaticTR Z is limited to maximum %u (%u)\n", + __func__, uc->ud->match_data->statictr_z_mask, + d->static_tr.bstcnt); udma_free_hwdesc(uc, d); kfree(d); -- cgit From 3cc4e13bb1617f6a13e5e6882465984148743cf4 Mon Sep 17 00:00:00 2001 From: Xiu Jianfeng Date: Sat, 12 Oct 2024 07:22:46 +0000 Subject: cgroup: Fix potential overflow issue when checking max_depth MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. However due to the cgroup->max_depth is of int type and having the default value INT_MAX, the condition 'level > cgroup->max_depth' will never be satisfied, and it will cause an overflow of the level after it reaches to INT_MAX. Fix it by starting the level from 0 and using '>=' instead. 
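To make the off-by-one concrete, here is a minimal user-space sketch of the corrected walk, condensed from the fix below with a hypothetical stand-in struct: with the old 'level = 1' / '>' pair, a max_depth of INT_MAX could never reject and 'level' itself could eventually overflow.

  #include <limits.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct cg {                             /* stand-in for struct cgroup */
          struct cg *parent;
          int max_depth;                  /* defaults to INT_MAX ("max") */
  };

  static bool depth_ok(const struct cg *parent)
  {
          int level = 0;                          /* was 1 */

          for (const struct cg *c = parent; c; c = c->parent) {
                  if (level >= c->max_depth)      /* was 'level > c->max_depth' */
                          return false;
                  level++;
          }
          return true;
  }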
It's worth mentioning that this issue is unlikely to occur in reality, as it's impossible to have a depth of INT_MAX hierarchy, but should be be avoided logically. Fixes: 1a926e0bbab8 ("cgroup: implement hierarchy limits") Signed-off-by: Xiu Jianfeng Reviewed-by: Michal Koutný Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 5886b95c6eae..044c7ba1cc48 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5789,7 +5789,7 @@ static bool cgroup_check_hierarchy_limits(struct cgroup *parent) { struct cgroup *cgroup; int ret = false; - int level = 1; + int level = 0; lockdep_assert_held(&cgroup_mutex); @@ -5797,7 +5797,7 @@ static bool cgroup_check_hierarchy_limits(struct cgroup *parent) if (cgroup->nr_descendants >= cgroup->max_descendants) goto fail; - if (level > cgroup->max_depth) + if (level >= cgroup->max_depth) goto fail; level++; -- cgit From e15d84b3bba187aa372dff7c58ce1fd5cb48a076 Mon Sep 17 00:00:00 2001 From: Manikanta Pubbisetty Date: Tue, 15 Oct 2024 12:11:03 +0530 Subject: wifi: ath10k: Fix memory leak in management tx In the current logic, memory is allocated for storing the MSDU context during management packet TX but this memory is not being freed during management TX completion. Similar leaks are seen in the management TX cleanup logic. Kmemleak reports this problem as below, unreferenced object 0xffffff80b64ed250 (size 16): comm "kworker/u16:7", pid 148, jiffies 4294687130 (age 714.199s) hex dump (first 16 bytes): 00 2b d8 d8 80 ff ff ff c4 74 e9 fd 07 00 00 00 .+.......t...... backtrace: [] __kmem_cache_alloc_node+0x1e4/0x2d8 [] kmalloc_trace+0x48/0x110 [] ath10k_wmi_tlv_op_gen_mgmt_tx_send+0xd4/0x1d8 [ath10k_core] [] ath10k_mgmt_over_wmi_tx_work+0x134/0x298 [ath10k_core] [] process_scheduled_works+0x1ac/0x400 [] worker_thread+0x208/0x328 [] kthread+0x100/0x1c0 [] ret_from_fork+0x10/0x20 Free the memory during completion and cleanup to fix the leak. Protect the mgmt_pending_tx idr_remove() operation in ath10k_wmi_tlv_op_cleanup_mgmt_tx_send() using ar->data_lock similar to other instances. 
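The ownership rule being enforced can be sketched generically. This is a hedged example with made-up names, not the driver's functions: whatever the TX path kmalloc()s and parks in the IDR must be kfree()d by whoever idr_remove()s it, on both the completion and the cleanup paths, and the removal belongs under the same lock as the insertion.

  #include <linux/idr.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct tx_ctx {                         /* stands in for the per-MSDU context */
          dma_addr_t paddr;
          void *vaddr;
  };

  static int tx_park_ctx(struct idr *pending, spinlock_t *lock,
                         struct tx_ctx *ctx)
  {
          int id;

          spin_lock_bh(lock);
          id = idr_alloc(pending, ctx, 0, 0x10000, GFP_ATOMIC);
          spin_unlock_bh(lock);

          return id;                      /* becomes the msdu_id */
  }

  static void tx_release_ctx(struct idr *pending, spinlock_t *lock, int id)
  {
          struct tx_ctx *ctx;

          spin_lock_bh(lock);
          ctx = idr_remove(pending, id);  /* detach from the IDR under the lock */
          spin_unlock_bh(lock);

          kfree(ctx);                     /* the free that was missing */
  }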
Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.2.0-01387-QCAHLSWMTPLZ-1 Fixes: dc405152bb64 ("ath10k: handle mgmt tx completion event") Fixes: c730c477176a ("ath10k: Remove msdu from idr when management pkt send fails") Cc: stable@vger.kernel.org Signed-off-by: Manikanta Pubbisetty Link: https://patch.msgid.link/20241015064103.6060-1-quic_mpubbise@quicinc.com Signed-off-by: Jeff Johnson --- drivers/net/wireless/ath/ath10k/wmi-tlv.c | 7 ++++++- drivers/net/wireless/ath/ath10k/wmi.c | 2 ++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/net/wireless/ath/ath10k/wmi-tlv.c b/drivers/net/wireless/ath/ath10k/wmi-tlv.c index dbaf26d6a7a6..16d07d619b4d 100644 --- a/drivers/net/wireless/ath/ath10k/wmi-tlv.c +++ b/drivers/net/wireless/ath/ath10k/wmi-tlv.c @@ -3043,9 +3043,14 @@ ath10k_wmi_tlv_op_cleanup_mgmt_tx_send(struct ath10k *ar, struct sk_buff *msdu) { struct ath10k_skb_cb *cb = ATH10K_SKB_CB(msdu); + struct ath10k_mgmt_tx_pkt_addr *pkt_addr; struct ath10k_wmi *wmi = &ar->wmi; - idr_remove(&wmi->mgmt_pending_tx, cb->msdu_id); + spin_lock_bh(&ar->data_lock); + pkt_addr = idr_remove(&wmi->mgmt_pending_tx, cb->msdu_id); + spin_unlock_bh(&ar->data_lock); + + kfree(pkt_addr); return 0; } diff --git a/drivers/net/wireless/ath/ath10k/wmi.c b/drivers/net/wireless/ath/ath10k/wmi.c index fe2344598364..1d58452ed8ca 100644 --- a/drivers/net/wireless/ath/ath10k/wmi.c +++ b/drivers/net/wireless/ath/ath10k/wmi.c @@ -2441,6 +2441,7 @@ wmi_process_mgmt_tx_comp(struct ath10k *ar, struct mgmt_tx_compl_params *param) dma_unmap_single(ar->dev, pkt_addr->paddr, msdu->len, DMA_TO_DEVICE); info = IEEE80211_SKB_CB(msdu); + kfree(pkt_addr); if (param->status) { info->flags &= ~IEEE80211_TX_STAT_ACK; @@ -9612,6 +9613,7 @@ static int ath10k_wmi_mgmt_tx_clean_up_pending(int msdu_id, void *ptr, dma_unmap_single(ar->dev, pkt_addr->paddr, msdu->len, DMA_TO_DEVICE); ieee80211_free_txskb(ar->hw, msdu); + kfree(pkt_addr); return 0; } -- cgit From befd716ed429b26eca7abde95da6195c548470de Mon Sep 17 00:00:00 2001 From: Remi Pommarel Date: Tue, 24 Sep 2024 21:41:19 +0200 Subject: wifi: ath11k: Fix invalid ring usage in full monitor mode On full monitor HW the monitor destination rxdma ring does not have the same descriptor format as in the "classical" mode. The full monitor destination entries are of hal_sw_monitor_ring type and fetched using ath11k_dp_full_mon_process_rx while the classical ones are of type hal_reo_entrance_ring and fetched with ath11k_dp_rx_mon_dest_process. Although both hal_sw_monitor_ring and hal_reo_entrance_ring are of same size, the offset to useful info (such as sw_cookie, paddr, etc) are different. 
Thus if ath11k_dp_rx_mon_dest_process gets called on full monitor destination ring, invalid skb buffer id will be fetched from DMA ring causing issues such as the following rcu_sched stall: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 0-....: (1 GPs behind) idle=c67/0/0x7 softirq=45768/45769 fqs=1012 (t=2100 jiffies g=14817 q=8703) Task dump for CPU 0: task:swapper/0 state:R running task stack: 0 pid: 0 ppid: 0 flags:0x0000000a Call trace: dump_backtrace+0x0/0x160 show_stack+0x14/0x20 sched_show_task+0x158/0x184 dump_cpu_task+0x40/0x4c rcu_dump_cpu_stacks+0xec/0x12c rcu_sched_clock_irq+0x6c8/0x8a0 update_process_times+0x88/0xd0 tick_sched_timer+0x74/0x1e0 __hrtimer_run_queues+0x150/0x204 hrtimer_interrupt+0xe4/0x240 arch_timer_handler_phys+0x30/0x40 handle_percpu_devid_irq+0x80/0x130 handle_domain_irq+0x5c/0x90 gic_handle_irq+0x8c/0xb4 do_interrupt_handler+0x30/0x54 el1_interrupt+0x2c/0x4c el1h_64_irq_handler+0x14/0x1c el1h_64_irq+0x74/0x78 do_raw_spin_lock+0x60/0x100 _raw_spin_lock_bh+0x1c/0x2c ath11k_dp_rx_mon_mpdu_pop.constprop.0+0x174/0x650 ath11k_dp_rx_process_mon_status+0x8b4/0xa80 ath11k_dp_rx_process_mon_rings+0x244/0x510 ath11k_dp_service_srng+0x190/0x300 ath11k_pcic_ext_grp_napi_poll+0x30/0xc0 __napi_poll+0x34/0x174 net_rx_action+0xf8/0x2a0 _stext+0x12c/0x2ac irq_exit+0x94/0xc0 handle_domain_irq+0x60/0x90 gic_handle_irq+0x8c/0xb4 call_on_irq_stack+0x28/0x44 do_interrupt_handler+0x4c/0x54 el1_interrupt+0x2c/0x4c el1h_64_irq_handler+0x14/0x1c el1h_64_irq+0x74/0x78 arch_cpu_idle+0x14/0x20 do_idle+0xf0/0x130 cpu_startup_entry+0x24/0x50 rest_init+0xf8/0x104 arch_call_rest_init+0xc/0x14 start_kernel+0x56c/0x58c __primary_switched+0xa0/0xa8 Thus ath11k_dp_rx_mon_dest_process(), which use classical destination entry format, should no be called on full monitor capable HW. Fixes: 67a9d399fcb0 ("ath11k: enable RX PPDU stats in monitor co-exist mode") Signed-off-by: Remi Pommarel Reviewed-by: Praneesh P Link: https://patch.msgid.link/20240924194119.15942-1-repk@triplefau.lt Signed-off-by: Jeff Johnson --- drivers/net/wireless/ath/ath11k/dp_rx.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/wireless/ath/ath11k/dp_rx.c b/drivers/net/wireless/ath/ath11k/dp_rx.c index 86485580dd89..2b66bfb01140 100644 --- a/drivers/net/wireless/ath/ath11k/dp_rx.c +++ b/drivers/net/wireless/ath/ath11k/dp_rx.c @@ -5291,8 +5291,11 @@ int ath11k_dp_rx_process_mon_status(struct ath11k_base *ab, int mac_id, hal_status == HAL_TLV_STATUS_PPDU_DONE) { rx_mon_stats->status_ppdu_done++; pmon->mon_ppdu_status = DP_PPDU_STATUS_DONE; - ath11k_dp_rx_mon_dest_process(ar, mac_id, budget, napi); - pmon->mon_ppdu_status = DP_PPDU_STATUS_START; + if (!ab->hw_params.full_monitor_mode) { + ath11k_dp_rx_mon_dest_process(ar, mac_id, + budget, napi); + pmon->mon_ppdu_status = DP_PPDU_STATUS_START; + } } if (ppdu_info->peer_id == HAL_INVALID_PEERID || -- cgit From 938ade15abaea765dfab32d906de45657067c11f Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Mon, 16 Sep 2024 10:23:05 +0200 Subject: dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: add missing x1e80100 pipediv2 clocks The x1e80100 QMP PCIe PHYs all have a pipediv2 clock that needs to be described. 
Fixes: e94b29f2bd73 ("dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Document the X1E80100 QMP PCIe PHYs") Cc: Abel Vesa Signed-off-by: Johan Hovold Acked-by: Rob Herring (Arm) Link: https://lore.kernel.org/r/20240916082307.29393-2-johan+linaro@kernel.org Signed-off-by: Vinod Koul --- Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml b/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml index dcf4fa55fbba..447b379aa83d 100644 --- a/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml +++ b/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml @@ -154,8 +154,6 @@ allOf: - qcom,sm8550-qmp-gen4x2-pcie-phy - qcom,sm8650-qmp-gen3x2-pcie-phy - qcom,sm8650-qmp-gen4x2-pcie-phy - - qcom,x1e80100-qmp-gen3x2-pcie-phy - - qcom,x1e80100-qmp-gen4x2-pcie-phy then: properties: clocks: @@ -171,6 +169,8 @@ allOf: - qcom,sc8280xp-qmp-gen3x1-pcie-phy - qcom,sc8280xp-qmp-gen3x2-pcie-phy - qcom,sc8280xp-qmp-gen3x4-pcie-phy + - qcom,x1e80100-qmp-gen3x2-pcie-phy + - qcom,x1e80100-qmp-gen4x2-pcie-phy - qcom,x1e80100-qmp-gen4x4-pcie-phy then: properties: -- cgit From bd9e4d4a3b127686efc60096271b0a44c3100061 Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Wed, 11 Sep 2024 13:52:50 +0200 Subject: phy: qcom: qmp-usb: fix NULL-deref on runtime suspend Commit 413db06c05e7 ("phy: qcom-qmp-usb: clean up probe initialisation") removed most users of the platform device driver data, but mistakenly also removed the initialisation despite the data still being used in the runtime PM callbacks. Restore the driver data initialisation at probe to avoid a NULL-pointer dereference on runtime suspend. Apparently no one uses runtime PM, which currently needs to be enabled manually through sysfs, with this driver. Fixes: 413db06c05e7 ("phy: qcom-qmp-usb: clean up probe initialisation") Cc: stable@vger.kernel.org # 6.2 Signed-off-by: Johan Hovold Reviewed-by: Dmitry Baryshkov Link: https://lore.kernel.org/r/20240911115253.10920-2-johan+linaro@kernel.org Signed-off-by: Vinod Koul --- drivers/phy/qualcomm/phy-qcom-qmp-usb.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-usb.c b/drivers/phy/qualcomm/phy-qcom-qmp-usb.c index 2fd49355aa37..1246d3bc8b92 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-usb.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-usb.c @@ -2179,6 +2179,7 @@ static int qmp_usb_probe(struct platform_device *pdev) return -ENOMEM; qmp->dev = dev; + dev_set_drvdata(dev, qmp); qmp->cfg = of_device_get_match_data(dev); if (!qmp->cfg) -- cgit From 29240130ab77c80bea1464317ae2a5fd29c16a0c Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Wed, 11 Sep 2024 13:52:51 +0200 Subject: phy: qcom: qmp-usb-legacy: fix NULL-deref on runtime suspend Commit 413db06c05e7 ("phy: qcom-qmp-usb: clean up probe initialisation") removed most users of the platform device driver data from the qcom-qmp-usb driver, but mistakenly also removed the initialisation despite the data still being used in the runtime PM callbacks. This bug was later reproduced when the driver was copied to create the qmp-usb-legacy driver. Restore the driver data initialisation at probe to avoid a NULL-pointer dereference on runtime suspend. Apparently no one uses runtime PM, which currently needs to be enabled manually through sysfs, with these drivers. 
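The failure mode is the generic one for runtime PM callbacks, sketched below with a hypothetical private struct rather than the driver's exact code: the callbacks can only find the driver state through the struct device, so probe has to publish it.

  #include <linux/clk.h>
  #include <linux/device.h>
  #include <linux/platform_device.h>
  #include <linux/pm_runtime.h>
  #include <linux/slab.h>

  struct qmp_priv {                       /* hypothetical stand-in for the PHY's private data */
          struct device *dev;
          struct clk_bulk_data *clks;
          int num_clks;
  };

  static int qmp_runtime_suspend(struct device *dev)
  {
          struct qmp_priv *qmp = dev_get_drvdata(dev);    /* NULL if probe skipped dev_set_drvdata() */

          clk_bulk_disable_unprepare(qmp->num_clks, qmp->clks);   /* oops here on runtime suspend */
          return 0;
  }

  static int qmp_probe(struct platform_device *pdev)
  {
          struct device *dev = &pdev->dev;
          struct qmp_priv *qmp;

          qmp = devm_kzalloc(dev, sizeof(*qmp), GFP_KERNEL);
          if (!qmp)
                  return -ENOMEM;

          qmp->dev = dev;
          dev_set_drvdata(dev, qmp);      /* the assignment this series restores */

          pm_runtime_enable(dev);
          return 0;
  }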
Fixes: e464a3180a43 ("phy: qcom-qmp-usb: split off the legacy USB+dp_com support") Cc: stable@vger.kernel.org # 6.6 Signed-off-by: Johan Hovold Reviewed-by: Dmitry Baryshkov Link: https://lore.kernel.org/r/20240911115253.10920-3-johan+linaro@kernel.org Signed-off-by: Vinod Koul --- drivers/phy/qualcomm/phy-qcom-qmp-usb-legacy.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-usb-legacy.c b/drivers/phy/qualcomm/phy-qcom-qmp-usb-legacy.c index 6d0ba39c1943..8bf951b0490c 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-usb-legacy.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-usb-legacy.c @@ -1248,6 +1248,7 @@ static int qmp_usb_legacy_probe(struct platform_device *pdev) return -ENOMEM; qmp->dev = dev; + dev_set_drvdata(dev, qmp); qmp->cfg = of_device_get_match_data(dev); if (!qmp->cfg) -- cgit From 34c21f94fa1e147a19b54b6adf0c93a623b70dd8 Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Wed, 11 Sep 2024 13:52:52 +0200 Subject: phy: qcom: qmp-usbc: fix NULL-deref on runtime suspend Commit 413db06c05e7 ("phy: qcom-qmp-usb: clean up probe initialisation") removed most users of the platform device driver data from the qcom-qmp-usb driver, but mistakenly also removed the initialisation despite the data still being used in the runtime PM callbacks. This bug was later reproduced when the driver was copied to create the qmp-usbc driver. Restore the driver data initialisation at probe to avoid a NULL-pointer dereference on runtime suspend. Apparently no one uses runtime PM, which currently needs to be enabled manually through sysfs, with these drivers. Fixes: 19281571a4d5 ("phy: qcom: qmp-usb: split USB-C PHY driver") Cc: stable@vger.kernel.org # 6.9 Signed-off-by: Johan Hovold Reviewed-by: Dmitry Baryshkov Link: https://lore.kernel.org/r/20240911115253.10920-4-johan+linaro@kernel.org Signed-off-by: Vinod Koul --- drivers/phy/qualcomm/phy-qcom-qmp-usbc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-usbc.c b/drivers/phy/qualcomm/phy-qcom-qmp-usbc.c index d4fa1063ea61..cf12a6f12134 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-usbc.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-usbc.c @@ -1050,6 +1050,7 @@ static int qmp_usbc_probe(struct platform_device *pdev) return -ENOMEM; qmp->dev = dev; + dev_set_drvdata(dev, qmp); qmp->orientation = TYPEC_ORIENTATION_NORMAL; -- cgit From 1dd196f9004848d0318e8831f962cc76255431d8 Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Wed, 11 Sep 2024 13:52:53 +0200 Subject: phy: qcom: qmp-combo: move driver data initialisation earlier Commit 44aff8e31080 ("phy: qcom-qmp-combo: clean up probe initialisation") removed most users of the platform device driver data, but mistakenly also removed the initialisation despite the data still being used in the runtime PM callbacks. The initialisation was soon after restored by commit 83a0bbe39b17 ("phy: qcom-qmp-combo: add support for updated sc8280xp binding") but now happens slightly later during probe. This should not cause any trouble currently as runtime PM needs to be enabled manually through sysfs and the platform device would not be suspended before the PHY has been registered anyway. Move the driver data initialisation to avoid a NULL-pointer dereference on runtime suspend if runtime PM is ever enabled by default in this driver. 
Fixes: 44aff8e31080 ("phy: qcom-qmp-combo: clean up probe initialisation") Signed-off-by: Johan Hovold Reviewed-by: Dmitry Baryshkov Link: https://lore.kernel.org/r/20240911115253.10920-5-johan+linaro@kernel.org Signed-off-by: Vinod Koul --- drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c index a8adc3214bfe..643045c9024e 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c @@ -3673,6 +3673,7 @@ static int qmp_combo_probe(struct platform_device *pdev) return -ENOMEM; qmp->dev = dev; + dev_set_drvdata(dev, qmp); qmp->orientation = TYPEC_ORIENTATION_NORMAL; @@ -3749,8 +3750,6 @@ static int qmp_combo_probe(struct platform_device *pdev) phy_set_drvdata(qmp->dp_phy, qmp); - dev_set_drvdata(dev, qmp); - if (usb_np == dev->of_node) phy_provider = devm_of_phy_provider_register(dev, qmp_combo_phy_xlate); else -- cgit From ab8aaab874c4aa378e76d0a55ce6e0fad6e042a2 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Fri, 11 Oct 2024 15:20:17 -0300 Subject: tools headers UAPI: Sync linux/const.h with the kernel headers To pick up the changes in: 947697c6f0f75f98 ("uapi: Define GENMASK_U128") That causes no changes in tooling, just addresses this perf build warning: Warning: Kernel ABI header differences: diff -u tools/include/uapi/linux/const.h include/uapi/linux/const.h Cc: Adrian Hunter Cc: Anshuman Khandual Cc: Ian Rogers Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Cc: Yury Norov Link: https://lore.kernel.org/lkml/ZwltGNJwujKu1Fgn@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/include/uapi/linux/const.h | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/tools/include/uapi/linux/const.h b/tools/include/uapi/linux/const.h index a429381e7ca5..e16be0d37746 100644 --- a/tools/include/uapi/linux/const.h +++ b/tools/include/uapi/linux/const.h @@ -28,6 +28,23 @@ #define _BITUL(x) (_UL(1) << (x)) #define _BITULL(x) (_ULL(1) << (x)) +#if !defined(__ASSEMBLY__) +/* + * Missing asm support + * + * __BIT128() would not work in the asm code, as it shifts an + * 'unsigned __init128' data type as direct representation of + * 128 bit constants is not supported in the gcc compiler, as + * they get silently truncated. + * + * TODO: Please revisit this implementation when gcc compiler + * starts representing 128 bit constants directly like long + * and unsigned long etc. Subsequently drop the comment for + * GENMASK_U128() which would then start supporting asm code. + */ +#define _BIT128(x) ((unsigned __int128)(1) << (x)) +#endif + #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (__typeof__(x))(a) - 1) #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) -- cgit From 39c6a356201ebbd7e1db5be53fbb46ef4bfc70a4 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Fri, 11 Oct 2024 16:10:01 -0300 Subject: perf trace: The return from 'write' isn't a pid When adding a explicit beautifier for the 'write' syscall when the BPF based buffer collector was introduced there was a cut'n'paste error that carried the syscall_fmt->errpid setting from a nearby syscall (waitid) that returns a pid. So the write return was being suppressed by the return pretty printer, remove that field, reverting it back to the default return handler, that prints positive numbers as-is and interpret negative values as errnos. 
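For reference, the default behaviour being restored can be approximated as follows. This is an illustrative stand-in, not perf's code; perf prints symbolic errno names rather than strerror() text.

  #include <stdio.h>
  #include <string.h>

  /* write() returns a byte count, not a pid: non-negative results are
   * printed verbatim, negative ones are decoded as errors. */
  static void print_sys_exit(const char *name, long ret)
  {
          if (ret >= 0)
                  printf("%s(...) = %ld\n", name, ret);
          else
                  printf("%s(...) = -1 (%s)\n", name, strerror((int)-ret));
  }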
I actually introduced the problem while making Howard's original patch work just with the 'write' syscall, as we couldn't just look for any buffers, the ones that are filled in by the kernel couldn't use the same sys_enter BPF collector. Fixes: b257fac12f38d7f5 ("perf trace: Pretty print buffer data") Reported-by: James Clark Link: https://lore.kernel.org/lkml/bcf50648-3c7e-4513-8717-0d14492c53b9@linaro.org Link: https://lore.kernel.org/all/Zt8jTfzDYgBPvFCd@x1/#t Cc: Adrian Hunter Cc: Alan Maguire Cc: Howard Chu Cc: Ian Rogers Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/builtin-trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c index f6e847529073..d3f11b90d025 100644 --- a/tools/perf/builtin-trace.c +++ b/tools/perf/builtin-trace.c @@ -1399,7 +1399,7 @@ static const struct syscall_fmt syscall_fmts[] = { .arg = { [2] = { .scnprintf = SCA_WAITID_OPTIONS, /* options */ }, }, }, { .name = "waitid", .errpid = true, .arg = { [3] = { .scnprintf = SCA_WAITID_OPTIONS, /* options */ }, }, }, - { .name = "write", .errpid = true, + { .name = "write", .arg = { [1] = { .scnprintf = SCA_BUF /* buf */, .from_user = true, }, }, }, }; -- cgit From aa70ff0945fea2ed14046273609d04725f222616 Mon Sep 17 00:00:00 2001 From: Ping-Ke Shih Date: Tue, 24 Sep 2024 10:16:33 +0800 Subject: wifi: rtw89: pci: early chips only enable 36-bit DMA on specific PCI hosts MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The early chips including RTL8852A, RTL8851B, RTL8852B and RTL8852BT have interoperability problems of 36-bit DMA with some PCI hosts. Rollback to 32-bit DMA by default, and only enable 36-bit DMA for tested platforms. Since all Intel platforms we have can work correctly, add the vendor ID to white list. Otherwise, list vendor/device ID of bridge we have tested. 
Fixes: 1fd4b3fe52ef ("wifi: rtw89: pci: support 36-bit PCI DMA address") Reported-by: Marcel Weißenbach Closes: https://lore.kernel.org/linux-wireless/20240918073237.Horde.VLueh0_KaiDw-9asEEcdM84@ignaz.org/T/#m07c5694df1acb173a42e1a0bab7ac22bd231a2b8 Signed-off-by: Ping-Ke Shih Tested-by: Marcel Weißenbach Signed-off-by: Kalle Valo Link: https://patch.msgid.link/20240924021633.19861-1-pkshih@realtek.com --- drivers/net/wireless/realtek/rtw89/pci.c | 48 +++++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 7 deletions(-) diff --git a/drivers/net/wireless/realtek/rtw89/pci.c b/drivers/net/wireless/realtek/rtw89/pci.c index 02afeb3acce4..5aef7fa37878 100644 --- a/drivers/net/wireless/realtek/rtw89/pci.c +++ b/drivers/net/wireless/realtek/rtw89/pci.c @@ -3026,24 +3026,54 @@ static void rtw89_pci_declaim_device(struct rtw89_dev *rtwdev, pci_disable_device(pdev); } -static void rtw89_pci_cfg_dac(struct rtw89_dev *rtwdev) +static bool rtw89_pci_chip_is_manual_dac(struct rtw89_dev *rtwdev) { - struct rtw89_pci *rtwpci = (struct rtw89_pci *)rtwdev->priv; const struct rtw89_chip_info *chip = rtwdev->chip; - if (!rtwpci->enable_dac) - return; - switch (chip->chip_id) { case RTL8852A: case RTL8852B: case RTL8851B: case RTL8852BT: - break; + return true; default: - return; + return false; + } +} + +static bool rtw89_pci_is_dac_compatible_bridge(struct rtw89_dev *rtwdev) +{ + struct rtw89_pci *rtwpci = (struct rtw89_pci *)rtwdev->priv; + struct pci_dev *bridge = pci_upstream_bridge(rtwpci->pdev); + + if (!rtw89_pci_chip_is_manual_dac(rtwdev)) + return true; + + if (!bridge) + return false; + + switch (bridge->vendor) { + case PCI_VENDOR_ID_INTEL: + return true; + case PCI_VENDOR_ID_ASMEDIA: + if (bridge->device == 0x2806) + return true; + break; } + return false; +} + +static void rtw89_pci_cfg_dac(struct rtw89_dev *rtwdev) +{ + struct rtw89_pci *rtwpci = (struct rtw89_pci *)rtwdev->priv; + + if (!rtwpci->enable_dac) + return; + + if (!rtw89_pci_chip_is_manual_dac(rtwdev)) + return; + rtw89_pci_config_byte_set(rtwdev, RTW89_PCIE_L1_CTRL, RTW89_PCIE_BIT_EN_64BITS); } @@ -3061,6 +3091,9 @@ static int rtw89_pci_setup_mapping(struct rtw89_dev *rtwdev, goto err; } + if (!rtw89_pci_is_dac_compatible_bridge(rtwdev)) + goto no_dac; + ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(36)); if (!ret) { rtwpci->enable_dac = true; @@ -3073,6 +3106,7 @@ static int rtw89_pci_setup_mapping(struct rtw89_dev *rtwdev, goto err_release_regions; } } +no_dac: resource_len = pci_resource_len(pdev, bar_id); rtwpci->mmap = pci_iomap(pdev, bar_id, resource_len); -- cgit From b73b2069528f90ec49d5fa1010a759baa2c2be05 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Tue, 24 Sep 2024 14:09:32 +0200 Subject: wifi: brcm80211: BRCM_TRACING should depend on TRACING When tracing is disabled, there is no point in asking the user about enabling Broadcom wireless device tracing. 
Fixes: f5c4f10852d42012 ("brcm80211: Allow trace support to be enabled separately from debug") Signed-off-by: Geert Uytterhoeven Acked-by: Arend van Spriel Signed-off-by: Kalle Valo Link: https://patch.msgid.link/81a29b15eaacc1ac1fb421bdace9ac0c3385f40f.1727179742.git.geert@linux-m68k.org --- drivers/net/wireless/broadcom/brcm80211/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/wireless/broadcom/brcm80211/Kconfig b/drivers/net/wireless/broadcom/brcm80211/Kconfig index 3a1a35b5672f..19d0c003f626 100644 --- a/drivers/net/wireless/broadcom/brcm80211/Kconfig +++ b/drivers/net/wireless/broadcom/brcm80211/Kconfig @@ -27,6 +27,7 @@ source "drivers/net/wireless/broadcom/brcm80211/brcmfmac/Kconfig" config BRCM_TRACING bool "Broadcom device tracing" depends on BRCMSMAC || BRCMFMAC + depends on TRACING help If you say Y here, the Broadcom wireless drivers will register with ftrace to dump event information into the trace ringbuffer. -- cgit From 4aefde403da7af30757915e0462d88398c9388c5 Mon Sep 17 00:00:00 2001 From: Bitterblue Smith Date: Tue, 8 Oct 2024 21:44:02 +0300 Subject: wifi: rtw88: Fix the RX aggregation in USB 3 mode RTL8822CU, RTL8822BU, and RTL8821CU don't need BIT_EN_PRE_CALC. In fact, RTL8822BU in USB 3 mode doesn't pass all the frames to the driver, resulting in much lower download speed than normal: $ iperf3 -c 192.168.0.1 -R Connecting to host 192.168.0.1, port 5201 Reverse mode, remote host 192.168.0.1 is sending [ 5] local 192.168.0.50 port 43062 connected to 192.168.0.1 port 5201 [ ID] Interval Transfer Bitrate [ 5] 0.00-1.00 sec 26.9 MBytes 225 Mbits/sec [ 5] 1.00-2.00 sec 7.50 MBytes 62.9 Mbits/sec [ 5] 2.00-3.00 sec 8.50 MBytes 71.3 Mbits/sec [ 5] 3.00-4.00 sec 8.38 MBytes 70.3 Mbits/sec [ 5] 4.00-5.00 sec 7.75 MBytes 65.0 Mbits/sec [ 5] 5.00-6.00 sec 8.00 MBytes 67.1 Mbits/sec [ 5] 6.00-7.00 sec 8.00 MBytes 67.1 Mbits/sec [ 5] 7.00-8.00 sec 7.75 MBytes 65.0 Mbits/sec [ 5] 8.00-9.00 sec 7.88 MBytes 66.1 Mbits/sec [ 5] 9.00-10.00 sec 7.88 MBytes 66.1 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.02 sec 102 MBytes 85.1 Mbits/sec 224 sender [ 5] 0.00-10.00 sec 98.6 MBytes 82.7 Mbits/sec receiver Don't set BIT_EN_PRE_CALC. 
Then the speed is much better: % iperf3 -c 192.168.0.1 -R Connecting to host 192.168.0.1, port 5201 Reverse mode, remote host 192.168.0.1 is sending [ 5] local 192.168.0.50 port 39000 connected to 192.168.0.1 port 5201 [ ID] Interval Transfer Bitrate [ 5] 0.00-1.00 sec 52.8 MBytes 442 Mbits/sec [ 5] 1.00-2.00 sec 71.9 MBytes 603 Mbits/sec [ 5] 2.00-3.00 sec 74.8 MBytes 627 Mbits/sec [ 5] 3.00-4.00 sec 75.9 MBytes 636 Mbits/sec [ 5] 4.00-5.00 sec 76.0 MBytes 638 Mbits/sec [ 5] 5.00-6.00 sec 74.1 MBytes 622 Mbits/sec [ 5] 6.00-7.00 sec 74.0 MBytes 621 Mbits/sec [ 5] 7.00-8.00 sec 76.0 MBytes 638 Mbits/sec [ 5] 8.00-9.00 sec 74.4 MBytes 624 Mbits/sec [ 5] 9.00-10.00 sec 63.9 MBytes 536 Mbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 717 MBytes 601 Mbits/sec 24 sender [ 5] 0.00-10.00 sec 714 MBytes 599 Mbits/sec receiver Fixes: 002a5db9a52a ("wifi: rtw88: Enable USB RX aggregation for 8822c/8822b/8821c") Signed-off-by: Bitterblue Smith Acked-by: Ping-Ke Shih Signed-off-by: Kalle Valo Link: https://patch.msgid.link/afb94a82-3d18-459e-97fc-1a217608cdf0@gmail.com --- drivers/net/wireless/realtek/rtw88/usb.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtw88/usb.c b/drivers/net/wireless/realtek/rtw88/usb.c index e83ab6fb83f5..b17a429bcd29 100644 --- a/drivers/net/wireless/realtek/rtw88/usb.c +++ b/drivers/net/wireless/realtek/rtw88/usb.c @@ -771,7 +771,6 @@ static void rtw_usb_dynamic_rx_agg_v1(struct rtw_dev *rtwdev, bool enable) u8 size, timeout; u16 val16; - rtw_write32_set(rtwdev, REG_RXDMA_AGG_PG_TH, BIT_EN_PRE_CALC); rtw_write8_set(rtwdev, REG_TXDMA_PQ_MAP, BIT_RXDMA_AGG_EN); rtw_write8_clr(rtwdev, REG_RXDMA_AGG_PG_TH + 3, BIT(7)); -- cgit From a95d28a8a2f76c591a195c06ea15f5b15c66c3d1 Mon Sep 17 00:00:00 2001 From: Bitterblue Smith Date: Thu, 10 Oct 2024 18:34:43 +0300 Subject: wifi: rtlwifi: rtl8192du: Don't claim USB ID 0bda:8171 This ID appears to be RTL8188SU, not RTL8192DU. This is the wrong driver for RTL8188SU. The r8712u driver from staging handles this ID. I think this ID comes from the original rtl8192du driver from Realtek. I don't know if they added it by mistake, or it was actually used for two different chips. RTL8188SU with this ID exists in the wild. RTL8192DU with this ID probably doesn't. 
Fixes: b5dc8873b6ff ("wifi: rtlwifi: Add rtl8192du/sw.c") Cc: stable@vger.kernel.org # v6.11 Closes: https://github.com/lwfinger/rtl8192du/issues/105 Signed-off-by: Bitterblue Smith Acked-by: Ping-Ke Shih Signed-off-by: Kalle Valo Link: https://patch.msgid.link/40245564-41fe-4a5e-881f-cd517255b20a@gmail.com --- drivers/net/wireless/realtek/rtlwifi/rtl8192du/sw.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192du/sw.c b/drivers/net/wireless/realtek/rtlwifi/rtl8192du/sw.c index d069a81ac617..cc699efa9c79 100644 --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192du/sw.c +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192du/sw.c @@ -352,7 +352,6 @@ static const struct usb_device_id rtl8192d_usb_ids[] = { {RTL_USB_DEVICE(USB_VENDOR_ID_REALTEK, 0x8194, rtl92du_hal_cfg)}, {RTL_USB_DEVICE(USB_VENDOR_ID_REALTEK, 0x8111, rtl92du_hal_cfg)}, {RTL_USB_DEVICE(USB_VENDOR_ID_REALTEK, 0x0193, rtl92du_hal_cfg)}, - {RTL_USB_DEVICE(USB_VENDOR_ID_REALTEK, 0x8171, rtl92du_hal_cfg)}, {RTL_USB_DEVICE(USB_VENDOR_ID_REALTEK, 0xe194, rtl92du_hal_cfg)}, {RTL_USB_DEVICE(0x2019, 0xab2c, rtl92du_hal_cfg)}, {RTL_USB_DEVICE(0x2019, 0xab2d, rtl92du_hal_cfg)}, -- cgit From 031b46b4729b1a6ff8484a1e29cdb41b710ed740 Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Tue, 15 Oct 2024 14:14:06 +0200 Subject: phy: qcom: qmp-pcie: drop bogus x1e80100 qref supplies The PCIe PHYs on x1e80100 do not a have a qref supply so stop requesting one. This also avoids the follow warning at boot: qcom-qmp-pcie-phy 1bfc000.phy: supply vdda-qref not found, using dummy regulator Fixes: 9dab00ee9544 ("phy: qcom: qmp-pcie: Add Gen4 4-lanes mode for X1E80100") Fixes: 606060ce8fd0 ("phy: qcom-qmp-pcie: Add support for X1E80100 g3x2 and g4x2 PCIE") Cc: Abel Vesa Signed-off-by: Johan Hovold Reviewed-by: Neil Armstrong Link: https://lore.kernel.org/r/20241015121406.15033-1-johan+linaro@kernel.org Signed-off-by: Vinod Koul --- drivers/phy/qualcomm/phy-qcom-qmp-pcie.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-pcie.c b/drivers/phy/qualcomm/phy-qcom-qmp-pcie.c index f71787fb4d7e..36aaac34e6c6 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-pcie.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-pcie.c @@ -3661,8 +3661,8 @@ static const struct qmp_phy_cfg x1e80100_qmp_gen4x2_pciephy_cfg = { .reset_list = sdm845_pciephy_reset_l, .num_resets = ARRAY_SIZE(sdm845_pciephy_reset_l), - .vreg_list = sm8550_qmp_phy_vreg_l, - .num_vregs = ARRAY_SIZE(sm8550_qmp_phy_vreg_l), + .vreg_list = qmp_phy_vreg_l, + .num_vregs = ARRAY_SIZE(qmp_phy_vreg_l), .regs = pciephy_v6_regs_layout, .pwrdn_ctrl = SW_PWRDN | REFCLK_DRV_DSBL, @@ -3695,8 +3695,8 @@ static const struct qmp_phy_cfg x1e80100_qmp_gen4x4_pciephy_cfg = { .reset_list = sdm845_pciephy_reset_l, .num_resets = ARRAY_SIZE(sdm845_pciephy_reset_l), - .vreg_list = sm8550_qmp_phy_vreg_l, - .num_vregs = ARRAY_SIZE(sm8550_qmp_phy_vreg_l), + .vreg_list = qmp_phy_vreg_l, + .num_vregs = ARRAY_SIZE(qmp_phy_vreg_l), .regs = pciephy_v6_regs_layout, .pwrdn_ctrl = SW_PWRDN | REFCLK_DRV_DSBL, -- cgit From e10c52e7e064038d9bd67b20bf4ce92077d7d84e Mon Sep 17 00:00:00 2001 From: Jan Kiszka Date: Tue, 15 Oct 2024 15:04:44 +0800 Subject: phy: starfive: jh7110-usb: Fix link configuration to controller In order to connect the USB 2.0 PHY to its controller, we also need to set "u0_pdrstn_split_sw_usbpipe_plugen" [1]. Some downstream U-Boot versions did that, but upstream firmware does not, and the kernel must not rely on such behavior anyway. 
Failing to set this left the USB gadget port invisible to connected hosts behind. Link: https://doc-en.rvspace.org/JH7110/TRM/JH7110_TRM/sys_syscon.html#sys_syscon__section_b3l_fqs_wsb [1] Fixes: 16d3a71c20cf ("phy: starfive: Add JH7110 USB 2.0 PHY driver") Signed-off-by: Jan Kiszka Signed-off-by: Minda Chen Reviewed-by: Conor Dooley Link: https://lore.kernel.org/r/20241015070444.20972-2-minda.chen@starfivetech.com Signed-off-by: Vinod Koul --- drivers/phy/starfive/phy-jh7110-usb.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/drivers/phy/starfive/phy-jh7110-usb.c b/drivers/phy/starfive/phy-jh7110-usb.c index 633912f8a05d..cb5454fbe2c8 100644 --- a/drivers/phy/starfive/phy-jh7110-usb.c +++ b/drivers/phy/starfive/phy-jh7110-usb.c @@ -10,18 +10,24 @@ #include #include #include +#include #include #include #include +#include #include #define USB_125M_CLK_RATE 125000000 #define USB_LS_KEEPALIVE_OFF 0x4 #define USB_LS_KEEPALIVE_ENABLE BIT(4) +#define USB_PDRSTN_SPLIT BIT(17) +#define SYSCON_USB_SPLIT_OFFSET 0x18 + struct jh7110_usb2_phy { struct phy *phy; void __iomem *regs; + struct regmap *sys_syscon; struct clk *usb_125m_clk; struct clk *app_125m; enum phy_mode mode; @@ -61,6 +67,10 @@ static int usb2_phy_set_mode(struct phy *_phy, usb2_set_ls_keepalive(phy, (mode != PHY_MODE_USB_DEVICE)); } + /* Connect usb 2.0 phy mode */ + regmap_update_bits(phy->sys_syscon, SYSCON_USB_SPLIT_OFFSET, + USB_PDRSTN_SPLIT, USB_PDRSTN_SPLIT); + return 0; } @@ -129,6 +139,12 @@ static int jh7110_usb_phy_probe(struct platform_device *pdev) phy_set_drvdata(phy->phy, phy); phy_provider = devm_of_phy_provider_register(dev, of_phy_simple_xlate); + phy->sys_syscon = + syscon_regmap_lookup_by_compatible("starfive,jh7110-sys-syscon"); + if (IS_ERR(phy->sys_syscon)) + return dev_err_probe(dev, PTR_ERR(phy->sys_syscon), + "Failed to get sys-syscon\n"); + return PTR_ERR_OR_ZERO(phy_provider); } -- cgit From b4b32423b6ee6bb96e19fd82bcfd372f6192c737 Mon Sep 17 00:00:00 2001 From: Siddharth Vadapalli Date: Sat, 12 Oct 2024 11:09:37 +0530 Subject: phy: ti: phy-j721e-wiz: fix usxgmii configuration Commit b64a85fb8f53 ("phy: ti: phy-j721e-wiz.c: Add usxgmii support in wiz driver") added support for USXGMII mode. In doing so, P0_REFCLK_SEL was set to "pcs_mac_clk_divx1_ln_0" (0x3) and P0_STANDARD_MODE was set to LANE_MODE_GEN1, which results in a data rate of 5.15625 Gbps. However, since the USXGMII mode can support up to 10.3125 Gbps data rate, the aforementioned fields should be set to "pcs_mac_clk_divx0_ln_0" (0x2) and LANE_MODE_GEN2 respectively. The signal corresponding to the USXGMII lane of the SERDES has been measured as 5 Gbps without the change and 10 Gbps with the change. Hence, fix the configuration accordingly to support USXGMII up to 10G. 
Fixes: b64a85fb8f53 ("phy: ti: phy-j721e-wiz.c: Add usxgmii support in wiz driver") Signed-off-by: Siddharth Vadapalli Reviewed-by: Roger Quadros Link: https://lore.kernel.org/r/20241012053937.3596885-1-s-vadapalli@ti.com Signed-off-by: Vinod Koul --- drivers/phy/ti/phy-j721e-wiz.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/phy/ti/phy-j721e-wiz.c b/drivers/phy/ti/phy-j721e-wiz.c index a6c0c5607ffd..c6e846d385d2 100644 --- a/drivers/phy/ti/phy-j721e-wiz.c +++ b/drivers/phy/ti/phy-j721e-wiz.c @@ -450,8 +450,8 @@ static int wiz_mode_select(struct wiz *wiz) } else if (wiz->lane_phy_type[i] == PHY_TYPE_USXGMII) { ret = regmap_field_write(wiz->p0_mac_src_sel[i], 0x3); ret = regmap_field_write(wiz->p0_rxfclk_sel[i], 0x3); - ret = regmap_field_write(wiz->p0_refclk_sel[i], 0x3); - mode = LANE_MODE_GEN1; + ret = regmap_field_write(wiz->p0_refclk_sel[i], 0x2); + mode = LANE_MODE_GEN2; } else { continue; } -- cgit From d8f9d6d826fc15780451802796bb88ec52978f17 Mon Sep 17 00:00:00 2001 From: Cristian Ciocaltea Date: Mon, 23 Sep 2024 19:40:16 +0300 Subject: phy: phy-rockchip-samsung-hdptx: Depend on CONFIG_COMMON_CLK Ensure CONFIG_PHY_ROCKCHIP_SAMSUNG_HDPTX depends on CONFIG_COMMON_CLK to fix the following link errors when compile testing some random kernel configurations: m68k-linux-ld: drivers/phy/rockchip/phy-rockchip-samsung-hdptx.o: in function `rk_hdptx_phy_clk_register': drivers/phy/rockchip/phy-rockchip-samsung-hdptx.c:1031:(.text+0x470): undefined reference to `__clk_get_name' m68k-linux-ld: drivers/phy/rockchip/phy-rockchip-samsung-hdptx.c:1036:(.text+0x4ba): undefined reference to `devm_clk_hw_register' m68k-linux-ld: drivers/phy/rockchip/phy-rockchip-samsung-hdptx.c:1040:(.text+0x4d2): undefined reference to `of_clk_hw_simple_get' m68k-linux-ld: drivers/phy/rockchip/phy-rockchip-samsung-hdptx.c:1040:(.text+0x4da): undefined reference to `devm_of_clk_add_hw_provider' Fixes: c4b09c562086 ("phy: phy-rockchip-samsung-hdptx: Add clock provider support") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202409180305.53PXymZn-lkp@intel.com/ Signed-off-by: Cristian Ciocaltea Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: Heiko Stuebner Link: https://lore.kernel.org/r/20240923-sam-hdptx-link-fix-v1-1-8d10d7456305@collabora.com Signed-off-by: Vinod Koul --- drivers/phy/rockchip/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/phy/rockchip/Kconfig b/drivers/phy/rockchip/Kconfig index 490263375057..2f7a05f21dc5 100644 --- a/drivers/phy/rockchip/Kconfig +++ b/drivers/phy/rockchip/Kconfig @@ -86,6 +86,7 @@ config PHY_ROCKCHIP_PCIE config PHY_ROCKCHIP_SAMSUNG_HDPTX tristate "Rockchip Samsung HDMI/eDP Combo PHY driver" depends on (ARCH_ROCKCHIP || COMPILE_TEST) && OF + depends on COMMON_CLK depends on HAS_IOMEM select GENERIC_PHY select MFD_SYSCON -- cgit From 691e79ffc42154a9c91dc3b7e96a307037b4be74 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Fri, 11 Oct 2024 17:55:12 +0800 Subject: iio: gts-helper: Fix memory leaks in iio_gts_build_avail_scale_table() modprobe iio-test-gts and rmmod it, then the following memory leak occurs: unreferenced object 0xffffff80c810be00 (size 64): comm "kunit_try_catch", pid 1654, jiffies 4294913981 hex dump (first 32 bytes): 02 00 00 00 08 00 00 00 20 00 00 00 40 00 00 00 ........ ...@... 80 00 00 00 00 02 00 00 00 04 00 00 00 08 00 00 ................ 
backtrace (crc a63d875e): [<0000000028c1b3c2>] kmemleak_alloc+0x34/0x40 [<000000001d6ecc87>] __kmalloc_noprof+0x2bc/0x3c0 [<00000000393795c1>] devm_iio_init_iio_gts+0x4b4/0x16f4 [<0000000071bb4b09>] 0xffffffdf052a62e0 [<000000000315bc18>] 0xffffffdf052a6488 [<00000000f9dc55b5>] kunit_try_run_case+0x13c/0x3ac [<00000000175a3fd4>] kunit_generic_run_threadfn_adapter+0x80/0xec [<00000000f505065d>] kthread+0x2e8/0x374 [<00000000bbfb0e5d>] ret_from_fork+0x10/0x20 unreferenced object 0xffffff80cbfe9e70 (size 16): comm "kunit_try_catch", pid 1658, jiffies 4294914015 hex dump (first 16 bytes): 10 00 00 00 40 00 00 00 80 00 00 00 00 00 00 00 ....@........... backtrace (crc 857f0cb4): [<0000000028c1b3c2>] kmemleak_alloc+0x34/0x40 [<000000001d6ecc87>] __kmalloc_noprof+0x2bc/0x3c0 [<00000000393795c1>] devm_iio_init_iio_gts+0x4b4/0x16f4 [<0000000071bb4b09>] 0xffffffdf052a62e0 [<000000007d089d45>] 0xffffffdf052a6864 [<00000000f9dc55b5>] kunit_try_run_case+0x13c/0x3ac [<00000000175a3fd4>] kunit_generic_run_threadfn_adapter+0x80/0xec [<00000000f505065d>] kthread+0x2e8/0x374 [<00000000bbfb0e5d>] ret_from_fork+0x10/0x20 ...... It includes 5*5 times "size 64" memory leaks, which correspond to 5 times test_init_iio_gain_scale() calls with gts_test_gains size 10 (10*size(int)) and gts_test_itimes size 5. It also includes 5*1 times "size 16" memory leak, which correspond to one time __test_init_iio_gain_scale() call with gts_test_gains_gain_low size 3 (3*size(int)) and gts_test_itimes size 5. The reason is that the per_time_gains[i] is not freed which is allocated in the "gts->num_itime" for loop in iio_gts_build_avail_scale_table(). Cc: stable@vger.kernel.org Fixes: 38416c28e168 ("iio: light: Add gain-time-scale helpers") Signed-off-by: Jinjie Ruan Reviewed-by: Matti Vaittinen Link: https://patch.msgid.link/20241011095512.3667549-1-ruanjinjie@huawei.com Signed-off-by: Jonathan Cameron --- drivers/iio/industrialio-gts-helper.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/iio/industrialio-gts-helper.c b/drivers/iio/industrialio-gts-helper.c index 59d7615c0f56..7326c7949244 100644 --- a/drivers/iio/industrialio-gts-helper.c +++ b/drivers/iio/industrialio-gts-helper.c @@ -307,6 +307,8 @@ static int iio_gts_build_avail_scale_table(struct iio_gts *gts) if (ret) goto err_free_out; + for (i = 0; i < gts->num_itime; i++) + kfree(per_time_gains[i]); kfree(per_time_gains); gts->per_time_avail_scale_tables = per_time_scales; -- cgit From 369f05688911b05216cfcd6ca74473bec87948d7 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Wed, 16 Oct 2024 09:24:53 +0800 Subject: iio: gts-helper: Fix memory leaks for the error path of iio_gts_build_avail_scale_table() If per_time_scales[i] or per_time_gains[i] kcalloc fails in the for loop of iio_gts_build_avail_scale_table(), the err_free_out will fail to call kfree() each time when i is reduced to 0, so all the per_time_scales[0] and per_time_gains[0] will not be freed, which will cause memory leaks. Fix it by checking if i >= 0. 
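A minimal sketch of the unwind rule both gts-helper fixes above depend on: when allocating row i fails, rows 0..i-1 must all be freed, so the cleanup loop has to run down to and including index 0 (i >= 0). The function and array names are illustrative, not the iio-gts ones:

#include <linux/errno.h>
#include <linux/slab.h>

static int alloc_rows(int **rows, int num_rows, int row_len)
{
        int i;

        for (i = 0; i < num_rows; i++) {
                rows[i] = kcalloc(row_len, sizeof(int), GFP_KERNEL);
                if (!rows[i])
                        goto err_free;
        }
        return 0;

err_free:
        /* free every row allocated so far, including rows[0] */
        for (i--; i >= 0; i--)
                kfree(rows[i]);
        return -ENOMEM;
}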
Cc: stable@vger.kernel.org Fixes: 38416c28e168 ("iio: light: Add gain-time-scale helpers") Reviewed-by: Matti Vaittinen Signed-off-by: Jinjie Ruan Link: https://patch.msgid.link/20241016012453.2013302-1-ruanjinjie@huawei.com Signed-off-by: Jonathan Cameron --- drivers/iio/industrialio-gts-helper.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iio/industrialio-gts-helper.c b/drivers/iio/industrialio-gts-helper.c index 7326c7949244..5f131bc1a01e 100644 --- a/drivers/iio/industrialio-gts-helper.c +++ b/drivers/iio/industrialio-gts-helper.c @@ -315,7 +315,7 @@ static int iio_gts_build_avail_scale_table(struct iio_gts *gts) return 0; err_free_out: - for (i--; i; i--) { + for (i--; i >= 0; i--) { kfree(per_time_scales[i]); kfree(per_time_gains[i]); } -- cgit From 3cea8af2d1a9ae5869b47c3dabe3b20f331f3bbd Mon Sep 17 00:00:00 2001 From: Gil Fine Date: Thu, 10 Oct 2024 17:29:42 +0300 Subject: thunderbolt: Honor TMU requirements in the domain when setting TMU mode Currently, when configuring TMU (Time Management Unit) mode of a given router, we take into account only its own TMU requirements ignoring other routers in the domain. This is problematic if the router we are configuring has lower TMU requirements than what is already configured in the domain. In the scenario below, we have a host router with two USB4 ports: A and B. Port A connected to device router #1 (which supports CL states) and existing DisplayPort tunnel, thus, the TMU mode is HiFi uni-directional. 1. Initial topology [Host] A/ / [Device #1] / Monitor 2. Plug in device #2 (that supports CL states) to downstream port B of the host router [Host] A/ B\ / \ [Device #1] [Device #2] / Monitor The TMU mode on port B and port A will be configured to LowRes which is not what we want and will cause monitor to start flickering. To address this we first scan the domain and search for any router configured to HiFi uni-directional mode, and if found, configure TMU mode of the given router to HiFi uni-directional as well. 
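A sketch of the recursive scan pattern used in the TMU change above: device_for_each_child() stops as soon as the callback returns non-zero, so returning 1 on a match propagates "found" up through nested routers. The predicate below is a stand-in, not the real TMU mode check:

#include <linux/device.h>

static bool device_is_marked(struct device *dev)
{
        /* stand-in predicate; the real code inspects the router's TMU mode */
        return dev_get_drvdata(dev) != NULL;
}

static int child_is_marked(struct device *dev, void *unused)
{
        if (device_is_marked(dev))
                return 1;       /* non-zero stops iteration and bubbles up */

        return device_for_each_child(dev, NULL, child_is_marked);
}

static bool any_descendant_marked(struct device *root)
{
        return device_for_each_child(root, NULL, child_is_marked) == 1;
}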
Cc: stable@vger.kernel.org Signed-off-by: Gil Fine Signed-off-by: Mika Westerberg --- drivers/thunderbolt/tb.c | 48 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 42 insertions(+), 6 deletions(-) diff --git a/drivers/thunderbolt/tb.c b/drivers/thunderbolt/tb.c index 10e719dd837c..4f777788e917 100644 --- a/drivers/thunderbolt/tb.c +++ b/drivers/thunderbolt/tb.c @@ -288,6 +288,24 @@ static void tb_increase_tmu_accuracy(struct tb_tunnel *tunnel) device_for_each_child(&sw->dev, NULL, tb_increase_switch_tmu_accuracy); } +static int tb_switch_tmu_hifi_uni_required(struct device *dev, void *not_used) +{ + struct tb_switch *sw = tb_to_switch(dev); + + if (sw && tb_switch_tmu_is_enabled(sw) && + tb_switch_tmu_is_configured(sw, TB_SWITCH_TMU_MODE_HIFI_UNI)) + return 1; + + return device_for_each_child(dev, NULL, + tb_switch_tmu_hifi_uni_required); +} + +static bool tb_tmu_hifi_uni_required(struct tb *tb) +{ + return device_for_each_child(&tb->dev, NULL, + tb_switch_tmu_hifi_uni_required) == 1; +} + static int tb_enable_tmu(struct tb_switch *sw) { int ret; @@ -302,12 +320,30 @@ static int tb_enable_tmu(struct tb_switch *sw) ret = tb_switch_tmu_configure(sw, TB_SWITCH_TMU_MODE_MEDRES_ENHANCED_UNI); if (ret == -EOPNOTSUPP) { - if (tb_switch_clx_is_enabled(sw, TB_CL1)) - ret = tb_switch_tmu_configure(sw, - TB_SWITCH_TMU_MODE_LOWRES); - else - ret = tb_switch_tmu_configure(sw, - TB_SWITCH_TMU_MODE_HIFI_BI); + if (tb_switch_clx_is_enabled(sw, TB_CL1)) { + /* + * Figure out uni-directional HiFi TMU requirements + * currently in the domain. If there are no + * uni-directional HiFi requirements we can put the TMU + * into LowRes mode. + * + * Deliberately skip bi-directional HiFi links + * as these work independently of other links + * (and they do not allow any CL states anyway). + */ + if (tb_tmu_hifi_uni_required(sw->tb)) + ret = tb_switch_tmu_configure(sw, + TB_SWITCH_TMU_MODE_HIFI_UNI); + else + ret = tb_switch_tmu_configure(sw, + TB_SWITCH_TMU_MODE_LOWRES); + } else { + ret = tb_switch_tmu_configure(sw, TB_SWITCH_TMU_MODE_HIFI_BI); + } + + /* If not supported, fallback to bi-directional HiFi */ + if (ret == -EOPNOTSUPP) + ret = tb_switch_tmu_configure(sw, TB_SWITCH_TMU_MODE_HIFI_BI); } if (ret) return ret; -- cgit From 4021e685139d567b3fc862f54101ae9dbb15d8b5 Mon Sep 17 00:00:00 2001 From: Gao Xiang Date: Wed, 9 Oct 2024 11:31:50 +0800 Subject: fs/super.c: introduce get_tree_bdev_flags() As Allison reported [1], currently get_tree_bdev() will store "Can't lookup blockdev" error message. Although it makes sense for pure bdev-based fses, this message may mislead users who try to use EROFS file-backed mounts since get_tree_nodev() is used as a fallback then. Add get_tree_bdev_flags() to specify extensible flags [2] and GET_TREE_BDEV_QUIET_LOOKUP to silence "Can't lookup blockdev" message since it's misleading to EROFS file-backed mounts now. 
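A sketch of how a filesystem that also accepts non-blockdev sources might consume the new helper, mirroring the EROFS fallback described above; "myfs" and its fill_super stub are placeholders, not a real filesystem:

#include <linux/errno.h>
#include <linux/fs_context.h>

static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
        return -EINVAL;         /* a real filesystem builds its superblock here */
}

static int myfs_get_tree(struct fs_context *fc)
{
        int ret;

        ret = get_tree_bdev_flags(fc, myfs_fill_super,
                                  GET_TREE_BDEV_QUIET_LOOKUP);
        if (ret == -ENOTBLK)    /* source exists but is not a block device */
                ret = get_tree_nodev(fc, myfs_fill_super);

        return ret;
}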
[1] https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com [2] https://lore.kernel.org/r/ZwUkJEtwIpUA4qMz@infradead.org Suggested-by: Christoph Hellwig Signed-off-by: Gao Xiang Link: https://lore.kernel.org/r/20241009033151.2334888-1-hsiangkao@linux.alibaba.com Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Christian Brauner --- fs/super.c | 26 ++++++++++++++++++++------ include/linux/fs_context.h | 6 ++++++ 2 files changed, 26 insertions(+), 6 deletions(-) diff --git a/fs/super.c b/fs/super.c index 1db230432960..c9c7223bc2a2 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1596,13 +1596,14 @@ int setup_bdev_super(struct super_block *sb, int sb_flags, EXPORT_SYMBOL_GPL(setup_bdev_super); /** - * get_tree_bdev - Get a superblock based on a single block device + * get_tree_bdev_flags - Get a superblock based on a single block device * @fc: The filesystem context holding the parameters * @fill_super: Helper to initialise a new superblock + * @flags: GET_TREE_BDEV_* flags */ -int get_tree_bdev(struct fs_context *fc, - int (*fill_super)(struct super_block *, - struct fs_context *)) +int get_tree_bdev_flags(struct fs_context *fc, + int (*fill_super)(struct super_block *sb, + struct fs_context *fc), unsigned int flags) { struct super_block *s; int error = 0; @@ -1613,10 +1614,10 @@ int get_tree_bdev(struct fs_context *fc, error = lookup_bdev(fc->source, &dev); if (error) { - errorf(fc, "%s: Can't lookup blockdev", fc->source); + if (!(flags & GET_TREE_BDEV_QUIET_LOOKUP)) + errorf(fc, "%s: Can't lookup blockdev", fc->source); return error; } - fc->sb_flags |= SB_NOSEC; s = sget_dev(fc, dev); if (IS_ERR(s)) @@ -1644,6 +1645,19 @@ int get_tree_bdev(struct fs_context *fc, fc->root = dget(s->s_root); return 0; } +EXPORT_SYMBOL_GPL(get_tree_bdev_flags); + +/** + * get_tree_bdev - Get a superblock based on a single block device + * @fc: The filesystem context holding the parameters + * @fill_super: Helper to initialise a new superblock + */ +int get_tree_bdev(struct fs_context *fc, + int (*fill_super)(struct super_block *, + struct fs_context *)) +{ + return get_tree_bdev_flags(fc, fill_super, 0); +} EXPORT_SYMBOL(get_tree_bdev); static int test_bdev_super(struct super_block *s, void *data) diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h index c13e99cbbf81..4b4bfef6f053 100644 --- a/include/linux/fs_context.h +++ b/include/linux/fs_context.h @@ -160,6 +160,12 @@ extern int get_tree_keyed(struct fs_context *fc, int setup_bdev_super(struct super_block *sb, int sb_flags, struct fs_context *fc); + +#define GET_TREE_BDEV_QUIET_LOOKUP 0x0001 +int get_tree_bdev_flags(struct fs_context *fc, + int (*fill_super)(struct super_block *sb, + struct fs_context *fc), unsigned int flags); + extern int get_tree_bdev(struct fs_context *fc, int (*fill_super)(struct super_block *sb, struct fs_context *fc)); -- cgit From 14c2d97265ea5989000c428dbb7321cbd4a85f9b Mon Sep 17 00:00:00 2001 From: Gao Xiang Date: Wed, 9 Oct 2024 11:31:51 +0800 Subject: erofs: use get_tree_bdev_flags() to avoid misleading messages Users can pass in an arbitrary source path for the proper type of a mount then without "Can't lookup blockdev" error message. 
Reported-by: Allison Karlitskaya Closes: https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com Signed-off-by: Gao Xiang Link: https://lore.kernel.org/r/20241009033151.2334888-2-hsiangkao@linux.alibaba.com Signed-off-by: Christian Brauner --- fs/erofs/super.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/erofs/super.c b/fs/erofs/super.c index 320d586c3896..bed3dbe5b7cb 100644 --- a/fs/erofs/super.c +++ b/fs/erofs/super.c @@ -709,7 +709,9 @@ static int erofs_fc_get_tree(struct fs_context *fc) if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && sbi->fsid) return get_tree_nodev(fc, erofs_fc_fill_super); - ret = get_tree_bdev(fc, erofs_fc_fill_super); + ret = get_tree_bdev_flags(fc, erofs_fc_fill_super, + IS_ENABLED(CONFIG_EROFS_FS_BACKED_BY_FILE) ? + GET_TREE_BDEV_QUIET_LOOKUP : 0); #ifdef CONFIG_EROFS_FS_BACKED_BY_FILE if (ret == -ENOTBLK) { if (!fc->source) -- cgit From f2b5b8201b1545ef92e050735e9c768010d497aa Mon Sep 17 00:00:00 2001 From: Bartosz Golaszewski Date: Mon, 21 Oct 2024 16:21:13 +0200 Subject: spi: mtk-snfi: fix kerneldoc for mtk_snand_is_page_ops() The op argument is missing the colon and is not picked up by the kerneldoc generator. Fix it to address the following build warning: drivers/spi/spi-mtk-snfi.c:1201: warning: Function parameter or struct member 'op' not described in 'mtk_snand_is_page_ops' Fixes: 764f1b748164 ("spi: add driver for MTK SPI NAND Flash Interface") Signed-off-by: Bartosz Golaszewski Link: https://patch.msgid.link/20241021142113.71081-1-brgl@bgdev.pl Signed-off-by: Mark Brown --- drivers/spi/spi-mtk-snfi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/spi/spi-mtk-snfi.c b/drivers/spi/spi-mtk-snfi.c index ddd98ddb7913..c5677fd94e5e 100644 --- a/drivers/spi/spi-mtk-snfi.c +++ b/drivers/spi/spi-mtk-snfi.c @@ -1187,7 +1187,7 @@ cleanup: /** * mtk_snand_is_page_ops() - check if the op is a controller supported page op. - * @op spi-mem op to check + * @op: spi-mem op to check * * Check whether op can be executed with read_from_cache or program_load * mode in the controller. -- cgit From 6db388585e486c0261aeef55f8bc63a9b45756c0 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 15 Oct 2024 06:13:50 +0200 Subject: iomap: turn iomap_want_unshare_iter into an inline function MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit iomap_want_unshare_iter currently sits in fs/iomap/buffered-io.c, which depends on CONFIG_BLOCK. It is also in used in fs/dax.c whіch has no such dependency. Given that it is a trivial check turn it into an inline in include/linux/iomap.h to fix the DAX && !BLOCK build. 
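A generic illustration of the approach taken in the iomap change above: a one-line predicate that lived in a CONFIG_BLOCK-only object can move into the shared header as a static inline, so non-block users such as DAX get it without linking against the buffered-io code. The names below are invented, not the iomap ones:

/* include/linux/example.h */
struct example_iter {
        unsigned int flags;
};

#define EXAMPLE_F_SHARED        (1U << 0)

/* formerly an exported function in a CONFIG_BLOCK-only .c file */
static inline bool example_want_work(const struct example_iter *iter)
{
        return iter->flags & EXAMPLE_F_SHARED;
}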
Fixes: 6ef6a0e821d3 ("iomap: share iomap_unshare_iter predicate code with fsdax") Reported-by: kernel test robot Signed-off-by: Christoph Hellwig Link: https://lore.kernel.org/r/20241015041350.118403-1-hch@lst.de Reviewed-by: Brian Foster Signed-off-by: Christian Brauner --- fs/iomap/buffered-io.c | 17 ----------------- include/linux/iomap.h | 20 +++++++++++++++++++- 2 files changed, 19 insertions(+), 18 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 3899169b2cf7..6db2aaf808d5 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1309,23 +1309,6 @@ void iomap_file_buffered_write_punch_delalloc(struct inode *inode, } EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc); -bool iomap_want_unshare_iter(const struct iomap_iter *iter) -{ - /* - * Don't bother with blocks that are not shared to start with; or - * mappings that cannot be shared, such as inline data, delalloc - * reservations, holes or unwritten extents. - * - * Note that we use srcmap directly instead of iomap_iter_srcmap as - * unsharing requires providing a separate source map, and the presence - * of one is a good indicator that unsharing is needed, unlike - * IOMAP_F_SHARED which can be set for any data that goes into the COW - * fork for XFS. - */ - return (iter->iomap.flags & IOMAP_F_SHARED) && - iter->srcmap.type == IOMAP_MAPPED; -} - static loff_t iomap_unshare_iter(struct iomap_iter *iter) { struct iomap *iomap = &iter->iomap; diff --git a/include/linux/iomap.h b/include/linux/iomap.h index d8a7fc84348c..0198f36e521e 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -256,6 +256,25 @@ static inline const struct iomap *iomap_iter_srcmap(const struct iomap_iter *i) return &i->iomap; } +/* + * Check if the range needs to be unshared for a FALLOC_FL_UNSHARE_RANGE + * operation. + * + * Don't bother with blocks that are not shared to start with; or mappings that + * cannot be shared, such as inline data, delalloc reservations, holes or + * unwritten extents. + * + * Note that we use srcmap directly instead of iomap_iter_srcmap as unsharing + * requires providing a separate source map, and the presence of one is a good + * indicator that unsharing is needed, unlike IOMAP_F_SHARED which can be set + * for any data that goes into the COW fork for XFS. + */ +static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter) +{ + return (iter->iomap.flags & IOMAP_F_SHARED) && + iter->srcmap.type == IOMAP_MAPPED; +} + ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, const struct iomap_ops *ops, void *private); int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops); @@ -267,7 +286,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len); bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio); int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, const struct iomap_ops *ops); -bool iomap_want_unshare_iter(const struct iomap_iter *iter); int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, const struct iomap_ops *ops); int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, -- cgit From 89f8c6f197f480fe05edf91eb9359d5425869d04 Mon Sep 17 00:00:00 2001 From: Leon Romanovsky Date: Mon, 7 Oct 2024 20:55:17 +0300 Subject: RDMA/cxgb4: Dump vendor specific QP details Restore the missing functionality to dump vendor specific QP details, which was mistakenly removed in the commit mentioned in Fixes line. 
Fixes: 5cc34116ccec ("RDMA: Add dedicated QP resource tracker function") Link: https://patch.msgid.link/r/ed9844829135cfdcac7d64285688195a5cd43f82.1728323026.git.leonro@nvidia.com Reported-by: Dr. David Alan Gilbert Closes: https://lore.kernel.org/all/Zv_4qAxuC0dLmgXP@gallifrey Signed-off-by: Leon Romanovsky Signed-off-by: Jason Gunthorpe --- drivers/infiniband/hw/cxgb4/provider.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c index 10a4c738b59f..e059f92d90fd 100644 --- a/drivers/infiniband/hw/cxgb4/provider.c +++ b/drivers/infiniband/hw/cxgb4/provider.c @@ -473,6 +473,7 @@ static const struct ib_device_ops c4iw_dev_ops = { .fill_res_cq_entry = c4iw_fill_res_cq_entry, .fill_res_cm_id_entry = c4iw_fill_res_cm_id_entry, .fill_res_mr_entry = c4iw_fill_res_mr_entry, + .fill_res_qp_entry = c4iw_fill_res_qp_entry, .get_dev_fw_str = get_dev_fw_str, .get_dma_mr = c4iw_get_dma_mr, .get_hw_stats = c4iw_get_mib, -- cgit From 78ed28e08e74da6265e49e19206e1bcb8b9a7f0d Mon Sep 17 00:00:00 2001 From: Patrisious Haddad Date: Thu, 10 Oct 2024 11:50:23 +0300 Subject: RDMA/mlx5: Round max_rd_atomic/max_dest_rd_atomic up instead of down After the cited commit below max_dest_rd_atomic and max_rd_atomic values are being rounded down to the next power of 2. As opposed to the old behavior and mlx4 driver where they used to be rounded up instead. In order to stay consistent with older code and other drivers, revert to using fls round function which rounds up to the next power of 2. Fixes: f18e26af6aba ("RDMA/mlx5: Convert modify QP to use MLX5_SET macros") Link: https://patch.msgid.link/r/d85515d6ef21a2fa8ef4c8293dce9b58df8a6297.1728550179.git.leon@kernel.org Signed-off-by: Patrisious Haddad Reviewed-by: Maher Sanalla Signed-off-by: Leon Romanovsky Signed-off-by: Jason Gunthorpe --- drivers/infiniband/hw/mlx5/qp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c index e39b1a101e97..10ce3b44f645 100644 --- a/drivers/infiniband/hw/mlx5/qp.c +++ b/drivers/infiniband/hw/mlx5/qp.c @@ -4268,14 +4268,14 @@ static int __mlx5_ib_modify_qp(struct ib_qp *ibqp, MLX5_SET(qpc, qpc, retry_count, attr->retry_cnt); if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && attr->max_rd_atomic) - MLX5_SET(qpc, qpc, log_sra_max, ilog2(attr->max_rd_atomic)); + MLX5_SET(qpc, qpc, log_sra_max, fls(attr->max_rd_atomic - 1)); if (attr_mask & IB_QP_SQ_PSN) MLX5_SET(qpc, qpc, next_send_psn, attr->sq_psn); if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && attr->max_dest_rd_atomic) MLX5_SET(qpc, qpc, log_rra_max, - ilog2(attr->max_dest_rd_atomic)); + fls(attr->max_dest_rd_atomic - 1)); if (attr_mask & (IB_QP_ACCESS_FLAGS | IB_QP_MAX_DEST_RD_ATOMIC)) { err = set_qpc_atomic_flags(qp, attr, attr_mask, qpc); -- cgit From d71f4acd584cc861f54b3cb3ac07875f06550a05 Mon Sep 17 00:00:00 2001 From: Selvin Xavier Date: Mon, 14 Oct 2024 06:36:14 -0700 Subject: RDMA/bnxt_re: Fix the usage of control path spin locks Control path completion processing always runs in tasklet context. To synchronize with the posting thread, there is no need to use the irq variant of spin lock. Use spin_lock_bh instead. 
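A sketch of the locking rule applied above, assuming the only interrupt-side user of the lock is the CREQ tasklet (softirq context): the process-context side then only needs to block bottom halves, not hard interrupts. Names are illustrative:

#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(cmdq_lock);

/* process context (posting a command): block softirqs, not hard IRQs */
static void post_command(void)
{
        spin_lock_bh(&cmdq_lock);
        /* ... write the command into the queue ... */
        spin_unlock_bh(&cmdq_lock);
}

/* tasklet (completion processing): runs in softirq context, so plain
 * spin_lock() is sufficient on this side
 */
static void completion_tasklet(struct tasklet_struct *t)
{
        spin_lock(&cmdq_lock);
        /* ... consume completions ... */
        spin_unlock(&cmdq_lock);
}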
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Link: https://patch.msgid.link/r/1728912975-19346-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Kalesh AP Signed-off-by: Selvin Xavier Signed-off-by: Jason Gunthorpe --- drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c index 7294221b3316..ca26b88a0a80 100644 --- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c +++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c @@ -290,7 +290,6 @@ static int __send_message(struct bnxt_qplib_rcfw *rcfw, struct bnxt_qplib_hwq *hwq; u32 sw_prod, cmdq_prod; struct pci_dev *pdev; - unsigned long flags; u16 cookie; u8 *preq; @@ -301,7 +300,7 @@ static int __send_message(struct bnxt_qplib_rcfw *rcfw, /* Cmdq are in 16-byte units, each request can consume 1 or more * cmdqe */ - spin_lock_irqsave(&hwq->lock, flags); + spin_lock_bh(&hwq->lock); required_slots = bnxt_qplib_get_cmd_slots(msg->req); free_slots = HWQ_FREE_SLOTS(hwq); cookie = cmdq->seq_num & RCFW_MAX_COOKIE_VALUE; @@ -311,7 +310,7 @@ static int __send_message(struct bnxt_qplib_rcfw *rcfw, dev_info_ratelimited(&pdev->dev, "CMDQ is full req/free %d/%d!", required_slots, free_slots); - spin_unlock_irqrestore(&hwq->lock, flags); + spin_unlock_bh(&hwq->lock); return -EAGAIN; } if (msg->block) @@ -367,7 +366,7 @@ static int __send_message(struct bnxt_qplib_rcfw *rcfw, wmb(); writel(cmdq_prod, cmdq->cmdq_mbox.prod); writel(RCFW_CMDQ_TRIG_VAL, cmdq->cmdq_mbox.db); - spin_unlock_irqrestore(&hwq->lock, flags); + spin_unlock_bh(&hwq->lock); /* Return the CREQ response pointer */ return 0; } @@ -486,7 +485,6 @@ static int __bnxt_qplib_rcfw_send_message(struct bnxt_qplib_rcfw *rcfw, { struct creq_qp_event *evnt = (struct creq_qp_event *)msg->resp; struct bnxt_qplib_crsqe *crsqe; - unsigned long flags; u16 cookie; int rc; u8 opcode; @@ -512,12 +510,12 @@ static int __bnxt_qplib_rcfw_send_message(struct bnxt_qplib_rcfw *rcfw, rc = __poll_for_resp(rcfw, cookie); if (rc) { - spin_lock_irqsave(&rcfw->cmdq.hwq.lock, flags); + spin_lock_bh(&rcfw->cmdq.hwq.lock); crsqe = &rcfw->crsqe_tbl[cookie]; crsqe->is_waiter_alive = false; if (rc == -ENODEV) set_bit(FIRMWARE_STALL_DETECTED, &rcfw->cmdq.flags); - spin_unlock_irqrestore(&rcfw->cmdq.hwq.lock, flags); + spin_unlock_bh(&rcfw->cmdq.hwq.lock); return -ETIMEDOUT; } @@ -628,7 +626,6 @@ static int bnxt_qplib_process_qp_event(struct bnxt_qplib_rcfw *rcfw, u16 cookie, blocked = 0; bool is_waiter_alive; struct pci_dev *pdev; - unsigned long flags; u32 wait_cmds = 0; int rc = 0; @@ -659,8 +656,7 @@ static int bnxt_qplib_process_qp_event(struct bnxt_qplib_rcfw *rcfw, * */ - spin_lock_irqsave_nested(&hwq->lock, flags, - SINGLE_DEPTH_NESTING); + spin_lock_nested(&hwq->lock, SINGLE_DEPTH_NESTING); cookie = le16_to_cpu(qp_event->cookie); blocked = cookie & RCFW_CMD_IS_BLOCKING; cookie &= RCFW_MAX_COOKIE_VALUE; @@ -672,7 +668,7 @@ static int bnxt_qplib_process_qp_event(struct bnxt_qplib_rcfw *rcfw, dev_info(&pdev->dev, "rcfw timedout: cookie = %#x, free_slots = %d", cookie, crsqe->free_slots); - spin_unlock_irqrestore(&hwq->lock, flags); + spin_unlock(&hwq->lock); return rc; } @@ -720,7 +716,7 @@ static int bnxt_qplib_process_qp_event(struct bnxt_qplib_rcfw *rcfw, __destroy_timedout_ah(rcfw, (struct creq_create_ah_resp *) qp_event); - spin_unlock_irqrestore(&hwq->lock, flags); + spin_unlock(&hwq->lock); } *num_wait += wait_cmds; return rc; @@ 
-734,12 +730,11 @@ static void bnxt_qplib_service_creq(struct tasklet_struct *t) u32 type, budget = CREQ_ENTRY_POLL_BUDGET; struct bnxt_qplib_hwq *hwq = &creq->hwq; struct creq_base *creqe; - unsigned long flags; u32 num_wakeup = 0; u32 hw_polled = 0; /* Service the CREQ until budget is over */ - spin_lock_irqsave(&hwq->lock, flags); + spin_lock_bh(&hwq->lock); while (budget > 0) { creqe = bnxt_qplib_get_qe(hwq, hwq->cons, NULL); if (!CREQ_CMP_VALID(creqe, creq->creq_db.dbinfo.flags)) @@ -782,7 +777,7 @@ static void bnxt_qplib_service_creq(struct tasklet_struct *t) if (hw_polled) bnxt_qplib_ring_nq_db(&creq->creq_db.dbinfo, rcfw->res->cctx, true); - spin_unlock_irqrestore(&hwq->lock, flags); + spin_unlock_bh(&hwq->lock); if (num_wakeup) wake_up_nr(&rcfw->cmdq.waitq, num_wakeup); } -- cgit From 76d3ddff7153cc0bcc14a63798d19f5d0693ea71 Mon Sep 17 00:00:00 2001 From: Selvin Xavier Date: Mon, 14 Oct 2024 06:36:15 -0700 Subject: RDMA/bnxt_re: synchronize the qp-handle table array There is a race between the CREQ tasklet and destroy qp when accessing the qp-handle table. There is a chance of reading a valid qp-handle in the CREQ tasklet handler while the QP is already moving ahead with the destruction. Fixing this race by implementing a table-lock to synchronize the access. Fixes: f218d67ef004 ("RDMA/bnxt_re: Allow posting when QPs are in error") Fixes: 84cf229f4001 ("RDMA/bnxt_re: Fix the qp table indexing") Link: https://patch.msgid.link/r/1728912975-19346-3-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Kalesh AP Signed-off-by: Selvin Xavier Signed-off-by: Jason Gunthorpe --- drivers/infiniband/hw/bnxt_re/qplib_fp.c | 4 ++++ drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 13 +++++++++---- drivers/infiniband/hw/bnxt_re/qplib_rcfw.h | 2 ++ 3 files changed, 15 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/bnxt_re/qplib_fp.c b/drivers/infiniband/hw/bnxt_re/qplib_fp.c index 2ebcb2de962b..7ad83566ab0f 100644 --- a/drivers/infiniband/hw/bnxt_re/qplib_fp.c +++ b/drivers/infiniband/hw/bnxt_re/qplib_fp.c @@ -1532,9 +1532,11 @@ int bnxt_qplib_destroy_qp(struct bnxt_qplib_res *res, u32 tbl_indx; int rc; + spin_lock_bh(&rcfw->tbl_lock); tbl_indx = map_qp_id_to_tbl_indx(qp->id, rcfw); rcfw->qp_tbl[tbl_indx].qp_id = BNXT_QPLIB_QP_ID_INVALID; rcfw->qp_tbl[tbl_indx].qp_handle = NULL; + spin_unlock_bh(&rcfw->tbl_lock); bnxt_qplib_rcfw_cmd_prep((struct cmdq_base *)&req, CMDQ_BASE_OPCODE_DESTROY_QP, @@ -1545,8 +1547,10 @@ int bnxt_qplib_destroy_qp(struct bnxt_qplib_res *res, sizeof(resp), 0); rc = bnxt_qplib_rcfw_send_message(rcfw, &msg); if (rc) { + spin_lock_bh(&rcfw->tbl_lock); rcfw->qp_tbl[tbl_indx].qp_id = qp->id; rcfw->qp_tbl[tbl_indx].qp_handle = qp; + spin_unlock_bh(&rcfw->tbl_lock); return rc; } diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c index ca26b88a0a80..e82bd37158ad 100644 --- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c +++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c @@ -634,17 +634,21 @@ static int bnxt_qplib_process_qp_event(struct bnxt_qplib_rcfw *rcfw, case CREQ_QP_EVENT_EVENT_QP_ERROR_NOTIFICATION: err_event = (struct creq_qp_error_notification *)qp_event; qp_id = le32_to_cpu(err_event->xid); + spin_lock(&rcfw->tbl_lock); tbl_indx = map_qp_id_to_tbl_indx(qp_id, rcfw); qp = rcfw->qp_tbl[tbl_indx].qp_handle; + if (!qp) { + spin_unlock(&rcfw->tbl_lock); + break; + } + bnxt_qplib_mark_qp_error(qp); + rc = rcfw->creq.aeq_handler(rcfw, qp_event, qp); + spin_unlock(&rcfw->tbl_lock); dev_dbg(&pdev->dev, "Received QP 
error notification\n"); dev_dbg(&pdev->dev, "qpid 0x%x, req_err=0x%x, resp_err=0x%x\n", qp_id, err_event->req_err_state_reason, err_event->res_err_state_reason); - if (!qp) - break; - bnxt_qplib_mark_qp_error(qp); - rc = rcfw->creq.aeq_handler(rcfw, qp_event, qp); break; default: /* @@ -973,6 +977,7 @@ int bnxt_qplib_alloc_rcfw_channel(struct bnxt_qplib_res *res, GFP_KERNEL); if (!rcfw->qp_tbl) goto fail; + spin_lock_init(&rcfw->tbl_lock); rcfw->max_timeout = res->cctx->hwrm_cmd_max_timeout; diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h index 45996e60a0d0..07779aeb7575 100644 --- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h +++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h @@ -224,6 +224,8 @@ struct bnxt_qplib_rcfw { struct bnxt_qplib_crsqe *crsqe_tbl; int qp_tbl_size; struct bnxt_qplib_qp_node *qp_tbl; + /* To synchronize the qp-handle hash table */ + spinlock_t tbl_lock; u64 oos_prev; u32 init_oos_stats; u32 cmdq_depth; -- cgit From f89263b69731e0144d275fff777ee0dd92069200 Mon Sep 17 00:00:00 2001 From: Richard Zhu Date: Mon, 21 Oct 2024 11:52:41 -0400 Subject: phy: freescale: imx8m-pcie: Do CMN_RST just before PHY PLL lock check When enable initcall_debug together with higher debug level below. CONFIG_CONSOLE_LOGLEVEL_DEFAULT=9 CONFIG_CONSOLE_LOGLEVEL_QUIET=9 CONFIG_MESSAGE_LOGLEVEL_DEFAULT=7 The initialization of i.MX8MP PCIe PHY might be timeout failed randomly. To fix this issue, adjust the sequence of the resets refer to the power up sequence listed below. i.MX8MP PCIe PHY power up sequence: /--------------------------------------------- 1.8v supply ---------/ /--------------------------------------------------- 0.8v supply ---/ ---\ /-------------------------------------------------- X REFCLK Valid Reference Clock ---/ \-------------------------------------------------- ------------------------------------------- | i_init_restn -------------- ------------------------------------ | i_cmn_rstn --------------------- ------------------------------- | o_pll_lock_done -------------------------- Logs: imx6q-pcie 33800000.pcie: host bridge /soc@0/pcie@33800000 ranges: imx6q-pcie 33800000.pcie: IO 0x001ff80000..0x001ff8ffff -> 0x0000000000 imx6q-pcie 33800000.pcie: MEM 0x0018000000..0x001fefffff -> 0x0018000000 probe of clk_imx8mp_audiomix.reset.0 returned 0 after 1052 usecs probe of 30e20000.clock-controller returned 0 after 32971 usecs phy phy-32f00000.pcie-phy.4: phy poweron failed --> -110 probe of 30e10000.dma-controller returned 0 after 10235 usecs imx6q-pcie 33800000.pcie: waiting for PHY ready timeout! dwhdmi-imx 32fd8000.hdmi: Detected HDMI TX controller v2.13a with HDCP (samsung_dw_hdmi_phy2) imx6q-pcie 33800000.pcie: probe with driver imx6q-pcie failed with error -110 Fixes: dce9edff16ee ("phy: freescale: imx8m-pcie: Add i.MX8MP PCIe PHY support") Cc: stable@vger.kernel.org Signed-off-by: Richard Zhu Signed-off-by: Frank Li v2 changes: - Rebase to latest fixes branch of linux-phy git repo. - Richard's environment have problem and can't sent out patch. So I help post this fix patch. 
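A minimal sketch of the final step the reset reordering above protects, assuming only that the PHY exposes a lock/status register: the common-block reset is released immediately before this poll, otherwise o_pll_lock_done may never assert within the timeout. The helper is illustrative; the 10 us / 20 ms values match the driver's poll:

#include <linux/io.h>
#include <linux/iopoll.h>
#include <linux/types.h>

static int phy_wait_pll_locked(void __iomem *status_reg, u32 done_val)
{
        u32 val;

        /* poll every 10 us, give up after 20 ms */
        return readl_poll_timeout(status_reg, val, val == done_val, 10, 20000);
}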
Link: https://lore.kernel.org/r/20241021155241.943665-1-Frank.Li@nxp.com Signed-off-by: Vinod Koul --- drivers/phy/freescale/phy-fsl-imx8m-pcie.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/phy/freescale/phy-fsl-imx8m-pcie.c b/drivers/phy/freescale/phy-fsl-imx8m-pcie.c index 11fcb1867118..e98361dcdead 100644 --- a/drivers/phy/freescale/phy-fsl-imx8m-pcie.c +++ b/drivers/phy/freescale/phy-fsl-imx8m-pcie.c @@ -141,11 +141,6 @@ static int imx8_pcie_phy_power_on(struct phy *phy) IMX8MM_GPR_PCIE_REF_CLK_PLL); usleep_range(100, 200); - /* Do the PHY common block reset */ - regmap_update_bits(imx8_phy->iomuxc_gpr, IOMUXC_GPR14, - IMX8MM_GPR_PCIE_CMN_RST, - IMX8MM_GPR_PCIE_CMN_RST); - switch (imx8_phy->drvdata->variant) { case IMX8MP: reset_control_deassert(imx8_phy->perst); @@ -156,6 +151,11 @@ static int imx8_pcie_phy_power_on(struct phy *phy) break; } + /* Do the PHY common block reset */ + regmap_update_bits(imx8_phy->iomuxc_gpr, IOMUXC_GPR14, + IMX8MM_GPR_PCIE_CMN_RST, + IMX8MM_GPR_PCIE_CMN_RST); + /* Polling to check the phy is ready or not. */ ret = readl_poll_timeout(imx8_phy->base + IMX8MM_PCIE_PHY_CMN_REG075, val, val == ANA_PLL_DONE, 10, 20000); -- cgit From 16fde3e076775d3b51f48d44d050746fbc9d638e Mon Sep 17 00:00:00 2001 From: Abel Vesa Date: Mon, 21 Oct 2024 16:53:28 +0300 Subject: dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Fix X1E80100 resets entries The PCIe 6a PHY is actually Gen4 4-lanes capable. So the gen4x4 compatible describes it. But according to the schema, currently the gen4x4 compatible doesn't require both PHY and PHY-nocsr resets, while the HW does. So fix that by adding the gen4x4 compatible alongside the gen4x2 one for the resets description. Fixes: 0c5f4d23f776 ("dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Document the X1E80100 QMP PCIe PHY Gen4 x4") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202410182029.n2zPkuGx-lkp@intel.com/ Reviewed-by: Krzysztof Kozlowski Reviewed-by: Johan Hovold Signed-off-by: Abel Vesa Link: https://lore.kernel.org/r/20241021-phy-qcom-qmp-pcie-fix-x1e80100-gen4x4-resets-v3-1-1918c46fc37c@linaro.org Signed-off-by: Vinod Koul --- Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml b/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml index 447b379aa83d..380a9222a51d 100644 --- a/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml +++ b/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml @@ -201,6 +201,7 @@ allOf: - qcom,sm8550-qmp-gen4x2-pcie-phy - qcom,sm8650-qmp-gen4x2-pcie-phy - qcom,x1e80100-qmp-gen4x2-pcie-phy + - qcom,x1e80100-qmp-gen4x4-pcie-phy then: properties: resets: -- cgit From e70d2677ef4088d59158739d72b67ac36d1b132b Mon Sep 17 00:00:00 2001 From: Dipendra Khadka Date: Mon, 30 Sep 2024 19:11:00 +0000 Subject: phy: tegra: xusb: Add error pointer check in xusb.c Add error pointer check after tegra_xusb_find_lane(). 
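Background sketch for the one-line check above: lookup helpers such as tegra_xusb_find_lane() report failure through an ERR_PTR()-encoded pointer rather than NULL, so the result must be screened with IS_ERR() before use. "find_resource" and its stub are placeholders:

#include <linux/err.h>
#include <linux/errno.h>

struct resource_obj;

static struct resource_obj *find_resource(const char *name)
{
        /* stub standing in for a real lookup that may fail */
        return ERR_PTR(-ENODEV);
}

static int use_resource(const char *name)
{
        struct resource_obj *res = find_resource(name);

        if (IS_ERR(res))
                return PTR_ERR(res);    /* propagate -ENODEV, -EPROBE_DEFER, ... */

        /* safe to dereference res from here on */
        return 0;
}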
Fixes: e8f7d2f409a1 ("phy: tegra: xusb: Add usb-phy support") Signed-off-by: Dipendra Khadka Acked-by: Thierry Reding Link: https://lore.kernel.org/r/20240930191101.13184-1-kdipendra88@gmail.com Signed-off-by: Vinod Koul --- drivers/phy/tegra/xusb.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/phy/tegra/xusb.c b/drivers/phy/tegra/xusb.c index cfdb54b6070a..342f5ccf611d 100644 --- a/drivers/phy/tegra/xusb.c +++ b/drivers/phy/tegra/xusb.c @@ -699,6 +699,8 @@ static int tegra_xusb_setup_usb_role_switch(struct tegra_xusb_port *port) return -ENOMEM; lane = tegra_xusb_find_lane(port->padctl, "usb2", port->index); + if (IS_ERR(lane)) + return PTR_ERR(lane); /* * Assign phy dev to usb-phy dev. Host/device drivers can use phy -- cgit From 995d4d558eea79f8d2e8e46d0914c3940b7463ac Mon Sep 17 00:00:00 2001 From: "Jason-JH.Lin" Date: Wed, 9 Oct 2024 11:46:42 +0800 Subject: drm/mediatek: ovl: Fix XRGB format breakage for blend_modes unsupported SoCs OVL_CON_AEN is for alpha blending enable. For the SoC that is supported the blend_modes, OVL_CON_AEN will always enabled to use constant alpha and then use the ignore_pixel_alpha bit to do the alpha blending for XRGB8888 format. Note that ignore pixel alpha bit is not supported if the SoC is not supported the blend_modes. So it will break the original setting of XRGB8888 format for the blend_modes unsupported SoCs, such as MT8173. To fix the downgrade issue, enable alpha blending only when a valid blend_mode or has_alpha is set. Fixes: bc46eb5d5d77 ("drm/mediatek: Support DRM plane alpha in OVL") Signed-off-by: Jason-JH.Lin Reviewed-by: CK Hu Reviewed-by: AngeloGioacchino Del Regno Link: https://patchwork.kernel.org/project/dri-devel/patch/20241009034646.13143-2-jason-jh.lin@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_disp_ovl.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c index 89b439dcf3a6..047cd1796a51 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c @@ -473,8 +473,14 @@ void mtk_ovl_layer_config(struct device *dev, unsigned int idx, con = ovl_fmt_convert(ovl, fmt, blend_mode); if (state->base.fb) { - con |= OVL_CON_AEN; con |= state->base.alpha & OVL_CON_ALPHA; + + /* + * For blend_modes supported SoCs, always enable alpha blending. + * For blend_modes unsupported SoCs, enable alpha blending when has_alpha is set. + */ + if (blend_mode || state->base.fb->format->has_alpha) + con |= OVL_CON_AEN; } /* CONST_BLD must be enabled for XRGB formats although the alpha channel -- cgit From 28fbc3293f034f3d148bb0bc433114db493657b8 Mon Sep 17 00:00:00 2001 From: "Jason-JH.Lin" Date: Wed, 9 Oct 2024 11:46:43 +0800 Subject: drm/mediatek: ovl: Refine ignore_pixel_alpha comment and placement Refine the comment for ignore_pixel_alpha flag and move it to if(state->fb) statement to make it less conditional. 
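A condensed sketch of the decision the two OVL alpha changes above encode, using only the fields visible in those patches: constant-alpha blending (OVL_CON_AEN) is enabled for a non-zero blend mode or an alpha-capable framebuffer format, while CONST_BLD separately ignores the pixel alpha of XRGB-style formats. The helper name is illustrative:

#include <drm/drm_fourcc.h>
#include <drm/drm_framebuffer.h>

static bool ovl_wants_alpha_blend(unsigned int blend_mode,
                                  const struct drm_framebuffer *fb)
{
        /* mirrors the driver's check: blend_mode stays 0 on SoCs without
         * blend_modes support, so only has_alpha enables blending there
         */
        return blend_mode || (fb && fb->format->has_alpha);
}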
Signed-off-by: Jason-JH.Lin Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: CK Hu Link: https://patchwork.kernel.org/project/dri-devel/patch/20241009034646.13143-3-jason-jh.lin@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_disp_ovl.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c index 047cd1796a51..0d3a9c5e8d26 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c @@ -481,16 +481,16 @@ void mtk_ovl_layer_config(struct device *dev, unsigned int idx, */ if (blend_mode || state->base.fb->format->has_alpha) con |= OVL_CON_AEN; - } - /* CONST_BLD must be enabled for XRGB formats although the alpha channel - * can be ignored, or OVL will still read the value from memory. - * For RGB888 related formats, whether CONST_BLD is enabled or not won't - * affect the result. Therefore we use !has_alpha as the condition. - */ - if ((state->base.fb && !state->base.fb->format->has_alpha) || - blend_mode == DRM_MODE_BLEND_PIXEL_NONE) - ignore_pixel_alpha = OVL_CONST_BLEND; + /* + * Although the alpha channel can be ignored, CONST_BLD must be enabled + * for XRGB format, otherwise OVL will still read the value from memory. + * For RGB888 related formats, whether CONST_BLD is enabled or not won't + * affect the result. Therefore we use !has_alpha as the condition. + */ + if (blend_mode == DRM_MODE_BLEND_PIXEL_NONE || !state->base.fb->format->has_alpha) + ignore_pixel_alpha = OVL_CONST_BLEND; + } if (pending->rotation & DRM_MODE_REFLECT_Y) { con |= OVL_CON_VIRT_FLIP; -- cgit From 41607c3ceb0e527e0985387bc41bbf291dc9a3d8 Mon Sep 17 00:00:00 2001 From: "Jason-JH.Lin" Date: Wed, 9 Oct 2024 11:46:44 +0800 Subject: drm/mediatek: ovl: Remove the color format comment for ovl_fmt_convert() Since we changed MACROs to be consistent with DRM input color format naming, the comment for ovl_fmt_conver() is no longer needed. Fixes: 9f428b95ac89 ("drm/mediatek: Add new color format MACROs in OVL") Signed-off-by: Jason-JH.Lin Reviewed-by: CK Hu Reviewed-by: AngeloGioacchino Del Regno Link: https://patchwork.kernel.org/project/dri-devel/patch/20241009034646.13143-4-jason-jh.lin@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_disp_ovl.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c index 0d3a9c5e8d26..1ccb700858cf 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c @@ -389,11 +389,6 @@ void mtk_ovl_layer_off(struct device *dev, unsigned int idx, static unsigned int ovl_fmt_convert(struct mtk_disp_ovl *ovl, unsigned int fmt, unsigned int blend_mode) { - /* The return value in switch "MEM_MODE_INPUT_FORMAT_XXX" - * is defined in mediatek HW data sheet. - * The alphabet order in XXX is no relation to data - * arrangement in memory. - */ switch (fmt) { default: case DRM_FORMAT_RGB565: -- cgit From 333ab43616ff46694b46b4137acd0e19dc291a7f Mon Sep 17 00:00:00 2001 From: "Jason-JH.Lin" Date: Wed, 9 Oct 2024 11:46:45 +0800 Subject: drm/mediatek: ovl: Add blend_modes to driver data OVL_CON_CLRFMT_MAN is a configuration for extending color format settings of DISP_REG_OVL_CON(n). It will change some of the original color format settings. Take the settings of (3 << 12) for example. - If OVL_CON_CLRFMT_MAN = 0 means OVL_CON_CLRFMT_RGBA8888. 
- If OVL_CON_CLRFMT_MAN = 1 means OVL_CON_CLRFMT_PARGB8888. Since previous SoCs did not support OVL_CON_CLRFMT_MAN, this means that the SoC does not support the premultiplied color format. It will break the original color format setting of MT8173. Therefore, the blend_modes is added to the driver data and then mtk_ovl_fmt_convert() will check the blend_modes to see if pre-multiplied is supported in the current platform. If it is not supported, use coverage mode to set it to the supported color formats to solve the degradation problem. Fixes: a3f7f7ef4bfe ("drm/mediatek: Support "Pre-multiplied" blending in OVL") Signed-off-by: Jason-JH.Lin Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: CK Hu Link: https://patchwork.kernel.org/project/dri-devel/patch/20241009034646.13143-5-jason-jh.lin@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_disp_ovl.c | 34 ++++++++++++++++++++++++++++++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c index 1ccb700858cf..fab23b1904bd 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c @@ -146,6 +146,7 @@ struct mtk_disp_ovl_data { bool fmt_rgb565_is_0; bool smi_id_en; bool supports_afbc; + const u32 blend_modes; const u32 *formats; size_t num_formats; bool supports_clrfmt_ext; @@ -386,9 +387,27 @@ void mtk_ovl_layer_off(struct device *dev, unsigned int idx, DISP_REG_OVL_RDMA_CTRL(idx)); } -static unsigned int ovl_fmt_convert(struct mtk_disp_ovl *ovl, unsigned int fmt, - unsigned int blend_mode) +static unsigned int mtk_ovl_fmt_convert(struct mtk_disp_ovl *ovl, + struct mtk_plane_state *state) { + unsigned int fmt = state->pending.format; + unsigned int blend_mode = DRM_MODE_BLEND_COVERAGE; + + /* + * For the platforms where OVL_CON_CLRFMT_MAN is defined in the hardware data sheet + * and supports premultiplied color formats, such as OVL_CON_CLRFMT_PARGB8888. + * + * Check blend_modes in the driver data to see if premultiplied mode is supported. + * If not, use coverage mode instead to set it to the supported color formats. + * + * Current DRM assumption is that alpha is default premultiplied, so the bitmask of + * blend_modes must include BIT(DRM_MODE_BLEND_PREMULTI). Otherwise, mtk_plane_init() + * will get an error return from drm_plane_create_blend_mode_property() and + * state->base.pixel_blend_mode should not be used. 
+ */ + if (ovl->data->blend_modes & BIT(DRM_MODE_BLEND_PREMULTI)) + blend_mode = state->base.pixel_blend_mode; + switch (fmt) { default: case DRM_FORMAT_RGB565: @@ -466,7 +485,7 @@ void mtk_ovl_layer_config(struct device *dev, unsigned int idx, return; } - con = ovl_fmt_convert(ovl, fmt, blend_mode); + con = mtk_ovl_fmt_convert(ovl, state); if (state->base.fb) { con |= state->base.alpha & OVL_CON_ALPHA; @@ -664,6 +683,9 @@ static const struct mtk_disp_ovl_data mt8192_ovl_driver_data = { .layer_nr = 4, .fmt_rgb565_is_0 = true, .smi_id_en = true, + .blend_modes = BIT(DRM_MODE_BLEND_PREMULTI) | + BIT(DRM_MODE_BLEND_COVERAGE) | + BIT(DRM_MODE_BLEND_PIXEL_NONE), .formats = mt8173_formats, .num_formats = ARRAY_SIZE(mt8173_formats), }; @@ -674,6 +696,9 @@ static const struct mtk_disp_ovl_data mt8192_ovl_2l_driver_data = { .layer_nr = 2, .fmt_rgb565_is_0 = true, .smi_id_en = true, + .blend_modes = BIT(DRM_MODE_BLEND_PREMULTI) | + BIT(DRM_MODE_BLEND_COVERAGE) | + BIT(DRM_MODE_BLEND_PIXEL_NONE), .formats = mt8173_formats, .num_formats = ARRAY_SIZE(mt8173_formats), }; @@ -685,6 +710,9 @@ static const struct mtk_disp_ovl_data mt8195_ovl_driver_data = { .fmt_rgb565_is_0 = true, .smi_id_en = true, .supports_afbc = true, + .blend_modes = BIT(DRM_MODE_BLEND_PREMULTI) | + BIT(DRM_MODE_BLEND_COVERAGE) | + BIT(DRM_MODE_BLEND_PIXEL_NONE), .formats = mt8195_formats, .num_formats = ARRAY_SIZE(mt8195_formats), .supports_clrfmt_ext = true, -- cgit From e6411bf2aea87aa3fdf74c7bce37db3d975ab026 Mon Sep 17 00:00:00 2001 From: "Jason-JH.Lin" Date: Wed, 9 Oct 2024 11:46:46 +0800 Subject: drm/mediatek: Add blend_modes to mtk_plane_init() for different SoCs Since some SoCs support premultiplied pixel formats but some do not, the blend_modes parameter is added to mtk_plane_init(), which is obtained from the mtk_ddp_comp_get_blend_modes function implemented in different blending supported components. The blending supported components can use driver data to set the blend mode capabilities for different SoCs. 
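A sketch of the capability plumbing just described, using the bitmask and property call that appear in the patch below; the wrapper itself is illustrative:

#include <drm/drm_blend.h>
#include <drm/drm_plane.h>
#include <drm/drm_print.h>

static void example_create_blend_property(struct drm_plane *plane,
                                          u32 blend_modes)
{
        /* e.g. BIT(DRM_MODE_BLEND_PREMULTI) | BIT(DRM_MODE_BLEND_COVERAGE) |
         *      BIT(DRM_MODE_BLEND_PIXEL_NONE) from the component driver data
         */
        if (blend_modes &&
            drm_plane_create_blend_mode_property(plane, blend_modes))
                DRM_ERROR("failed to create property: blend_mode\n");
}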
Signed-off-by: Jason-JH.Lin Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: CK Hu Link: https://patchwork.kernel.org/project/dri-devel/patch/20241009034646.13143-6-jason-jh.lin@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_crtc.c | 1 + drivers/gpu/drm/mediatek/mtk_ddp_comp.c | 2 ++ drivers/gpu/drm/mediatek/mtk_ddp_comp.h | 10 ++++++++++ drivers/gpu/drm/mediatek/mtk_disp_drv.h | 2 ++ drivers/gpu/drm/mediatek/mtk_disp_ovl.c | 7 +++++++ drivers/gpu/drm/mediatek/mtk_disp_ovl_adaptor.c | 7 +++++++ drivers/gpu/drm/mediatek/mtk_ethdr.c | 7 +++++++ drivers/gpu/drm/mediatek/mtk_ethdr.h | 1 + drivers/gpu/drm/mediatek/mtk_plane.c | 15 +++++++-------- drivers/gpu/drm/mediatek/mtk_plane.h | 4 ++-- 10 files changed, 46 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_crtc.c b/drivers/gpu/drm/mediatek/mtk_crtc.c index 175b00e5a253..b65f196f2015 100644 --- a/drivers/gpu/drm/mediatek/mtk_crtc.c +++ b/drivers/gpu/drm/mediatek/mtk_crtc.c @@ -913,6 +913,7 @@ static int mtk_crtc_init_comp_planes(struct drm_device *drm_dev, BIT(pipe), mtk_crtc_plane_type(mtk_crtc->layer_nr, num_planes), mtk_ddp_comp_supported_rotations(comp), + mtk_ddp_comp_get_blend_modes(comp), mtk_ddp_comp_get_formats(comp), mtk_ddp_comp_get_num_formats(comp), i); if (ret) diff --git a/drivers/gpu/drm/mediatek/mtk_ddp_comp.c b/drivers/gpu/drm/mediatek/mtk_ddp_comp.c index be66d94be361..edc6417639e6 100644 --- a/drivers/gpu/drm/mediatek/mtk_ddp_comp.c +++ b/drivers/gpu/drm/mediatek/mtk_ddp_comp.c @@ -363,6 +363,7 @@ static const struct mtk_ddp_comp_funcs ddp_ovl = { .layer_config = mtk_ovl_layer_config, .bgclr_in_on = mtk_ovl_bgclr_in_on, .bgclr_in_off = mtk_ovl_bgclr_in_off, + .get_blend_modes = mtk_ovl_get_blend_modes, .get_formats = mtk_ovl_get_formats, .get_num_formats = mtk_ovl_get_num_formats, }; @@ -416,6 +417,7 @@ static const struct mtk_ddp_comp_funcs ddp_ovl_adaptor = { .disconnect = mtk_ovl_adaptor_disconnect, .add = mtk_ovl_adaptor_add_comp, .remove = mtk_ovl_adaptor_remove_comp, + .get_blend_modes = mtk_ovl_adaptor_get_blend_modes, .get_formats = mtk_ovl_adaptor_get_formats, .get_num_formats = mtk_ovl_adaptor_get_num_formats, .mode_valid = mtk_ovl_adaptor_mode_valid, diff --git a/drivers/gpu/drm/mediatek/mtk_ddp_comp.h b/drivers/gpu/drm/mediatek/mtk_ddp_comp.h index ecf6dc283cd7..39720b27f4e9 100644 --- a/drivers/gpu/drm/mediatek/mtk_ddp_comp.h +++ b/drivers/gpu/drm/mediatek/mtk_ddp_comp.h @@ -80,6 +80,7 @@ struct mtk_ddp_comp_funcs { void (*ctm_set)(struct device *dev, struct drm_crtc_state *state); struct device * (*dma_dev_get)(struct device *dev); + u32 (*get_blend_modes)(struct device *dev); const u32 *(*get_formats)(struct device *dev); size_t (*get_num_formats)(struct device *dev); void (*connect)(struct device *dev, struct device *mmsys_dev, unsigned int next); @@ -266,6 +267,15 @@ static inline struct device *mtk_ddp_comp_dma_dev_get(struct mtk_ddp_comp *comp) return comp->dev; } +static inline +u32 mtk_ddp_comp_get_blend_modes(struct mtk_ddp_comp *comp) +{ + if (comp->funcs && comp->funcs->get_blend_modes) + return comp->funcs->get_blend_modes(comp->dev); + + return 0; +} + static inline const u32 *mtk_ddp_comp_get_formats(struct mtk_ddp_comp *comp) { diff --git a/drivers/gpu/drm/mediatek/mtk_disp_drv.h b/drivers/gpu/drm/mediatek/mtk_disp_drv.h index 082ac18fe04a..04154db9085c 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_drv.h +++ b/drivers/gpu/drm/mediatek/mtk_disp_drv.h @@ -103,6 +103,7 @@ void mtk_ovl_register_vblank_cb(struct device *dev, void 
mtk_ovl_unregister_vblank_cb(struct device *dev); void mtk_ovl_enable_vblank(struct device *dev); void mtk_ovl_disable_vblank(struct device *dev); +u32 mtk_ovl_get_blend_modes(struct device *dev); const u32 *mtk_ovl_get_formats(struct device *dev); size_t mtk_ovl_get_num_formats(struct device *dev); @@ -131,6 +132,7 @@ void mtk_ovl_adaptor_start(struct device *dev); void mtk_ovl_adaptor_stop(struct device *dev); unsigned int mtk_ovl_adaptor_layer_nr(struct device *dev); struct device *mtk_ovl_adaptor_dma_dev_get(struct device *dev); +u32 mtk_ovl_adaptor_get_blend_modes(struct device *dev); const u32 *mtk_ovl_adaptor_get_formats(struct device *dev); size_t mtk_ovl_adaptor_get_num_formats(struct device *dev); enum drm_mode_status mtk_ovl_adaptor_mode_valid(struct device *dev, diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c index fab23b1904bd..9786ce94de0e 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c @@ -215,6 +215,13 @@ void mtk_ovl_disable_vblank(struct device *dev) writel_relaxed(0x0, ovl->regs + DISP_REG_OVL_INTEN); } +u32 mtk_ovl_get_blend_modes(struct device *dev) +{ + struct mtk_disp_ovl *ovl = dev_get_drvdata(dev); + + return ovl->data->blend_modes; +} + const u32 *mtk_ovl_get_formats(struct device *dev) { struct mtk_disp_ovl *ovl = dev_get_drvdata(dev); diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl_adaptor.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl_adaptor.c index c6768210b08b..bf2546c4681a 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl_adaptor.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl_adaptor.c @@ -400,6 +400,13 @@ void mtk_ovl_adaptor_disable_vblank(struct device *dev) mtk_ethdr_disable_vblank(ovl_adaptor->ovl_adaptor_comp[OVL_ADAPTOR_ETHDR0]); } +u32 mtk_ovl_adaptor_get_blend_modes(struct device *dev) +{ + struct mtk_disp_ovl_adaptor *ovl_adaptor = dev_get_drvdata(dev); + + return mtk_ethdr_get_blend_modes(ovl_adaptor->ovl_adaptor_comp[OVL_ADAPTOR_ETHDR0]); +} + const u32 *mtk_ovl_adaptor_get_formats(struct device *dev) { struct mtk_disp_ovl_adaptor *ovl_adaptor = dev_get_drvdata(dev); diff --git a/drivers/gpu/drm/mediatek/mtk_ethdr.c b/drivers/gpu/drm/mediatek/mtk_ethdr.c index d1d9cf8b10e1..0f22e7d337cb 100644 --- a/drivers/gpu/drm/mediatek/mtk_ethdr.c +++ b/drivers/gpu/drm/mediatek/mtk_ethdr.c @@ -145,6 +145,13 @@ static irqreturn_t mtk_ethdr_irq_handler(int irq, void *dev_id) return IRQ_HANDLED; } +u32 mtk_ethdr_get_blend_modes(struct device *dev) +{ + return BIT(DRM_MODE_BLEND_PREMULTI) | + BIT(DRM_MODE_BLEND_COVERAGE) | + BIT(DRM_MODE_BLEND_PIXEL_NONE); +} + void mtk_ethdr_layer_config(struct device *dev, unsigned int idx, struct mtk_plane_state *state, struct cmdq_pkt *cmdq_pkt) diff --git a/drivers/gpu/drm/mediatek/mtk_ethdr.h b/drivers/gpu/drm/mediatek/mtk_ethdr.h index 81af9edea3f7..a72aeee46829 100644 --- a/drivers/gpu/drm/mediatek/mtk_ethdr.h +++ b/drivers/gpu/drm/mediatek/mtk_ethdr.h @@ -13,6 +13,7 @@ void mtk_ethdr_clk_disable(struct device *dev); void mtk_ethdr_config(struct device *dev, unsigned int w, unsigned int h, unsigned int vrefresh, unsigned int bpc, struct cmdq_pkt *cmdq_pkt); +u32 mtk_ethdr_get_blend_modes(struct device *dev); void mtk_ethdr_layer_config(struct device *dev, unsigned int idx, struct mtk_plane_state *state, struct cmdq_pkt *cmdq_pkt); diff --git a/drivers/gpu/drm/mediatek/mtk_plane.c b/drivers/gpu/drm/mediatek/mtk_plane.c index 7d2cb4e0fafa..8a48b3b0a956 100644 --- a/drivers/gpu/drm/mediatek/mtk_plane.c +++ 
b/drivers/gpu/drm/mediatek/mtk_plane.c @@ -320,8 +320,8 @@ static const struct drm_plane_helper_funcs mtk_plane_helper_funcs = { int mtk_plane_init(struct drm_device *dev, struct drm_plane *plane, unsigned long possible_crtcs, enum drm_plane_type type, - unsigned int supported_rotations, const u32 *formats, - size_t num_formats, unsigned int plane_idx) + unsigned int supported_rotations, const u32 blend_modes, + const u32 *formats, size_t num_formats, unsigned int plane_idx) { int err; @@ -366,12 +366,11 @@ int mtk_plane_init(struct drm_device *dev, struct drm_plane *plane, if (err) DRM_ERROR("failed to create property: alpha\n"); - err = drm_plane_create_blend_mode_property(plane, - BIT(DRM_MODE_BLEND_PREMULTI) | - BIT(DRM_MODE_BLEND_COVERAGE) | - BIT(DRM_MODE_BLEND_PIXEL_NONE)); - if (err) - DRM_ERROR("failed to create property: blend_mode\n"); + if (blend_modes) { + err = drm_plane_create_blend_mode_property(plane, blend_modes); + if (err) + DRM_ERROR("failed to create property: blend_mode\n"); + } drm_plane_helper_add(plane, &mtk_plane_helper_funcs); diff --git a/drivers/gpu/drm/mediatek/mtk_plane.h b/drivers/gpu/drm/mediatek/mtk_plane.h index 5b177eac67b7..3b13b89989c7 100644 --- a/drivers/gpu/drm/mediatek/mtk_plane.h +++ b/drivers/gpu/drm/mediatek/mtk_plane.h @@ -48,6 +48,6 @@ to_mtk_plane_state(struct drm_plane_state *state) int mtk_plane_init(struct drm_device *dev, struct drm_plane *plane, unsigned long possible_crtcs, enum drm_plane_type type, - unsigned int supported_rotations, const u32 *formats, - size_t num_formats, unsigned int plane_idx); + unsigned int supported_rotations, const u32 blend_modes, + const u32 *formats, size_t num_formats, unsigned int plane_idx); #endif -- cgit From 655c6c1b7afe6d29f386f415594ee643e5e3d755 Mon Sep 17 00:00:00 2001 From: Hsin-Te Yuan Date: Wed, 16 Oct 2024 14:17:14 +0000 Subject: drm/mediatek: Fix color format MACROs in OVL In commit 9f428b95ac89 ("drm/mediatek: Add new color format MACROs in OVL"), some new color formats are defined in the MACROs to make the switch statement more concise. That commit was intended to be a no-op cleanup. However, there are typos in these formats MACROs, which cause the return value to be incorrect. Fix the typos to ensure the return value remains unchanged. 
Fixes: 9f428b95ac89 ("drm/mediatek: Add new color format MACROs in OVL") Reviewed-by: Douglas Anderson Reviewed-by: Matthias Brugger Signed-off-by: Hsin-Te Yuan Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: CK Hu Link: https://patchwork.kernel.org/project/dri-devel/patch/20241016-color-v3-1-e0f5f44a72d8@chromium.org/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_disp_ovl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c index 9786ce94de0e..e0c0bb01f65a 100644 --- a/drivers/gpu/drm/mediatek/mtk_disp_ovl.c +++ b/drivers/gpu/drm/mediatek/mtk_disp_ovl.c @@ -65,8 +65,8 @@ #define OVL_CON_CLRFMT_RGB (1 << 12) #define OVL_CON_CLRFMT_ARGB8888 (2 << 12) #define OVL_CON_CLRFMT_RGBA8888 (3 << 12) -#define OVL_CON_CLRFMT_ABGR8888 (OVL_CON_CLRFMT_RGBA8888 | OVL_CON_BYTE_SWAP) -#define OVL_CON_CLRFMT_BGRA8888 (OVL_CON_CLRFMT_ARGB8888 | OVL_CON_BYTE_SWAP) +#define OVL_CON_CLRFMT_ABGR8888 (OVL_CON_CLRFMT_ARGB8888 | OVL_CON_BYTE_SWAP) +#define OVL_CON_CLRFMT_BGRA8888 (OVL_CON_CLRFMT_RGBA8888 | OVL_CON_BYTE_SWAP) #define OVL_CON_CLRFMT_UYVY (4 << 12) #define OVL_CON_CLRFMT_YUYV (5 << 12) #define OVL_CON_MTX_YUV_TO_RGB (6 << 16) -- cgit From f54f0d0e2b1f74de85ff02013fa4886e4154aca5 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Thu, 17 Oct 2024 10:45:24 -0700 Subject: nvme: enhance cns version checking The number of CNS bits in the command is specific to the nvme spec version compliance. The existing check is not sufficient for possible CNS values the driver uses that may create confusion between host and device, so enhance the check to consider the version and desired CNS value. Reviewed-by: Sagi Grimberg Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch --- drivers/nvme/host/core.c | 37 +++++++++++++++++++++++++------------ 1 file changed, 25 insertions(+), 12 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 876c8e6311db..09466e208729 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1390,17 +1390,30 @@ static void nvme_update_keep_alive(struct nvme_ctrl *ctrl, nvme_start_keep_alive(ctrl); } -/* - * In NVMe 1.0 the CNS field was just a binary controller or namespace - * flag, thus sending any new CNS opcodes has a big chance of not working. - * Qemu unfortunately had that bug after reporting a 1.1 version compliance - * (but not for any later version). - */ -static bool nvme_ctrl_limited_cns(struct nvme_ctrl *ctrl) +static bool nvme_id_cns_ok(struct nvme_ctrl *ctrl, u8 cns) { - if (ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS) - return ctrl->vs < NVME_VS(1, 2, 0); - return ctrl->vs < NVME_VS(1, 1, 0); + /* + * The CNS field occupies a full byte starting with NVMe 1.2 + */ + if (ctrl->vs >= NVME_VS(1, 2, 0)) + return true; + + /* + * NVMe 1.1 expanded the CNS value to two bits, which means values + * larger than that could get truncated and treated as an incorrect + * value. + * + * Qemu implemented 1.0 behavior for controllers claiming 1.1 + * compliance, so they need to be quirked here. + */ + if (ctrl->vs >= NVME_VS(1, 1, 0) && + !(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) + return cns <= 3; + + /* + * NVMe 1.0 used a single bit for the CNS value. 
+ */ + return cns <= 1; } static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id) @@ -3104,7 +3117,7 @@ static int nvme_init_non_mdts_limits(struct nvme_ctrl *ctrl) ctrl->max_zeroes_sectors = 0; if (ctrl->subsys->subtype != NVME_NQN_NVME || - nvme_ctrl_limited_cns(ctrl) || + !nvme_id_cns_ok(ctrl, NVME_ID_CNS_CS_CTRL) || test_bit(NVME_CTRL_SKIP_ID_CNS_CS, &ctrl->flags)) return 0; @@ -4200,7 +4213,7 @@ static void nvme_scan_work(struct work_struct *work) } mutex_lock(&ctrl->scan_lock); - if (nvme_ctrl_limited_cns(ctrl)) { + if (!nvme_id_cns_ok(ctrl, NVME_ID_CNS_NS_ACTIVE_LIST)) { nvme_scan_ns_sequential(ctrl); } else { /* -- cgit From d0ccf760a405d243a49485be0a43bd5b66ed17e2 Mon Sep 17 00:00:00 2001 From: Georgi Djakov Date: Wed, 9 Oct 2024 02:16:15 +0300 Subject: spi: geni-qcom: Fix boot warning related to pm_runtime and devres During boot, users sometimes observe the following warning: [7.841431] WARNING: CPU: 4 PID: 492 at drivers/interconnect/core.c:685 __icc_enable (drivers/interconnect/core.c:685 (discriminator 7)) [..] [7.841541] Call trace: [7.841542] __icc_enable (drivers/interconnect/core.c:685 (discriminator 7)) [7.841545] icc_disable (drivers/interconnect/core.c:708) [7.841547] geni_icc_disable (drivers/soc/qcom/qcom-geni-se.c:862) [7.841553] spi_geni_runtime_suspend+0x3c/0x4c spi_geni_qcom This occurs when the spi-geni driver receives an -EPROBE_DEFER error from spi_geni_grab_gpi_chan(), causing devres to start releasing all resources as shown below: [7.138679] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_icc_release (8 bytes) [7.138751] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_icc_release (8 bytes) [7.138827] geni_spi 880000.spi: DEVRES REL ffff800081443800 pm_runtime_disable_action (16 bytes) [7.139494] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_pm_opp_config_release (16 bytes) [7.139512] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_spi_release_controller (8 bytes) [7.139516] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_clk_release (16 bytes) [7.139519] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_ioremap_release (8 bytes) [7.139524] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_region_release (24 bytes) [7.139527] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_kzalloc_release (22 bytes) [7.139530] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_pinctrl_release (8 bytes) [7.139539] geni_spi 880000.spi: DEVRES REL ffff800081443800 devm_kzalloc_release (40 bytes) The issue here is that pm_runtime_disable_action() results in a call to spi_geni_runtime_suspend(), which attempts to suspend the device and disable an interconnect path that devm_icc_release() has just released. Resolve this by calling geni_icc_get() before enabling runtime PM. This approach ensures that when devres releases resources in reverse order, it will start with pm_runtime_disable_action(), suspending the device, and then proceed to free the remaining resources. 
Reported-by: Naresh Kamboju Reported-by: Linux Kernel Functional Testing Closes: https://lore.kernel.org/r/CA+G9fYtsjFtddG8i+k-SpV8U6okL0p4zpsTiwGfNH5GUA8dWAA@mail.gmail.com Fixes: 89e362c883c6 ("spi: geni-qcom: Undo runtime PM changes at driver exit time") Signed-off-by: Georgi Djakov Link: https://patch.msgid.link/20241008231615.430073-1-djakov@kernel.org Signed-off-by: Mark Brown --- drivers/spi/spi-geni-qcom.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c index f6e40f90418f..768d7482102a 100644 --- a/drivers/spi/spi-geni-qcom.c +++ b/drivers/spi/spi-geni-qcom.c @@ -1116,6 +1116,11 @@ static int spi_geni_probe(struct platform_device *pdev) init_completion(&mas->tx_reset_done); init_completion(&mas->rx_reset_done); spin_lock_init(&mas->lock); + + ret = geni_icc_get(&mas->se, NULL); + if (ret) + return ret; + pm_runtime_use_autosuspend(&pdev->dev); pm_runtime_set_autosuspend_delay(&pdev->dev, 250); ret = devm_pm_runtime_enable(dev); @@ -1125,9 +1130,6 @@ static int spi_geni_probe(struct platform_device *pdev) if (device_property_read_bool(&pdev->dev, "spi-slave")) spi->target = true; - ret = geni_icc_get(&mas->se, NULL); - if (ret) - return ret; /* Set the bus quota to a reasonable value for register access */ mas->se.icc_paths[GENI_TO_CORE].avg_bw = Bps_to_icc(CORE_2X_50_MHZ); mas->se.icc_paths[CPU_TO_GENI].avg_bw = GENI_DEFAULT_BW; -- cgit From 4b60a5655528786bf659e9627fb0b45900f4cc66 Mon Sep 17 00:00:00 2001 From: Elena Salomatkina Date: Wed, 23 Oct 2024 00:37:08 +0300 Subject: sumversion: Fix a memory leak in get_src_version() strsep() modifies its first argument - buf. An invalid pointer will be passed to the free() function. Make the pointer passed to free() match the return value of read_text_file(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 9413e7640564 ("kbuild: split the second line of *.mod into *.usyms") Signed-off-by: Elena Salomatkina Signed-off-by: Masahiro Yamada --- scripts/mod/sumversion.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/scripts/mod/sumversion.c b/scripts/mod/sumversion.c index e7d2da45b0df..6de9af17599d 100644 --- a/scripts/mod/sumversion.c +++ b/scripts/mod/sumversion.c @@ -392,7 +392,7 @@ out_file: /* Calc and record src checksum. */ void get_src_version(const char *modname, char sum[], unsigned sumlen) { - char *buf; + char *buf, *pos; struct md4_ctx md; char *fname; char filelist[PATH_MAX + 1]; @@ -401,9 +401,10 @@ void get_src_version(const char *modname, char sum[], unsigned sumlen) snprintf(filelist, sizeof(filelist), "%s.mod", modname); buf = read_text_file(filelist); + pos = buf; md4_init(&md); - while ((fname = strsep(&buf, "\n"))) { + while ((fname = strsep(&pos, "\n"))) { if (!*fname) continue; if (!(is_static_library(fname)) && -- cgit From 2b059d0d1e624adc6e69a754bc48057f8bf459dc Mon Sep 17 00:00:00 2001 From: Pei Xiao Date: Wed, 23 Oct 2024 14:21:17 +0800 Subject: slub/kunit: fix a WARNING due to unwrapped __kmalloc_cache_noprof 'modprobe slub_kunit' will have a warning as shown below. The root cause is that __kmalloc_cache_noprof was directly used, which resulted in no alloc_tag being allocated. This caused current->alloc_tag to be null, leading to a warning in alloc_tag_add_check. Let's add an alloc_hook layer to __kmalloc_cache_noprof specifically within lib/slub_kunit.c, which is the only user of this internal slub function outside kmalloc implementation itself. 
[58162.947016] WARNING: CPU: 2 PID: 6210 at ./include/linux/alloc_tag.h:125 alloc_tagging_slab_alloc_hook+0x268/0x27c [58162.957721] Call trace: [58162.957919] alloc_tagging_slab_alloc_hook+0x268/0x27c [58162.958286] __kmalloc_cache_noprof+0x14c/0x344 [58162.958615] test_kmalloc_redzone_access+0x50/0x10c [slub_kunit] [58162.959045] kunit_try_run_case+0x74/0x184 [kunit] [58162.959401] kunit_generic_run_threadfn_adapter+0x2c/0x4c [kunit] [58162.959841] kthread+0x10c/0x118 [58162.960093] ret_from_fork+0x10/0x20 [58162.960363] ---[ end trace 0000000000000000 ]--- Signed-off-by: Pei Xiao Fixes: a0a44d9175b3 ("mm, slab: don't wrap internal functions with alloc_hooks()") Signed-off-by: Vlastimil Babka --- lib/slub_kunit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c index 80e39f003344..33564f965958 100644 --- a/lib/slub_kunit.c +++ b/lib/slub_kunit.c @@ -141,7 +141,7 @@ static void test_kmalloc_redzone_access(struct kunit *test) { struct kmem_cache *s = test_kmem_cache_create("TestSlub_RZ_kmalloc", 32, SLAB_KMALLOC|SLAB_STORE_USER|SLAB_RED_ZONE); - u8 *p = __kmalloc_cache_noprof(s, GFP_KERNEL, 18); + u8 *p = alloc_hooks(__kmalloc_cache_noprof(s, GFP_KERNEL, 18)); kasan_disable_current(); -- cgit From 3ded11b5c1b476f6d027d9017aa7deb8ab381ec1 Mon Sep 17 00:00:00 2001 From: Liankun Yang Date: Mon, 23 Sep 2024 21:24:15 +0800 Subject: drm/mediatek: Fix get efuse issue for MT8188 DPTX Update efuse data for MT8188 displayport. The DP monitor can not display when DUT connected to USB-c to DP dongle. Analysis view is invalid DP efuse data. Fixes: 350c3fe907fb ("drm/mediatek: dp: Add support MT8188 dp/edp function") Reviewed-by: Matthias Brugger Reviewed-by: AngeloGioacchino Del Regno Signed-off-by: Liankun Yang Reviewed-by: Fei Shao Tested-by: Fei Shao Reviewed-by: CK Hu Link: https://patchwork.kernel.org/project/dri-devel/patch/20240923132521.22785-1-liankun.yang@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_dp.c | 85 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 84 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/mediatek/mtk_dp.c b/drivers/gpu/drm/mediatek/mtk_dp.c index d8796a904eca..f2bee617f063 100644 --- a/drivers/gpu/drm/mediatek/mtk_dp.c +++ b/drivers/gpu/drm/mediatek/mtk_dp.c @@ -145,6 +145,89 @@ struct mtk_dp_data { u16 audio_m_div2_bit; }; +static const struct mtk_dp_efuse_fmt mt8188_dp_efuse_fmt[MTK_DP_CAL_MAX] = { + [MTK_DP_CAL_GLB_BIAS_TRIM] = { + .idx = 0, + .shift = 10, + .mask = 0x1f, + .min_val = 1, + .max_val = 0x1e, + .default_val = 0xf, + }, + [MTK_DP_CAL_CLKTX_IMPSE] = { + .idx = 0, + .shift = 15, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_PMOS_0] = { + .idx = 1, + .shift = 0, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_PMOS_1] = { + .idx = 1, + .shift = 8, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_PMOS_2] = { + .idx = 1, + .shift = 16, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_PMOS_3] = { + .idx = 1, + .shift = 24, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_NMOS_0] = { + .idx = 1, + .shift = 4, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_NMOS_1] = { + .idx = 1, + .shift = 12, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, 
+ .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_NMOS_2] = { + .idx = 1, + .shift = 20, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, + [MTK_DP_CAL_LN_TX_IMPSEL_NMOS_3] = { + .idx = 1, + .shift = 28, + .mask = 0xf, + .min_val = 1, + .max_val = 0xe, + .default_val = 0x8, + }, +}; + static const struct mtk_dp_efuse_fmt mt8195_edp_efuse_fmt[MTK_DP_CAL_MAX] = { [MTK_DP_CAL_GLB_BIAS_TRIM] = { .idx = 3, @@ -2771,7 +2854,7 @@ static SIMPLE_DEV_PM_OPS(mtk_dp_pm_ops, mtk_dp_suspend, mtk_dp_resume); static const struct mtk_dp_data mt8188_dp_data = { .bridge_type = DRM_MODE_CONNECTOR_DisplayPort, .smc_cmd = MTK_DP_SIP_ATF_VIDEO_UNMUTE, - .efuse_fmt = mt8195_dp_efuse_fmt, + .efuse_fmt = mt8188_dp_efuse_fmt, .audio_supported = true, .audio_pkt_in_hblank_area = true, .audio_m_div2_bit = MT8188_AUDIO_M_CODE_MULT_DIV_SEL_DP_ENC0_P0_DIV_2, -- cgit From 4018651ba5c409034149f297d3dd3328b91561fd Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Thu, 12 Sep 2024 11:44:59 +0300 Subject: drm/mediatek: Fix potential NULL dereference in mtk_crtc_destroy() In mtk_crtc_create(), if the call to mbox_request_channel() fails then we set the "mtk_crtc->cmdq_client.chan" pointer to NULL. In that situation, we do not call cmdq_pkt_create(). During the cleanup, we need to check if the "mtk_crtc->cmdq_client.chan" is NULL first before calling cmdq_pkt_destroy(). Calling cmdq_pkt_destroy() is unnecessary if we didn't call cmdq_pkt_create() and it will result in a NULL pointer dereference. Fixes: 7627122fd1c0 ("drm/mediatek: Add cmdq_handle in mtk_crtc") Signed-off-by: Dan Carpenter Reviewed-by: AngeloGioacchino Del Regno Reviewed-by: CK Hu Link: https://patchwork.kernel.org/project/dri-devel/patch/cc537bd6-837f-4c85-a37b-1a007e268310@stanley.mountain/ Signed-off-by: Chun-Kuang Hu --- drivers/gpu/drm/mediatek/mtk_crtc.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_crtc.c b/drivers/gpu/drm/mediatek/mtk_crtc.c index b65f196f2015..eb0e1233ad04 100644 --- a/drivers/gpu/drm/mediatek/mtk_crtc.c +++ b/drivers/gpu/drm/mediatek/mtk_crtc.c @@ -127,9 +127,8 @@ static void mtk_crtc_destroy(struct drm_crtc *crtc) mtk_mutex_put(mtk_crtc->mutex); #if IS_REACHABLE(CONFIG_MTK_CMDQ) - cmdq_pkt_destroy(&mtk_crtc->cmdq_client, &mtk_crtc->cmdq_handle); - if (mtk_crtc->cmdq_client.chan) { + cmdq_pkt_destroy(&mtk_crtc->cmdq_client, &mtk_crtc->cmdq_handle); mbox_free_channel(mtk_crtc->cmdq_client.chan); mtk_crtc->cmdq_client.chan = NULL; } -- cgit From af6ab107ce2c338790c6629fe0edc0333e708be8 Mon Sep 17 00:00:00 2001 From: Macpaul Lin Date: Thu, 3 Oct 2024 11:09:19 +0800 Subject: dt-bindings: display: mediatek: dpi: correct power-domains property The MediaTek DPI module is typically associated with one of the following multimedia power domains: - POWER_DOMAIN_DISPLAY - POWER_DOMAIN_VDOSYS - POWER_DOMAIN_MM The specific power domain used varies depending on the SoC design. These power domains are shared by multiple devices within the SoC. In most cases, these power domains are enabled by other devices. As a result, the DPI module of legacy SoCs often functions correctly even without explicit configuration. It is recommended to explicitly add the appropriate power domain property to the DPI node in the device tree. Hence drop the compatible checking for specific SoCs. 
Fixes: 5474d49b2f79 ("dt-bindings: display: mediatek: dpi: Add power domains") Signed-off-by: Macpaul Lin Signed-off-by: Jitao Shi Signed-off-by: Pablo Sun Reviewed-by: Krzysztof Kozlowski Link: https://patchwork.kernel.org/project/dri-devel/patch/20241003030919.17980-4-macpaul.lin@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- .../bindings/display/mediatek/mediatek,dpi.yaml | 24 +++++++++------------- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml index 3a82aec9021c..497c0eb4ed0b 100644 --- a/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml +++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml @@ -63,6 +63,16 @@ properties: - const: sleep power-domains: + description: | + The MediaTek DPI module is typically associated with one of the + following multimedia power domains: + POWER_DOMAIN_DISPLAY + POWER_DOMAIN_VDOSYS + POWER_DOMAIN_MM + The specific power domain used varies depending on the SoC design. + + It is recommended to explicitly add the appropriate power domain + property to the DPI node in the device tree. maxItems: 1 port: @@ -79,20 +89,6 @@ required: - clock-names - port -allOf: - - if: - not: - properties: - compatible: - contains: - enum: - - mediatek,mt6795-dpi - - mediatek,mt8173-dpi - - mediatek,mt8186-dpi - then: - properties: - power-domains: false - additionalProperties: false examples: -- cgit From ecabac70ff919580324b407818ee3e6c0004dcf8 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Tue, 15 Oct 2024 17:03:37 -0300 Subject: perf trace augmented_raw_syscalls: Add extra array index bounds checking to satisfy some BPF verifiers In a RHEL8 kernel (4.18.0-513.11.1.el8_9.x86_64), that, as enterprise kernels go, have backports from modern kernels, the verifier complains about lack of bounds check for the index into the array of syscall arguments, on a BPF bytecode generated by clang 17, with: ; } else if (size < 0 && size >= -6) { /* buffer */ 116: (b7) r1 = -6 117: (2d) if r1 > r6 goto pc-30 R0=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=inv-6 R2=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3=inv(id=0) R5=inv40 R6=inv(id=0,umin_value=18446744073709551610,var_off=(0xffffffff00000000; 0xffffffff)) R7=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8=invP6 R9=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=map_value fp-32=inv40 fp-40=ctx fp-48=map_value fp-56=inv1 fp-64=map_value fp-72=map_value fp-80=map_value ; index = -(size + 1); 118: (a7) r6 ^= -1 119: (67) r6 <<= 32 120: (77) r6 >>= 32 ; aug_size = args->args[index]; 121: (67) r6 <<= 3 122: (79) r1 = *(u64 *)(r10 -24) 123: (0f) r1 += r6 last_idx 123 first_idx 116 regs=40 stack=0 before 122: (79) r1 = *(u64 *)(r10 -24) regs=40 stack=0 before 121: (67) r6 <<= 3 regs=40 stack=0 before 120: (77) r6 >>= 32 regs=40 stack=0 before 119: (67) r6 <<= 32 regs=40 stack=0 before 118: (a7) r6 ^= -1 regs=40 stack=0 before 117: (2d) if r1 > r6 goto pc-30 regs=42 stack=0 before 116: (b7) r1 = -6 R0_w=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=inv1 R2_w=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3_w=inv(id=0) R5_w=inv40 R6_rw=invP(id=0,smin_value=-2147483648,smax_value=0) R7_w=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8_w=invP6 R9_w=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16_w=map_value fp-24_r=map_value fp-32_w=inv40 fp-40=ctx fp-48=map_value fp-56_w=inv1 
fp-64_w=map_value fp-72=map_value fp-80=map_value parent didn't have regs=40 stack=0 marks last_idx 110 first_idx 98 regs=40 stack=0 before 110: (6d) if r1 s> r6 goto pc+5 regs=42 stack=0 before 109: (b7) r1 = 1 regs=40 stack=0 before 108: (65) if r6 s> 0x1000 goto pc+7 regs=40 stack=0 before 98: (55) if r6 != 0x1 goto pc+9 R0_w=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=invP12 R2_w=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3_rw=inv(id=0) R5_w=inv24 R6_rw=invP(id=0,smin_value=-2147483648,smax_value=2147483647) R7_w=map_value(id=0,off=40,ks=4,vs=8272,imm=0) R8_rw=invP4 R9_w=map_value(id=0,off=12,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16_rw=map_value fp-24_r=map_value fp-32_rw=invP24 fp-40_r=ctx fp-48_r=map_value fp-56_w=invP1 fp-64_rw=map_value fp-72_r=map_value fp-80_r=map_value parent already had regs=40 stack=0 marks 124: (79) r6 = *(u64 *)(r1 +16) R0=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=map_value(id=0,off=0,ks=4,vs=8272,umax_value=34359738360,var_off=(0x0; 0x7fffffff8),s32_max_value=2147483640,u32_max_value=-8) R2=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3=inv(id=0) R5=inv40 R6_w=invP(id=0,umax_value=34359738360,var_off=(0x0; 0x7fffffff8),s32_max_value=2147483640,u32_max_value=-8) R7=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8=invP6 R9=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=map_value fp-32=inv40 fp-40=ctx fp-48=map_value fp-56=inv1 fp-64=map_value fp-72=map_value fp-80=map_value R1 unbounded memory access, make sure to bounds check any such access processed 466 insns (limit 1000000) max_states_per_insn 2 total_states 20 peak_states 20 mark_read 3 If we add this line, as used in other BPF programs, to cap that index: index &= 7; The generated BPF program is considered safe by that version of the BPF verifier, allowing perf to collect the syscall args in one more kernel using the BPF based pointer contents collector. With the above one-liner it works with that kernel: [root@dell-per740-01 ~]# uname -a Linux dell-per740-01.khw.eng.rdu2.dc.redhat.com 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux [root@dell-per740-01 ~]# ~acme/bin/perf trace -e *sleep* sleep 1.234567890 0.000 (1234.704 ms): sleep/3863610 nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 234567890 }) = 0 [root@dell-per740-01 ~]# As well as with the one in Fedora 40: root@number:~# uname -a Linux number 6.11.3-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:31:19 UTC 2024 x86_64 GNU/Linux root@number:~# perf trace -e *sleep* sleep 1.234567890 0.000 (1234.722 ms): sleep/14873 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 234567890 }, rmtp: 0x7ffe87311a40) = 0 root@number:~# Song Liu reported that this one-liner was being optimized out by clang 18, so I suggested and he tested that adding a compiler barrier before it made clang v18 to keep it and the verifier in the kernel in Song's case (Meta's 5.12 based kernel) also was happy with the resulting bytecode. I'll investigate using virtme-ng[1] to have all the perf BPF based functionality thoroughly tested over multiple kernels and clang versions. 
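The same pattern is useful wherever a computed index feeds an array access in BPF; a self-contained sketch (not the exact perf code; the array size here is chosen so the mask is an exact bound):

    #include <linux/types.h>
    #include <bpf/bpf_helpers.h>

    struct eight_args {
            __u64 vals[8];
    };

    static __always_inline __u64 bounded_read(struct eight_args *a, unsigned int index)
    {
            barrier_var(index);     /* opaque to clang, so the clamp below survives */
            index &= 7;             /* the verifier now sees index in [0, 7] */
            return a->vals[index];
    }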
[1] https://kernel-recipes.org/en/2024/virtme-ng/ Cc: Adrian Hunter Cc: Alan Maguire Cc: Alexander Shishkin Cc: Andrea Righi Cc: Howard Chu Cc: Ian Rogers Cc: Ingo Molnar Cc: James Clark Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Song Liu Link: https://lore.kernel.org/lkml/Zw7JgJc0LOwSpuvx@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c index b2f17cca014b..31df5f0cb14b 100644 --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c @@ -477,6 +477,8 @@ static int augment_sys_enter(void *ctx, struct syscall_enter_args *args) augmented = true; } else if (size < 0 && size >= -6) { /* buffer */ index = -(size + 1); + barrier_var(index); // Prevent clang (noticed with v18) from removing the &= 7 trick. + index &= 7; // Satisfy the bounds checking with the verifier in some kernels. aug_size = args->args[index]; if (aug_size > TRACE_AUG_MAX_BUF) -- cgit From 395d38419f1853decab84acc16176b3fa5c96690 Mon Sep 17 00:00:00 2001 From: Howard Chu Date: Thu, 10 Oct 2024 19:14:02 -0700 Subject: perf trace augmented_raw_syscalls: Add more checks to pass the verifier Add some more checks to pass the verifier in more kernels. Signed-off-by: Howard Chu Tested-by: Arnaldo Carvalho de Melo Cc: Adrian Hunter Cc: Alan Maguire Cc: Alexander Shishkin Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Namhyung Kim Cc: Peter Zijlstra Link: https://lore.kernel.org/r/20241011021403.4089793-3-howardchu95@gmail.com [ Reduced the patch removing things that can be done later ] Signed-off-by: Arnaldo Carvalho de Melo --- .../perf/util/bpf_skel/augmented_raw_syscalls.bpf.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c index 31df5f0cb14b..4a62ed593e84 100644 --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c @@ -288,6 +288,10 @@ int sys_enter_rename(struct syscall_enter_args *args) augmented_args->arg.size = PERF_ALIGN(oldpath_len + 1, sizeof(u64)); len += augmented_args->arg.size; + /* Every read from userspace is limited to value size */ + if (augmented_args->arg.size > sizeof(augmented_args->arg.value)) + return 1; /* Failure: don't filter */ + struct augmented_arg *arg2 = (void *)&augmented_args->arg.value + augmented_args->arg.size; newpath_len = augmented_arg__read_str(arg2, newpath_arg, sizeof(augmented_args->arg.value)); @@ -315,6 +319,10 @@ int sys_enter_renameat2(struct syscall_enter_args *args) augmented_args->arg.size = PERF_ALIGN(oldpath_len + 1, sizeof(u64)); len += augmented_args->arg.size; + /* Every read from userspace is limited to value size */ + if (augmented_args->arg.size > sizeof(augmented_args->arg.value)) + return 1; /* Failure: don't filter */ + struct augmented_arg *arg2 = (void *)&augmented_args->arg.value + augmented_args->arg.size; newpath_len = augmented_arg__read_str(arg2, newpath_arg, sizeof(augmented_args->arg.value)); @@ -423,8 +431,9 @@ static bool pid_filter__has(struct pids_filtered *pids, pid_t pid) static int augment_sys_enter(void *ctx, struct syscall_enter_args *args) { bool augmented, do_output = false; - int zero = 0, size, 
aug_size, index, output = 0, + int zero = 0, size, aug_size, index, value_size = sizeof(struct augmented_arg) - offsetof(struct augmented_arg, value); + u64 output = 0; /* has to be u64, otherwise it won't pass the verifier */ unsigned int nr, *beauty_map; struct beauty_payload_enter *payload; void *arg, *payload_offset; @@ -490,10 +499,17 @@ static int augment_sys_enter(void *ctx, struct syscall_enter_args *args) } } + /* Augmented data size is limited to sizeof(augmented_arg->unnamed union with value field) */ + if (aug_size > value_size) + aug_size = value_size; + /* write data to payload */ if (augmented) { int written = offsetof(struct augmented_arg, value) + aug_size; + if (written < 0 || written > sizeof(struct augmented_arg)) + return 1; + ((struct augmented_arg *)payload_offset)->size = aug_size; output += written; payload_offset += written; @@ -501,7 +517,7 @@ static int augment_sys_enter(void *ctx, struct syscall_enter_args *args) } } - if (!do_output) + if (!do_output || (sizeof(struct syscall_enter_args) + output) > sizeof(struct beauty_payload_enter)) return 1; return augmented__beauty_output(ctx, payload, sizeof(struct syscall_enter_args) + output); -- cgit From 7fbff3c0e085745b99f220ad56fcee3ea9643d87 Mon Sep 17 00:00:00 2001 From: Howard Chu Date: Thu, 10 Oct 2024 19:14:01 -0700 Subject: perf build: Change the clang check back to 12.0.1 This serves as a revert for this patch: https://lore.kernel.org/linux-perf-users/ZuGL9ROeTV2uXoSp@x1/ Signed-off-by: Howard Chu Tested-by: James Clark Cc: Adrian Hunter Cc: Alan Maguire Cc: Alexander Shishkin Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Namhyung Kim Cc: Peter Zijlstra Link: https://lore.kernel.org/r/20241011021403.4089793-2-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/Makefile.config | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config index 4ddb27a48eed..d4332675babb 100644 --- a/tools/perf/Makefile.config +++ b/tools/perf/Makefile.config @@ -704,8 +704,8 @@ ifeq ($(BUILD_BPF_SKEL),1) BUILD_BPF_SKEL := 0 else CLANG_VERSION := $(shell $(CLANG) --version | head -1 | sed 's/.*clang version \([[:digit:]]\+.[[:digit:]]\+.[[:digit:]]\+\).*/\1/g') - ifeq ($(call version-lt3,$(CLANG_VERSION),16.0.6),1) - $(warning Warning: Disabled BPF skeletons as at least $(CLANG) version 16.0.6 is reported to be a working setup with the current of BPF based perf features) + ifeq ($(call version-lt3,$(CLANG_VERSION),12.0.1),1) + $(warning Warning: Disabled BPF skeletons as reliable BTF generation needs at least $(CLANG) version 12.0.1) BUILD_BPF_SKEL := 0 endif endif -- cgit From 5d35634ecc2d2c3938bd7dc23df0ad046da1b303 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Tue, 22 Oct 2024 17:22:36 -0300 Subject: perf trace: Fix non-listed archs in the syscalltbl routines This fixes a build breakage on 32-bit arm, where the syscalltbl__id_at_idx() function was missing. Committer notes: Generating a proper syscall table from a copy of arch/arm/tools/syscall.tbl ends up being too big a patch for this rc stage, I started doing it but while testing noticed some other problems with using BPF to collect pointer args on arm7 (32-bit) will maybe continue trying to make it work on the next cycle... 
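The contract being restored is small enough to spell out; a self-contained sketch (names shortened for illustration, the real functions take a struct syscalltbl pointer):

    static const char *const fallback_tbl[] = { [0] = "unknown" };
    static const int fallback_max_id = 0;   /* a single slot */

    static int fallback_id_at_idx(int idx)
    {
            return idx;                     /* the slot index doubles as the syscall id */
    }

    static const char *fallback_name(int id)
    {
            return (id >= 0 && id <= fallback_max_id) ? fallback_tbl[id] : NULL;
    }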
Fixes: 7a2fb5619cc1fb53 ("perf trace: Fix iteration of syscall ids in syscalltbl->entries") Suggested-by: Howard Chu Signed-off-by: Acked-by: Namhyung Kim Cc: Adrian Hunter Cc: Howard Chu Cc: Ian Rogers Cc: Jiri Olsa Link: https://lore.kernel.org/lkml/3a592835-a14f-40be-8961-c0cee7720a94@kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/syscalltbl.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c index 7c15dec6900d..6c45ded922b6 100644 --- a/tools/perf/util/syscalltbl.c +++ b/tools/perf/util/syscalltbl.c @@ -46,6 +46,11 @@ static const char *const *syscalltbl_native = syscalltbl_mips_n64; #include const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID; static const char *const *syscalltbl_native = syscalltbl_loongarch; +#else +const int syscalltbl_native_max_id = 0; +static const char *const syscalltbl_native[] = { + [0] = "unknown", +}; #endif struct syscall { @@ -182,6 +187,11 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name) return audit_name_to_syscall(name, tbl->audit_machine); } +int syscalltbl__id_at_idx(struct syscalltbl *tbl __maybe_unused, int idx) +{ + return idx; +} + int syscalltbl__strglobmatch_next(struct syscalltbl *tbl __maybe_unused, const char *syscall_glob __maybe_unused, int *idx __maybe_unused) { -- cgit From d822ca29a4fc5278fb511790dace44836e8cc40d Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Tue, 22 Oct 2024 17:36:16 -0300 Subject: tools headers UAPI: Sync kvm headers with the kernel sources To pick the changes in: aa8d1f48d353b046 ("KVM: x86/mmu: Introduce a quirk to control memslot zap behavior") That don't change functionality in tools/perf, as no new ioctl is added for the 'perf trace' scripts to harvest. This addresses these perf build warnings: Warning: Kernel ABI header differences: diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h Please see tools/include/uapi/README for further details. Cc: Adrian Hunter Cc: Ian Rogers Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Cc: Paolo Bonzini Cc: Yan Zhao Link: https://lore.kernel.org/lkml/ZxgN0O02YrAJ2qIC@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/arch/x86/include/uapi/asm/kvm.h | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h index bf57a824f722..a8debbf2f702 100644 --- a/tools/arch/x86/include/uapi/asm/kvm.h +++ b/tools/arch/x86/include/uapi/asm/kvm.h @@ -439,6 +439,7 @@ struct kvm_sync_regs { #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4) #define KVM_X86_QUIRK_FIX_HYPERCALL_INSN (1 << 5) #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS (1 << 6) +#define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7) #define KVM_STATE_NESTED_FORMAT_VMX 0 #define KVM_STATE_NESTED_FORMAT_SVM 1 -- cgit From 3ad0edc46fb7668e75583ee6ebcce684f62ec4dc Mon Sep 17 00:00:00 2001 From: Moudy Ho Date: Mon, 7 Oct 2024 10:28:34 +0800 Subject: dt-bindings: display: mediatek: split: add subschema property constraints The display node in mt8195.dtsi was triggering a CHECK_DTBS error due to an excessively long 'clocks' property: display@14f06000: clocks: [[31, 14], [31, 43], [31, 44]] is too long To resolve this issue, the constraints for 'clocks' and other properties within the subschemas will be reinforced. 
Fixes: 739058a9c5c3 ("dt-bindings: display: mediatek: split: add compatible for MT8195") Signed-off-by: Macpaul Lin Signed-off-by: Moudy Ho Reviewed-by: Krzysztof Kozlowski Link: https://patchwork.kernel.org/project/dri-devel/patch/20241007022834.4609-1-moudy.ho@mediatek.com/ Signed-off-by: Chun-Kuang Hu --- .../bindings/display/mediatek/mediatek,split.yaml | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml index e4affc854f3d..4b6ff546757e 100644 --- a/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml +++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml @@ -38,6 +38,7 @@ properties: description: A phandle and PM domain specifier as defined by bindings of the power controller specified by phandle. See Documentation/devicetree/bindings/power/power-domain.yaml for details. + maxItems: 1 mediatek,gce-client-reg: description: @@ -57,6 +58,9 @@ properties: clocks: items: - description: SPLIT Clock + - description: Used for interfacing with the HDMI RX signal source. + - description: Paired with receiving HDMI RX metadata. + minItems: 1 required: - compatible @@ -72,9 +76,24 @@ allOf: const: mediatek,mt8195-mdp3-split then: + properties: + clocks: + minItems: 3 + required: - mediatek,gce-client-reg + - if: + properties: + compatible: + contains: + const: mediatek,mt8173-disp-split + + then: + properties: + clocks: + maxItems: 1 + additionalProperties: false examples: -- cgit From 894b00a3350c560990638bdf89bdf1f3d5491950 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 21 Oct 2024 14:00:10 +0200 Subject: kasan: Fix Software Tag-Based KASAN with GCC Per [1], -fsanitize=kernel-hwaddress with GCC currently does not disable instrumentation in functions with __attribute__((no_sanitize_address)). However, __attribute__((no_sanitize("hwaddress"))) does correctly disable instrumentation. Use it instead. 
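Concretely, the two spellings differ as below on a function that must not be instrumented (illustration only; as described above, affected GCC versions honour only the string form for -fsanitize=kernel-hwaddress):

    __attribute__((no_sanitize_address))        /* ignored by GCC for hwaddress */
    void not_instrumented_old(void) { }

    __attribute__((no_sanitize("hwaddress")))   /* instrumentation actually skipped */
    void not_instrumented_new(void) { }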
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117196 [1] Link: https://lore.kernel.org/r/000000000000f362e80620e27859@google.com Link: https://lore.kernel.org/r/ZvFGwKfoC4yVjN_X@J2N7QTR9R3 Link: https://bugzilla.kernel.org/show_bug.cgi?id=218854 Reported-by: syzbot+908886656a02769af987@syzkaller.appspotmail.com Tested-by: Andrey Konovalov Cc: Andrew Pinski Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Marco Elver Reviewed-by: Andrey Konovalov Fixes: 7b861a53e46b ("kasan: Bump required compiler version") Link: https://lore.kernel.org/r/20241021120013.3209481-1-elver@google.com Signed-off-by: Will Deacon --- include/linux/compiler-gcc.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index f805adaa316e..cd6f9aae311f 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -80,7 +80,11 @@ #define __noscs __attribute__((__no_sanitize__("shadow-call-stack"))) #endif +#ifdef __SANITIZE_HWADDRESS__ +#define __no_sanitize_address __attribute__((__no_sanitize__("hwaddress"))) +#else #define __no_sanitize_address __attribute__((__no_sanitize_address__)) +#endif #if defined(__SANITIZE_THREAD__) #define __no_sanitize_thread __attribute__((__no_sanitize_thread__)) -- cgit From 237ab03e301d21cc8fed631a8cdb5076c92ac263 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 21 Oct 2024 14:00:11 +0200 Subject: Revert "kasan: Disable Software Tag-Based KASAN with GCC" This reverts commit 7aed6a2c51ffc97a126e0ea0c270fab7af97ae18. Now that __no_sanitize_address attribute is fixed for KASAN_SW_TAGS with GCC, allow re-enabling KASAN_SW_TAGS with GCC. Cc: Andrey Konovalov Cc: Andrew Pinski Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Marco Elver Reviewed-by: Andrey Konovalov Link: https://lore.kernel.org/r/20241021120013.3209481-2-elver@google.com Signed-off-by: Will Deacon --- lib/Kconfig.kasan | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan index 233ab2096924..98016e137b7f 100644 --- a/lib/Kconfig.kasan +++ b/lib/Kconfig.kasan @@ -22,11 +22,8 @@ config ARCH_DISABLE_KASAN_INLINE config CC_HAS_KASAN_GENERIC def_bool $(cc-option, -fsanitize=kernel-address) -# GCC appears to ignore no_sanitize_address when -fsanitize=kernel-hwaddress -# is passed. See https://bugzilla.kernel.org/show_bug.cgi?id=218854 (and -# the linked LKML thread) for more details. config CC_HAS_KASAN_SW_TAGS - def_bool !CC_IS_GCC && $(cc-option, -fsanitize=kernel-hwaddress) + def_bool $(cc-option, -fsanitize=kernel-hwaddress) # This option is only required for software KASAN modes. # Old GCC versions do not have proper support for no_sanitize_address. @@ -101,7 +98,7 @@ config KASAN_SW_TAGS help Enables Software Tag-Based KASAN. - Requires Clang. + Requires GCC 11+ or Clang. Supported only on arm64 CPUs and relies on Top Byte Ignore. -- cgit From c83212d79be2c9886d3e6039759ecd388fd5fed1 Mon Sep 17 00:00:00 2001 From: Xiongfeng Wang Date: Wed, 16 Oct 2024 16:47:40 +0800 Subject: firmware: arm_sdei: Fix the input parameter of cpuhp_remove_state() In sdei_device_freeze(), the input parameter of cpuhp_remove_state() is passed as 'sdei_entry_point' by mistake. Change it to 'sdei_hp_state'. 
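The pairing being fixed looks roughly like this (simplified; callback and state names are placeholders rather than the exact ones in arm_sdei.c):

    #include <linux/cpuhotplug.h>

    static int hp_state;    /* dynamic state id, i.e. what sdei_hp_state holds */

    static int hp_register(int (*up)(unsigned int), int (*down)(unsigned int))
    {
            int err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "SDEI", up, down);

            if (err < 0)
                    return err;
            hp_state = err;                 /* remember the id for teardown */
            return 0;
    }

    static void hp_unregister(void)
    {
            cpuhp_remove_state(hp_state);   /* the saved id, not sdei_entry_point */
    }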
Fixes: d2c48b2387eb ("firmware: arm_sdei: Fix sleep from invalid context BUG") Signed-off-by: Xiongfeng Wang Reviewed-by: James Morse Link: https://lore.kernel.org/r/20241016084740.183353-1-wangxiongfeng2@huawei.com Signed-off-by: Will Deacon --- drivers/firmware/arm_sdei.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c index 285fe7ad490d..3e8051fe8296 100644 --- a/drivers/firmware/arm_sdei.c +++ b/drivers/firmware/arm_sdei.c @@ -763,7 +763,7 @@ static int sdei_device_freeze(struct device *dev) int err; /* unregister private events */ - cpuhp_remove_state(sdei_entry_point); + cpuhp_remove_state(sdei_hp_state); err = sdei_unregister_shared(); if (err) -- cgit From 90a88784cdb7757feb8dd520255e6cb861f30943 Mon Sep 17 00:00:00 2001 From: David Sterba Date: Tue, 22 Oct 2024 16:21:05 +0200 Subject: MIPS: export __cmpxchg_small() Export the symbol __cmpxchg_small() for btrfs.ko that uses it to store blk_status_t, which is u8. Reported by LKP: >> ERROR: modpost: "__cmpxchg_small" [fs/btrfs/btrfs.ko] undefined! Patch using the cmpxchg() https://lore.kernel.org/linux-btrfs/1d4f72f7fee285b2ddf4bf62b0ac0fd89def5417.1728575379.git.naohiro.aota@wdc.com/ Link: https://lore.kernel.org/all/20241016134919.GO1609@suse.cz/ Acked-by: Thomas Bogendoerfer Signed-off-by: David Sterba --- arch/mips/kernel/cmpxchg.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/mips/kernel/cmpxchg.c b/arch/mips/kernel/cmpxchg.c index e974a4954df8..c371def2302d 100644 --- a/arch/mips/kernel/cmpxchg.c +++ b/arch/mips/kernel/cmpxchg.c @@ -102,3 +102,4 @@ unsigned long __cmpxchg_small(volatile void *ptr, unsigned long old, return old; } } +EXPORT_SYMBOL(__cmpxchg_small); -- cgit From d48e1dea3931de64c26717adc2b89743c7ab6594 Mon Sep 17 00:00:00 2001 From: Naohiro Aota Date: Wed, 9 Oct 2024 22:52:06 +0900 Subject: btrfs: fix error propagation of split bios The purpose of btrfs_bbio_propagate_error() shall be propagating an error of split bio to its original btrfs_bio, and tell the error to the upper layer. However, it's not working well on some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called enough fast, orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for earlier split btrfs_bio's completion. That will result in NULL pointer dereference. That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS. 
BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9000006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? __die_body.cold+0x19/0x26 ? page_fault_oops+0x13e/0x2b0 ? _printk+0x58/0x73 ? do_user_addr_fault+0x5f/0x750 ? exc_page_fault+0x76/0x240 ? asm_exc_page_fault+0x22/0x30 ? btrfs_bio_end_io+0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs] btrfs_orig_write_end_io+0x51/0x90 [btrfs] dm_submit_bio+0x5c2/0xa50 [dm_mod] ? find_held_lock+0x2b/0x80 ? blk_try_enter_queue+0x90/0x1e0 __submit_bio+0xe0/0x130 ? ktime_get+0x10a/0x160 ? lockdep_hardirqs_on+0x74/0x100 submit_bio_noacct_nocheck+0x199/0x410 btrfs_submit_bio+0x7d/0x150 [btrfs] btrfs_submit_chunk+0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on+0x74/0x100 ? __folio_start_writeback+0x10/0x2c0 btrfs_submit_bbio+0x1c/0x40 [btrfs] submit_one_bio+0x44/0x60 [btrfs] submit_extent_folio+0x13f/0x330 [btrfs] ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs] extent_writepage_io+0x18b/0x360 [btrfs] extent_write_locked_range+0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs] run_delalloc_cow+0x71/0xd0 [btrfs] btrfs_run_delalloc_range+0x176/0x500 [btrfs] ? find_lock_delalloc_range+0x119/0x260 [btrfs] writepage_delalloc+0x2ab/0x480 [btrfs] extent_write_cache_pages+0x236/0x7d0 [btrfs] btrfs_writepages+0x72/0x130 [btrfs] do_writepages+0xd4/0x240 ? find_held_lock+0x2b/0x80 ? wbc_attach_and_unlock_inode+0x12c/0x290 ? wbc_attach_and_unlock_inode+0x12c/0x290 __writeback_single_inode+0x5c/0x4c0 ? do_raw_spin_unlock+0x49/0xb0 writeback_sb_inodes+0x22c/0x560 __writeback_inodes_wb+0x4c/0xe0 wb_writeback+0x1d6/0x3f0 wb_workfn+0x334/0x520 process_one_work+0x1ee/0x570 ? lock_is_held_type+0xc6/0x130 worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and return BLK_STS_OK to the upper layer. [2] Actually, this is not true. 
Because we only increases orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() is not working well if un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily over-written by orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e, bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo Reviewed-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn Signed-off-by: Naohiro Aota Signed-off-by: David Sterba --- fs/btrfs/bio.c | 37 +++++++++++++------------------------ fs/btrfs/bio.h | 3 +++ 2 files changed, 16 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c index ce13416bc10f..f83ec5a1baa6 100644 --- a/fs/btrfs/bio.c +++ b/fs/btrfs/bio.c @@ -49,6 +49,7 @@ void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info, bbio->end_io = end_io; bbio->private = private; atomic_set(&bbio->pending_ios, 1); + WRITE_ONCE(bbio->status, BLK_STS_OK); } /* @@ -120,41 +121,29 @@ static void __btrfs_bio_end_io(struct btrfs_bio *bbio) } } -static void btrfs_orig_write_end_io(struct bio *bio); - -static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio, - struct btrfs_bio *orig_bbio) -{ - /* - * For writes we tolerate nr_mirrors - 1 write failures, so we can't - * just blindly propagate a write failure here. Instead increment the - * error count in the original I/O context so that it is guaranteed to - * be larger than the error tolerance. - */ - if (bbio->bio.bi_end_io == &btrfs_orig_write_end_io) { - struct btrfs_io_stripe *orig_stripe = orig_bbio->bio.bi_private; - struct btrfs_io_context *orig_bioc = orig_stripe->bioc; - - atomic_add(orig_bioc->max_errors, &orig_bioc->error); - } else { - orig_bbio->bio.bi_status = bbio->bio.bi_status; - } -} - void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status) { bbio->bio.bi_status = status; if (bbio->bio.bi_pool == &btrfs_clone_bioset) { struct btrfs_bio *orig_bbio = bbio->private; - if (bbio->bio.bi_status) - btrfs_bbio_propagate_error(bbio, orig_bbio); btrfs_cleanup_bio(bbio); bbio = orig_bbio; } - if (atomic_dec_and_test(&bbio->pending_ios)) + /* + * At this point, bbio always points to the original btrfs_bio. Save + * the first error in it. + */ + if (status != BLK_STS_OK) + cmpxchg(&bbio->status, BLK_STS_OK, status); + + if (atomic_dec_and_test(&bbio->pending_ios)) { + /* Load split bio's error which might be set above. 
*/ + if (status == BLK_STS_OK) + bbio->bio.bi_status = READ_ONCE(bbio->status); __btrfs_bio_end_io(bbio); + } } static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror) diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h index e48612340745..e2fe16074ad6 100644 --- a/fs/btrfs/bio.h +++ b/fs/btrfs/bio.h @@ -79,6 +79,9 @@ struct btrfs_bio { /* File system that this I/O operates on. */ struct btrfs_fs_info *fs_info; + /* Save the first error status of split bio. */ + blk_status_t status; + /* * This member must come last, bio_alloc_bioset will allocate enough * bytes for entire btrfs_bio but relies on bio being last. -- cgit From e3dfd64c1f344ebec9397719244c27b360255855 Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Fri, 13 Sep 2024 09:23:40 -0700 Subject: perf: Fix missing RCU reader protection in perf_event_clear_cpumask() Running rcutorture scenario TREE05, the below warning is triggered. [ 32.604594] WARNING: suspicious RCU usage [ 32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted [ 32.607812] ----------------------------- [ 32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!! [ 32.611595] other info that might help us debug this: [ 32.614247] rcu_scheduler_active = 2, debug_locks = 1 [ 32.616392] 3 locks held by cpuhp/4/35: [ 32.617687] #0: ffffffffb666a650 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200 [ 32.620563] #1: ffffffffb666cd20 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200 [ 32.623412] #2: ffffffffb677c288 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context+0x32/0x2f0 In perf_event_clear_cpumask(), uses list_for_each_entry_rcu() without an obvious RCU read-side critical section. Either pmus_srcu or pmus_lock is good enough to protect the pmus list. In the current context, pmus_lock is already held. The list_for_each_entry_rcu() is not required. Fixes: 4ba4f1afb6a9 ("perf: Generic hotplug support for a PMU with a scope") Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/ Closes: https://lore.kernel.org/oe-lkp/202409131559.545634cc-oliver.sang@intel.com Reported-by: "Paul E. McKenney" Reported-by: kernel test robot Suggested-by: Peter Zijlstra Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Tested-by: "Paul E. McKenney" Link: https://lore.kernel.org/r/20240913162340.2142976-1-kan.liang@linux.intel.com --- kernel/events/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index cdd09769e6c5..df27d08a7232 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -13959,7 +13959,7 @@ static void perf_event_clear_cpumask(unsigned int cpu) } /* migrate */ - list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) { + list_for_each_entry(pmu, &pmus, entry) { if (pmu->scope == PERF_PMU_SCOPE_NONE || WARN_ON_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE)) continue; -- cgit From b55945c500c5723992504aa03b362fab416863a6 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 23 Oct 2024 11:36:41 +0200 Subject: sched: Fix pick_next_task_fair() vs try_to_wake_up() race Syzkaller robot reported KCSAN tripping over the ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task(). The report noted that both pick_next_task_fair() and try_to_wake_up() were concurrently trying to write to the same p->on_rq, violating the assertion -- even though both paths hold rq->__lock. The logical consequence is that both code paths end up holding a different rq->__lock. 
And looking through ttwu(), this is possible when the __block_task() 'p->on_rq = 0' store is visible to the ttwu() 'p->on_rq' load, which then assumes the task is not queued and continues to migrate it. Rearrange things such that __block_task() releases @p with the store and no code thereafter will use @p again. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: syzbot+0ec1e96c2cdf5c0e512a@syzkaller.appspotmail.com Reported-by: Kent Overstreet Signed-off-by: Peter Zijlstra (Intel) Tested-by: Marco Elver Link: https://lkml.kernel.org/r/20241023093641.GE16066@noisy.programming.kicks-ass.net --- kernel/sched/fair.c | 21 ++++++++++++++------- kernel/sched/sched.h | 34 ++++++++++++++++++++++++++++++++-- 2 files changed, 46 insertions(+), 9 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c157d4860a3b..879614686188 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5625,8 +5625,9 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) struct sched_entity *se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); - SCHED_WARN_ON(se->sched_delayed); - SCHED_WARN_ON(se->on_rq); + /* + * Must not reference @se again, see __block_task(). + */ return NULL; } return se; @@ -7176,7 +7177,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) /* Fix-up what dequeue_task_fair() skipped */ hrtick_update(rq); - /* Fix-up what block_task() skipped. */ + /* + * Fix-up what block_task() skipped. + * + * Must be last, @p might not be valid after this. + */ __block_task(rq, p); } @@ -7193,12 +7198,14 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE)))) util_est_dequeue(&rq->cfs, p); - if (dequeue_entities(rq, &p->se, flags) < 0) { - util_est_update(&rq->cfs, p, DEQUEUE_SLEEP); + util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); + if (dequeue_entities(rq, &p->se, flags) < 0) return false; - } - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); + /* + * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED). + */ + hrtick_update(rq); return true; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 081519ffab46..9f9d1cc390b1 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2769,8 +2769,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count) static inline void __block_task(struct rq *rq, struct task_struct *p) { - WRITE_ONCE(p->on_rq, 0); - ASSERT_EXCLUSIVE_WRITER(p->on_rq); if (p->sched_contributes_to_load) rq->nr_uninterruptible++; @@ -2778,6 +2776,38 @@ static inline void __block_task(struct rq *rq, struct task_struct *p) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } + + ASSERT_EXCLUSIVE_WRITER(p->on_rq); + + /* + * The moment this write goes through, ttwu() can swoop in and migrate + * this task, rendering our rq->__lock ineffective. + * + * __schedule() try_to_wake_up() + * LOCK rq->__lock LOCK p->pi_lock + * pick_next_task() + * pick_next_task_fair() + * pick_next_entity() + * dequeue_entities() + * __block_task() + * RELEASE p->on_rq = 0 if (p->on_rq && ...) + * break; + * + * ACQUIRE (after ctrl-dep) + * + * cpu = select_task_rq(); + * set_task_cpu(p, cpu); + * ttwu_queue() + * ttwu_do_activate() + * LOCK rq->__lock + * activate_task() + * STORE p->on_rq = 1 + * UNLOCK rq->__lock + * + * Callers must ensure to not reference @p after this -- we no longer + * own it. 
+ */ + smp_store_release(&p->on_rq, 0); } extern void activate_task(struct rq *rq, struct task_struct *p, int flags); -- cgit From 9b3c11a867a82ebee4e096008014417918b82801 Mon Sep 17 00:00:00 2001 From: Ihor Solodrai Date: Mon, 21 Oct 2024 23:16:52 +0000 Subject: selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ The runner.o may start building before libbpf headers are installed, and as a result build fails. This happened a couple of times on libbpf/ci test jobs: * https://github.com/libbpf/ci/actions/runs/11447667257/job/31849533100 * https://github.com/theihor/libbpf-ci/actions/runs/11445162764/job/31841649552 Headers are installed in a recipe for $(BPFOBJ) target, and adding an order-only dependency should ensure this doesn't happen. Signed-off-by: Ihor Solodrai Signed-off-by: Tejun Heo --- tools/testing/selftests/sched_ext/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 06ae9c107049..011762224600 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -184,7 +184,7 @@ auto-test-targets := \ testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) -$(SCXOBJ_DIR)/runner.o: runner.c | $(SCXOBJ_DIR) +$(SCXOBJ_DIR)/runner.o: runner.c | $(SCXOBJ_DIR) $(BPFOBJ) $(CC) $(CFLAGS) -c $< -o $@ # Create all of the test targets object files, whose testcase objects will be -- cgit From 06a130e42a5bfc84795464bff023bff4c16f58c5 Mon Sep 17 00:00:00 2001 From: Veronika Molnarova Date: Thu, 17 Oct 2024 18:15:55 +0200 Subject: perf test: Handle perftool-testsuite_probe failure due to broken DWARF Test case test_adding_blacklisted ends in failure if the blacklisted probe is of an assembler function with no DWARF available. At the same time, probing the blacklisted function with ASM DWARF doesn't test the blacklist itself as the failure is a result of the broken DWARF. When the broken DWARF output is encountered, check if the probed function was compiled by the assembler. If so, the broken DWARF message is expected and does not report a perf issue, else report a failure. If the ASM DWARF affected the probe, try the next probe on the blacklist. If the first 5 probes are defective due to broken DWARF, skip the test case. 
Fixes: def5480d63c1e847 ("perf testsuite probe: Add test for blacklisted kprobes handling") Signed-off-by: Veronika Molnarova Tested-by: Arnaldo Carvalho de Melo Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Ian Rogers Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Michael Petlan Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Veronika Molnarova Link: https://lore.kernel.org/r/20241017161555.236769-1-vmolnaro@redhat.com Signed-off-by: Arnaldo Carvalho de Melo --- .../shell/base_probe/test_adding_blacklisted.sh | 69 +++++++++++++++++----- 1 file changed, 54 insertions(+), 15 deletions(-) diff --git a/tools/perf/tests/shell/base_probe/test_adding_blacklisted.sh b/tools/perf/tests/shell/base_probe/test_adding_blacklisted.sh index b5dc10b2a738..bead723e34af 100755 --- a/tools/perf/tests/shell/base_probe/test_adding_blacklisted.sh +++ b/tools/perf/tests/shell/base_probe/test_adding_blacklisted.sh @@ -19,35 +19,74 @@ TEST_RESULT=0 # skip if not supported -BLACKFUNC=`head -n 1 /sys/kernel/debug/kprobes/blacklist 2> /dev/null | cut -f2` -if [ -z "$BLACKFUNC" ]; then +BLACKFUNC_LIST=`head -n 5 /sys/kernel/debug/kprobes/blacklist 2> /dev/null | cut -f2` +if [ -z "$BLACKFUNC_LIST" ]; then print_overall_skipped exit 0 fi +# try to find vmlinux with DWARF debug info +VMLINUX_FILE=$(perf probe -v random_probe |& grep "Using.*for symbols" | sed -r 's/^Using (.*) for symbols$/\1/') + # remove all previously added probes clear_all_probes ### adding blacklisted function - -# functions from blacklist should be skipped by perf probe -! $CMD_PERF probe $BLACKFUNC > $LOGS_DIR/adding_blacklisted.log 2> $LOGS_DIR/adding_blacklisted.err -PERF_EXIT_CODE=$? - REGEX_SCOPE_FAIL="Failed to find scope of probe point" REGEX_SKIP_MESSAGE=" is blacklisted function, skip it\." -REGEX_NOT_FOUND_MESSAGE="Probe point \'$BLACKFUNC\' not found." +REGEX_NOT_FOUND_MESSAGE="Probe point \'$RE_EVENT\' not found." REGEX_ERROR_MESSAGE="Error: Failed to add events." REGEX_INVALID_ARGUMENT="Failed to write event: Invalid argument" REGEX_SYMBOL_FAIL="Failed to find symbol at $RE_ADDRESS" -REGEX_OUT_SECTION="$BLACKFUNC is out of \.\w+, skip it" -../common/check_all_lines_matched.pl "$REGEX_SKIP_MESSAGE" "$REGEX_NOT_FOUND_MESSAGE" "$REGEX_ERROR_MESSAGE" "$REGEX_SCOPE_FAIL" "$REGEX_INVALID_ARGUMENT" "$REGEX_SYMBOL_FAIL" "$REGEX_OUT_SECTION" < $LOGS_DIR/adding_blacklisted.err -CHECK_EXIT_CODE=$? - -print_results $PERF_EXIT_CODE $CHECK_EXIT_CODE "adding blacklisted function $BLACKFUNC" -(( TEST_RESULT += $? )) - +REGEX_OUT_SECTION="$RE_EVENT is out of \.\w+, skip it" +REGEX_MISSING_DECL_LINE="A function DIE doesn't have decl_line. Maybe broken DWARF?" + +BLACKFUNC="" +SKIP_DWARF=0 + +for BLACKFUNC in $BLACKFUNC_LIST; do + echo "Probing $BLACKFUNC" + + # functions from blacklist should be skipped by perf probe + ! $CMD_PERF probe $BLACKFUNC > $LOGS_DIR/adding_blacklisted.log 2> $LOGS_DIR/adding_blacklisted.err + PERF_EXIT_CODE=$? + + # check for bad DWARF polluting the result + ../common/check_all_patterns_found.pl "$REGEX_MISSING_DECL_LINE" >/dev/null < $LOGS_DIR/adding_blacklisted.err + + if [ $? -eq 0 ]; then + SKIP_DWARF=1 + echo "Result polluted by broken DWARF, trying another probe" + + # confirm that the broken DWARF comes from assembler + if [ -n "$VMLINUX_FILE" ]; then + readelf -wi "$VMLINUX_FILE" | + awk -v probe="$BLACKFUNC" '/DW_AT_language/ { comp_lang = $0 } + $0 ~ probe { if (comp_lang) { print comp_lang }; exit }' | + grep -q "MIPS assembler" + + CHECK_EXIT_CODE=$? 
+ if [ $CHECK_EXIT_CODE -ne 0 ]; then + SKIP_DWARF=0 # broken DWARF while available + break + fi + fi + else + ../common/check_all_lines_matched.pl "$REGEX_SKIP_MESSAGE" "$REGEX_NOT_FOUND_MESSAGE" "$REGEX_ERROR_MESSAGE" "$REGEX_SCOPE_FAIL" "$REGEX_INVALID_ARGUMENT" "$REGEX_SYMBOL_FAIL" "$REGEX_OUT_SECTION" < $LOGS_DIR/adding_blacklisted.err + CHECK_EXIT_CODE=$? + + SKIP_DWARF=0 + break + fi +done + +if [ $SKIP_DWARF -eq 1 ]; then + print_testcase_skipped "adding blacklisted function $BLACKFUNC" +else + print_results $PERF_EXIT_CODE $CHECK_EXIT_CODE "adding blacklisted function $BLACKFUNC" + (( TEST_RESULT += $? )) +fi ### listing not-added probe -- cgit From 25f00a13dccf8e45441265768de46c8bf58e08f6 Mon Sep 17 00:00:00 2001 From: Frank Li Date: Wed, 23 Oct 2024 16:30:32 -0400 Subject: spi: spi-fsl-dspi: Fix crash when not using GPIO chip select Add check for the return value of spi_get_csgpiod() to avoid passing a NULL pointer to gpiod_direction_output(), preventing a crash when GPIO chip select is not used. Fix below crash: [ 4.251960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 4.260762] Mem abort info: [ 4.263556] ESR = 0x0000000096000004 [ 4.267308] EC = 0x25: DABT (current EL), IL = 32 bits [ 4.272624] SET = 0, FnV = 0 [ 4.275681] EA = 0, S1PTW = 0 [ 4.278822] FSC = 0x04: level 0 translation fault [ 4.283704] Data abort info: [ 4.286583] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 4.292074] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 4.297130] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 4.302445] [0000000000000000] user address but active_mm is swapper [ 4.308805] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 4.315072] Modules linked in: [ 4.318124] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc4-next-20241023-00008-ga20ec42c5fc1 #359 [ 4.328130] Hardware name: LS1046A QDS Board (DT) [ 4.332832] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 4.339794] pc : gpiod_direction_output+0x34/0x5c [ 4.344505] lr : gpiod_direction_output+0x18/0x5c [ 4.349208] sp : ffff80008003b8f0 [ 4.352517] x29: ffff80008003b8f0 x28: 0000000000000000 x27: ffffc96bcc7e9068 [ 4.359659] x26: ffffc96bcc6e00b0 x25: ffffc96bcc598398 x24: ffff447400132810 [ 4.366800] x23: 0000000000000000 x22: 0000000011e1a300 x21: 0000000000020002 [ 4.373940] x20: 0000000000000000 x19: 0000000000000000 x18: ffffffffffffffff [ 4.381081] x17: ffff44740016e600 x16: 0000000500000003 x15: 0000000000000007 [ 4.388221] x14: 0000000000989680 x13: 0000000000020000 x12: 000000000000001e [ 4.395362] x11: 0044b82fa09b5a53 x10: 0000000000000019 x9 : 0000000000000008 [ 4.402502] x8 : 0000000000000002 x7 : 0000000000000007 x6 : 0000000000000000 [ 4.409641] x5 : 0000000000000200 x4 : 0000000002000000 x3 : 0000000000000000 [ 4.416781] x2 : 0000000000022202 x1 : 0000000000000000 x0 : 0000000000000000 [ 4.423921] Call trace: [ 4.426362] gpiod_direction_output+0x34/0x5c (P) [ 4.431067] gpiod_direction_output+0x18/0x5c (L) [ 4.435771] dspi_setup+0x220/0x334 Fixes: 9e264f3f85a5 ("spi: Replace all spi->chip_select and spi->cs_gpiod references with function call") Cc: stable@vger.kernel.org Signed-off-by: Frank Li Link: https://patch.msgid.link/20241023203032.1388491-1-Frank.Li@nxp.com Signed-off-by: Mark Brown --- drivers/spi/spi-fsl-dspi.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/spi/spi-fsl-dspi.c b/drivers/spi/spi-fsl-dspi.c index 191de1917f83..3fa990fb59c7 100644 --- a/drivers/spi/spi-fsl-dspi.c +++ b/drivers/spi/spi-fsl-dspi.c 
@@ -1003,6 +1003,7 @@ static int dspi_setup(struct spi_device *spi) u32 cs_sck_delay = 0, sck_cs_delay = 0; struct fsl_dspi_platform_data *pdata; unsigned char pasc = 0, asc = 0; + struct gpio_desc *gpio_cs; struct chip_data *chip; unsigned long clkrate; bool cs = true; @@ -1077,7 +1078,10 @@ static int dspi_setup(struct spi_device *spi) chip->ctar_val |= SPI_CTAR_LSBFE; } - gpiod_direction_output(spi_get_csgpiod(spi, 0), false); + gpio_cs = spi_get_csgpiod(spi, 0); + if (gpio_cs) + gpiod_direction_output(gpio_cs, false); + dspi_deassert_cs(spi, &cs); spi_set_ctldata(spi, chip); -- cgit From 758f18158952a6287ac23679ec04c32d44ca5368 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Wed, 23 Oct 2024 16:12:57 -0300 Subject: perf python: Fix up the build on architectures without HAVE_KVM_STAT_SUPPORT MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Noticed while building on a raspbian arm 32-bit system. There was also this other case, fixed by adding a missing util/stat.h with the prototypes: /tmp/tmp.MbiSHoF3dj/perf-6.12.0-rc3/tools/perf/util/python.c:1396:6: error: no previous prototype for ‘perf_stat__set_no_csv_summary’ [-Werror=missing-prototypes] 1396 | void perf_stat__set_no_csv_summary(int set __maybe_unused) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /tmp/tmp.MbiSHoF3dj/perf-6.12.0-rc3/tools/perf/util/python.c:1400:6: error: no previous prototype for ‘perf_stat__set_big_num’ [-Werror=missing-prototypes] 1400 | void perf_stat__set_big_num(int set __maybe_unused) | ^~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors In other architectures this must be building due to some lucky indirect inclusion of that header. Fixes: 9dabf4003423c8d3 ("perf python: Switch module to linking libraries from building source") Reviewed-by: Ian Rogers Cc: Adrian Hunter Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Link: https://lore.kernel.org/lkml/ZxllAtpmEw5fg9oy@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/python.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c index 31a223eaf8e6..ee3d43a7ba45 100644 --- a/tools/perf/util/python.c +++ b/tools/perf/util/python.c @@ -19,6 +19,7 @@ #include "util/bpf-filter.h" #include "util/env.h" #include "util/kvm-stat.h" +#include "util/stat.h" #include "util/kwork.h" #include "util/sample.h" #include "util/lock-contention.h" @@ -1355,6 +1356,7 @@ error: unsigned int scripting_max_stack = PERF_MAX_STACK_DEPTH; +#ifdef HAVE_KVM_STAT_SUPPORT bool kvm_entry_event(struct evsel *evsel __maybe_unused) { return false; @@ -1384,6 +1386,7 @@ void exit_event_decode_key(struct perf_kvm_stat *kvm __maybe_unused, char *decode __maybe_unused) { } +#endif // HAVE_KVM_STAT_SUPPORT int find_scripts(char **scripts_array __maybe_unused, char **scripts_path_array __maybe_unused, int num __maybe_unused, int pathlen __maybe_unused) -- cgit From 6b51b9f65cec2c5246b06eec0334ba465ba357a8 Mon Sep 17 00:00:00 2001 From: Hongbo Li Date: Tue, 22 Oct 2024 09:38:12 +0800 Subject: doc: correcting the debug path for cachefiles The original debug path is under "/sys/modules", that's wrong. The real path in kernel is "/sys/module". So we can correct it. 
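For context, the "s"-less path comes from the module parameter mechanism: a parameter declared roughly as in the sketch below (generic names, not cachefiles' actual declaration) is what the kernel exposes under /sys/module/<module>/parameters/, hence /sys/module/cachefiles/parameters/debug.

  /* Generic sketch: a module parameter surfaces under
   * /sys/module/<module>/parameters/<name>, writable here by root (0644).
   */
  #include <linux/module.h>

  static unsigned int example_debug;
  module_param_named(debug, example_debug, uint, 0644);
  MODULE_PARM_DESC(debug, "Debugging mask");
  MODULE_LICENSE("GPL");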
Signed-off-by: Hongbo Li Link: https://lore.kernel.org/r/20241022013812.2880883-1-lihongbo22@huawei.com Signed-off-by: Christian Brauner --- Documentation/filesystems/caching/cachefiles.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst index e04a27bdbe19..b3ccc782cb3b 100644 --- a/Documentation/filesystems/caching/cachefiles.rst +++ b/Documentation/filesystems/caching/cachefiles.rst @@ -115,7 +115,7 @@ set up cache ready for use. The following script commands are available: This mask can also be set through sysfs, eg:: - echo 5 >/sys/modules/cachefiles/parameters/debug + echo 5 > /sys/module/cachefiles/parameters/debug Starting the Cache -- cgit From 247d65fb122ad560be1c8c4d87d7374fb28b0770 Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 23 Oct 2024 11:40:10 +0100 Subject: afs: Fix missing subdir edit when renamed between parent dirs When rename moves an AFS subdirectory between parent directories, the subdir also needs a bit of editing: the ".." entry needs updating to point to the new parent (though I don't make use of the info) and the DV needs incrementing by 1 to reflect the change of content. The server also sends a callback break notification on the subdirectory if we have one, but we can take care of recovering the promise next time we access the subdir. This can be triggered by something like: mount -t afs %example.com:xfstest.test20 /xfstest.test/ mkdir /xfstest.test/{aaa,bbb,aaa/ccc} touch /xfstest.test/bbb/ccc/d mv /xfstest.test/{aaa/ccc,bbb/ccc} touch /xfstest.test/bbb/ccc/e When the pathwalk for the second touch hits "ccc", kafs spots that the DV is incorrect and downloads it again (so the fix is not critical). Fix this, if the rename target is a directory and the old and new parents are different, by: (1) Incrementing the DV number of the target locally. (2) Editing the ".." entry in the target to refer to its new parent's vnode ID and uniquifier. Link: https://lore.kernel.org/r/3340431.1729680010@warthog.procyon.org.uk Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...") cc: David Howells cc: Marc Dionne cc: linux-afs@lists.infradead.org Signed-off-by: David Howells Signed-off-by: Christian Brauner --- fs/afs/dir.c | 25 +++++++++++++ fs/afs/dir_edit.c | 91 +++++++++++++++++++++++++++++++++++++++++++++- fs/afs/internal.h | 2 + include/trace/events/afs.h | 7 +++- 4 files changed, 122 insertions(+), 3 deletions(-) diff --git a/fs/afs/dir.c b/fs/afs/dir.c index f8622ed72e08..ada363af5aab 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include "internal.h" #include "afs_fs.h" @@ -1823,6 +1824,8 @@ error: static void afs_rename_success(struct afs_operation *op) { + struct afs_vnode *vnode = AFS_FS_I(d_inode(op->dentry)); + _enter("op=%08x", op->debug_id); op->ctime = op->file[0].scb.status.mtime_client; @@ -1832,6 +1835,22 @@ static void afs_rename_success(struct afs_operation *op) op->ctime = op->file[1].scb.status.mtime_client; afs_vnode_commit_status(op, &op->file[1]); } + + /* If we're moving a subdir between dirs, we need to update + * its DV counter too as the ".." will be altered. 
+ */ + if (S_ISDIR(vnode->netfs.inode.i_mode) && + op->file[0].vnode != op->file[1].vnode) { + u64 new_dv; + + write_seqlock(&vnode->cb_lock); + + new_dv = vnode->status.data_version + 1; + vnode->status.data_version = new_dv; + inode_set_iversion_raw(&vnode->netfs.inode, new_dv); + + write_sequnlock(&vnode->cb_lock); + } } static void afs_rename_edit_dir(struct afs_operation *op) @@ -1873,6 +1892,12 @@ static void afs_rename_edit_dir(struct afs_operation *op) &vnode->fid, afs_edit_dir_for_rename_2); } + if (S_ISDIR(vnode->netfs.inode.i_mode) && + new_dvnode != orig_dvnode && + test_bit(AFS_VNODE_DIR_VALID, &vnode->flags)) + afs_edit_dir_update_dotdot(vnode, new_dvnode, + afs_edit_dir_for_rename_sub); + new_inode = d_inode(new_dentry); if (new_inode) { spin_lock(&new_inode->i_lock); diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c index a71bff10496b..fe223fb78111 100644 --- a/fs/afs/dir_edit.c +++ b/fs/afs/dir_edit.c @@ -127,10 +127,10 @@ static struct folio *afs_dir_get_folio(struct afs_vnode *vnode, pgoff_t index) /* * Scan a directory block looking for a dirent of the right name. */ -static int afs_dir_scan_block(union afs_xdr_dir_block *block, struct qstr *name, +static int afs_dir_scan_block(const union afs_xdr_dir_block *block, const struct qstr *name, unsigned int blocknum) { - union afs_xdr_dirent *de; + const union afs_xdr_dirent *de; u64 bitmap; int d, len, n; @@ -492,3 +492,90 @@ error: clear_bit(AFS_VNODE_DIR_VALID, &vnode->flags); goto out_unmap; } + +/* + * Edit a subdirectory that has been moved between directories to update the + * ".." entry. + */ +void afs_edit_dir_update_dotdot(struct afs_vnode *vnode, struct afs_vnode *new_dvnode, + enum afs_edit_dir_reason why) +{ + union afs_xdr_dir_block *block; + union afs_xdr_dirent *de; + struct folio *folio; + unsigned int nr_blocks, b; + pgoff_t index; + loff_t i_size; + int slot; + + _enter(""); + + i_size = i_size_read(&vnode->netfs.inode); + if (i_size < AFS_DIR_BLOCK_SIZE) { + clear_bit(AFS_VNODE_DIR_VALID, &vnode->flags); + return; + } + nr_blocks = i_size / AFS_DIR_BLOCK_SIZE; + + /* Find a block that has sufficient slots available. Each folio + * contains two or more directory blocks. + */ + for (b = 0; b < nr_blocks; b++) { + index = b / AFS_DIR_BLOCKS_PER_PAGE; + folio = afs_dir_get_folio(vnode, index); + if (!folio) + goto error; + + block = kmap_local_folio(folio, b * AFS_DIR_BLOCK_SIZE - folio_pos(folio)); + + /* Abandon the edit if we got a callback break. */ + if (!test_bit(AFS_VNODE_DIR_VALID, &vnode->flags)) + goto invalidated; + + slot = afs_dir_scan_block(block, &dotdot_name, b); + if (slot >= 0) + goto found_dirent; + + kunmap_local(block); + folio_unlock(folio); + folio_put(folio); + } + + /* Didn't find the dirent to clobber. Download the directory again. 
*/ + trace_afs_edit_dir(vnode, why, afs_edit_dir_update_nodd, + 0, 0, 0, 0, ".."); + clear_bit(AFS_VNODE_DIR_VALID, &vnode->flags); + goto out; + +found_dirent: + de = &block->dirents[slot]; + de->u.vnode = htonl(new_dvnode->fid.vnode); + de->u.unique = htonl(new_dvnode->fid.unique); + + trace_afs_edit_dir(vnode, why, afs_edit_dir_update_dd, b, slot, + ntohl(de->u.vnode), ntohl(de->u.unique), ".."); + + kunmap_local(block); + folio_unlock(folio); + folio_put(folio); + inode_set_iversion_raw(&vnode->netfs.inode, vnode->status.data_version); + +out: + _leave(""); + return; + +invalidated: + kunmap_local(block); + folio_unlock(folio); + folio_put(folio); + trace_afs_edit_dir(vnode, why, afs_edit_dir_update_inval, + 0, 0, 0, 0, ".."); + clear_bit(AFS_VNODE_DIR_VALID, &vnode->flags); + goto out; + +error: + trace_afs_edit_dir(vnode, why, afs_edit_dir_update_error, + 0, 0, 0, 0, ".."); + clear_bit(AFS_VNODE_DIR_VALID, &vnode->flags); + goto out; +} diff --git a/fs/afs/internal.h b/fs/afs/internal.h index 6e1d3c4daf72..b306c0980870 100644 --- a/fs/afs/internal.h +++ b/fs/afs/internal.h @@ -1072,6 +1072,8 @@ extern void afs_check_for_remote_deletion(struct afs_operation *); extern void afs_edit_dir_add(struct afs_vnode *, struct qstr *, struct afs_fid *, enum afs_edit_dir_reason); extern void afs_edit_dir_remove(struct afs_vnode *, struct qstr *, enum afs_edit_dir_reason); +void afs_edit_dir_update_dotdot(struct afs_vnode *vnode, struct afs_vnode *new_dvnode, + enum afs_edit_dir_reason why); /* * dir_silly.c diff --git a/include/trace/events/afs.h b/include/trace/events/afs.h index 450c44c83a5d..a0aed1a428a1 100644 --- a/include/trace/events/afs.h +++ b/include/trace/events/afs.h @@ -331,7 +331,11 @@ enum yfs_cm_operation { EM(afs_edit_dir_delete, "delete") \ EM(afs_edit_dir_delete_error, "d_err ") \ EM(afs_edit_dir_delete_inval, "d_invl") \ - E_(afs_edit_dir_delete_noent, "d_nent") + EM(afs_edit_dir_delete_noent, "d_nent") \ + EM(afs_edit_dir_update_dd, "u_ddot") \ + EM(afs_edit_dir_update_error, "u_fail") \ + EM(afs_edit_dir_update_inval, "u_invl") \ + E_(afs_edit_dir_update_nodd, "u_nodd") #define afs_edit_dir_reasons \ EM(afs_edit_dir_for_create, "Create") \ @@ -340,6 +344,7 @@ enum yfs_cm_operation { EM(afs_edit_dir_for_rename_0, "Renam0") \ EM(afs_edit_dir_for_rename_1, "Renam1") \ EM(afs_edit_dir_for_rename_2, "Renam2") \ + EM(afs_edit_dir_for_rename_sub, "RnmSub") \ EM(afs_edit_dir_for_rmdir, "RmDir ") \ EM(afs_edit_dir_for_silly_0, "S_Ren0") \ EM(afs_edit_dir_for_silly_1, "S_Ren1") \ -- cgit From e65a0dc1cabe71b91ef5603e5814359451b74ca7 Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 23 Oct 2024 11:07:05 +0100 Subject: iov_iter: Fix iov_iter_get_pages*() for folio_queue p9_get_mapped_pages() uses iov_iter_get_pages_alloc2() to extract pages from an iterator when performing a zero-copy request and under some circumstances, this crashes with odd page errors[1], for example, I see: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbcf0 flags: 0x2000000000000000(zone=1) ... page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u)) ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:1444! This is because, unlike in iov_iter_extract_folioq_pages(), the iter_folioq_get_pages() helper function doesn't skip the current folio when iov_offset points to the end of it, but rather extracts the next page beyond the end of the folio and adds it to the list. 
Reading will then clobber the contents of this page, leading to system corruption, and if the page is not in use, put_page() may try to clean up the unused page. This can be worked around by copying the iterator before each extraction[2] and using iov_iter_advance() on the original as the advance function steps over the page we're at the end of. Fix this by skipping the page extraction if we're at the end of the folio. This was reproduced in the ktest environment[3] by forcing 9p to use the fscache caching mode and then reading a file through 9p. Fixes: db0aa2e9566f ("mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios") Reported-by: Antony Antony Closes: https://lore.kernel.org/r/ZxFQw4OI9rrc7UYc@Antony2201.local/ Signed-off-by: David Howells cc: Eric Van Hensbergen cc: Latchesar Ionkov cc: Dominique Martinet cc: Christian Schoenebeck cc: v9fs@lists.linux.dev cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/ZxFEi1Tod43pD6JC@moon.secunet.de/ [1] Link: https://lore.kernel.org/r/2299159.1729543103@warthog.procyon.org.uk/ [2] Link: https://github.com/koverstreet/ktest.git [3] Tested-by: Antony Antony Link: https://lore.kernel.org/r/3327438.1729678025@warthog.procyon.org.uk Signed-off-by: Christian Brauner --- lib/iov_iter.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 1abb32c0da50..cc4b5541eef8 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1021,15 +1021,18 @@ static ssize_t iter_folioq_get_pages(struct iov_iter *iter, size_t offset = iov_offset, fsize = folioq_folio_size(folioq, slot); size_t part = PAGE_SIZE - offset % PAGE_SIZE; - part = umin(part, umin(maxsize - extracted, fsize - offset)); - count -= part; - iov_offset += part; - extracted += part; - - *pages = folio_page(folio, offset / PAGE_SIZE); - get_page(*pages); - pages++; - maxpages--; + if (offset < fsize) { + part = umin(part, umin(maxsize - extracted, fsize - offset)); + count -= part; + iov_offset += part; + extracted += part; + + *pages = folio_page(folio, offset / PAGE_SIZE); + get_page(*pages); + pages++; + maxpages--; + } + if (maxpages == 0 || extracted >= maxsize) break; -- cgit From 08a7d2525511ba07b8ab3dfb472a9d3df4c40f79 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Thu, 24 Oct 2024 10:19:06 -0300 Subject: tools arch x86: Sync the msr-index.h copy with the kernel sources To pick up the changes from these csets: dc1e67f70f6d4e33 ("KVM VMX: Move MSR_IA32_VMX_MISC bit defines to asm/vmx.h") d7bfc9ffd58037ff ("KVM: VMX: Move MSR_IA32_VMX_BASIC bit defines to asm/vmx.h") beb2e446046f8dd9 ("x86/cpu: KVM: Move macro to encode PAT value to common header") e7e80b66fb242a63 ("x86/cpu: KVM: Add common defines for architectural memory types (PAT, MTRRs, etc.)") That cause no changes to tooling: $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > before $ cp arch/x86/include/asm/msr-index.h tools/arch/x86/include/asm/msr-index.h $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > after $ diff -u before after $ To see how this works take a look at this previous update: https://git.kernel.org/torvalds/c/174372668933ede5 174372668933ede5 ("tools arch x86: Sync the msr-index.h copy with the kernel sources to pick IA32_MKTME_KEYID_PARTITIONING") Just silences this perf build warning: Warning: Kernel ABI header differences: diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h Please see tools/include/uapi/README for further details. 
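For reference, the new X86_MEMTYPE_* constants and the PAT_VALUE() helper visible in the hunk below pack eight 8-bit memory-type fields into one 64-bit PAT value; the stand-alone snippet here only restates those definitions to show the packing and is not part of the sync itself.

  #include <stdio.h>
  #include <stdint.h>

  /* Copies of the synced definitions, for illustration only. */
  #define X86_MEMTYPE_UC        0ull
  #define X86_MEMTYPE_WC        1ull
  #define X86_MEMTYPE_WT        4ull
  #define X86_MEMTYPE_WP        5ull
  #define X86_MEMTYPE_WB        6ull
  #define X86_MEMTYPE_UC_MINUS  7ull

  #define PAT_VALUE(p0, p1, p2, p3, p4, p5, p6, p7)                  \
          ((X86_MEMTYPE_ ## p0) | (X86_MEMTYPE_ ## p1 << 8) |        \
           (X86_MEMTYPE_ ## p2 << 16) | (X86_MEMTYPE_ ## p3 << 24) | \
           (X86_MEMTYPE_ ## p4 << 32) | (X86_MEMTYPE_ ## p5 << 40) | \
           (X86_MEMTYPE_ ## p6 << 48) | (X86_MEMTYPE_ ## p7 << 56))

  int main(void)
  {
          /* Hypothetical PAT layout: one 8-bit memory type per slot. */
          uint64_t pat = PAT_VALUE(WB, WC, UC_MINUS, UC, WB, WP, UC_MINUS, WT);

          printf("PAT = %#018llx\n", (unsigned long long)pat);
          return 0;
  }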
Cc: Adrian Hunter Cc: Ian Rogers Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Cc: Sean Christopherson Cc: Xin Li Link: https://lore.kernel.org/lkml/ZxpLSBzGin3vjs3b@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/arch/x86/include/asm/msr-index.h | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h index a7c06a46fb76..3ae84c3b8e6d 100644 --- a/tools/arch/x86/include/asm/msr-index.h +++ b/tools/arch/x86/include/asm/msr-index.h @@ -36,6 +36,20 @@ #define EFER_FFXSR (1<<_EFER_FFXSR) #define EFER_AUTOIBRS (1<<_EFER_AUTOIBRS) +/* + * Architectural memory types that are common to MTRRs, PAT, VMX MSRs, etc. + * Most MSRs support/allow only a subset of memory types, but the values + * themselves are common across all relevant MSRs. + */ +#define X86_MEMTYPE_UC 0ull /* Uncacheable, a.k.a. Strong Uncacheable */ +#define X86_MEMTYPE_WC 1ull /* Write Combining */ +/* RESERVED 2 */ +/* RESERVED 3 */ +#define X86_MEMTYPE_WT 4ull /* Write Through */ +#define X86_MEMTYPE_WP 5ull /* Write Protected */ +#define X86_MEMTYPE_WB 6ull /* Write Back */ +#define X86_MEMTYPE_UC_MINUS 7ull /* Weak Uncacheabled (PAT only) */ + /* FRED MSRs */ #define MSR_IA32_FRED_RSP0 0x1cc /* Level 0 stack pointer */ #define MSR_IA32_FRED_RSP1 0x1cd /* Level 1 stack pointer */ @@ -365,6 +379,12 @@ #define MSR_IA32_CR_PAT 0x00000277 +#define PAT_VALUE(p0, p1, p2, p3, p4, p5, p6, p7) \ + ((X86_MEMTYPE_ ## p0) | (X86_MEMTYPE_ ## p1 << 8) | \ + (X86_MEMTYPE_ ## p2 << 16) | (X86_MEMTYPE_ ## p3 << 24) | \ + (X86_MEMTYPE_ ## p4 << 32) | (X86_MEMTYPE_ ## p5 << 40) | \ + (X86_MEMTYPE_ ## p6 << 48) | (X86_MEMTYPE_ ## p7 << 56)) + #define MSR_IA32_DEBUGCTLMSR 0x000001d9 #define MSR_IA32_LASTBRANCHFROMIP 0x000001db #define MSR_IA32_LASTBRANCHTOIP 0x000001dc @@ -1159,15 +1179,6 @@ #define MSR_IA32_VMX_VMFUNC 0x00000491 #define MSR_IA32_VMX_PROCBASED_CTLS3 0x00000492 -/* VMX_BASIC bits and bitmasks */ -#define VMX_BASIC_VMCS_SIZE_SHIFT 32 -#define VMX_BASIC_TRUE_CTLS (1ULL << 55) -#define VMX_BASIC_64 0x0001000000000000LLU -#define VMX_BASIC_MEM_TYPE_SHIFT 50 -#define VMX_BASIC_MEM_TYPE_MASK 0x003c000000000000LLU -#define VMX_BASIC_MEM_TYPE_WB 6LLU -#define VMX_BASIC_INOUT 0x0040000000000000LLU - /* Resctrl MSRs: */ /* - Intel: */ #define MSR_IA32_L3_QOS_CFG 0xc81 @@ -1185,11 +1196,6 @@ #define MSR_IA32_SMBA_BW_BASE 0xc0000280 #define MSR_IA32_EVT_CFG_BASE 0xc0000400 -/* MSR_IA32_VMX_MISC bits */ -#define MSR_IA32_VMX_MISC_INTEL_PT (1ULL << 14) -#define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29) -#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F - /* AMD-V MSRs */ #define MSR_VM_CR 0xc0010114 #define MSR_VM_IGNNE 0xc0010115 -- cgit From a85df8c7b5ee2d3d4823befada42c5c41aff4cb0 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Fri, 13 Sep 2024 17:34:54 +0300 Subject: drm/tegra: Fix NULL vs IS_ERR() check in probe() The iommu_paging_domain_alloc() function doesn't return NULL pointers, it returns error pointers. Update the check to match. 
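A minimal user-space sketch of the convention involved: such allocators report failure by encoding a negative errno in the returned pointer, so only an IS_ERR() test catches it and a NULL check never fires. The IS_ERR()/ERR_PTR() helpers below mirror include/linux/err.h for illustration, and alloc_domain() is a stand-in, not the real API.

  #include <stdio.h>

  #define MAX_ERRNO     4095
  #define ERR_PTR(err)  ((void *)(long)(err))
  #define IS_ERR(ptr)   ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)
  #define PTR_ERR(ptr)  ((long)(ptr))

  /* Stand-in for an allocator that fails like iommu_paging_domain_alloc(). */
  static void *alloc_domain(int fail)
  {
          return fail ? ERR_PTR(-12 /* ENOMEM */) : (void *)0x1000;
  }

  int main(void)
  {
          void *domain = alloc_domain(1);

          if (!domain)            /* wrong: never true for an ERR_PTR() return */
                  printf("NULL check never fires\n");
          if (IS_ERR(domain))     /* correct: error-pointer check */
                  printf("alloc failed: %ld\n", PTR_ERR(domain));
          return 0;
  }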
Fixes: 45c690aea8ee ("drm/tegra: Use iommu_paging_domain_alloc()") Signed-off-by: Dan Carpenter Reviewed-by: Lu Baolu Signed-off-by: Thierry Reding Link: https://patchwork.freedesktop.org/patch/msgid/ba31cf3a-af3d-4ff1-87a8-f05aaf8c780b@stanley.mountain --- drivers/gpu/drm/tegra/drm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c index c9eb329665ec..34d22ba210b0 100644 --- a/drivers/gpu/drm/tegra/drm.c +++ b/drivers/gpu/drm/tegra/drm.c @@ -1153,8 +1153,8 @@ static int host1x_drm_probe(struct host1x_device *dev) if (host1x_drm_wants_iommu(dev) && device_iommu_mapped(dma_dev)) { tegra->domain = iommu_paging_domain_alloc(dma_dev); - if (!tegra->domain) { - err = -ENOMEM; + if (IS_ERR(tegra->domain)) { + err = PTR_ERR(tegra->domain); goto free; } -- cgit From 4f7f417042b242c1e5a9ed03741acb5d900e0871 Mon Sep 17 00:00:00 2001 From: Vishal Chourasia Date: Thu, 24 Oct 2024 10:46:09 +0530 Subject: sched_ext: Fix function pointer type mismatches in BPF selftests Fix incompatible function pointer type warnings in sched_ext BPF selftests by explicitly casting the function pointers when initializing struct_ops. This addresses multiple -Wincompatible-function-pointer-types warnings from the clang compiler where function signatures didn't match exactly. The void * cast ensures the compiler accepts the function pointer assignment despite minor type differences in the parameters. Signed-off-by: Vishal Chourasia Signed-off-by: Tejun Heo --- tools/testing/selftests/sched_ext/create_dsq.bpf.c | 6 +-- .../selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c | 4 +- .../selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c | 4 +- .../testing/selftests/sched_ext/dsp_local_on.bpf.c | 8 +-- .../selftests/sched_ext/enq_select_cpu_fails.bpf.c | 4 +- tools/testing/selftests/sched_ext/exit.bpf.c | 14 +++--- tools/testing/selftests/sched_ext/hotplug.bpf.c | 8 +-- .../selftests/sched_ext/init_enable_count.bpf.c | 8 +-- tools/testing/selftests/sched_ext/maximal.bpf.c | 58 +++++++++++----------- tools/testing/selftests/sched_ext/maybe_null.bpf.c | 6 +-- .../selftests/sched_ext/maybe_null_fail_dsp.bpf.c | 4 +- .../selftests/sched_ext/maybe_null_fail_yld.bpf.c | 4 +- tools/testing/selftests/sched_ext/prog_run.bpf.c | 2 +- .../selftests/sched_ext/select_cpu_dfl.bpf.c | 2 +- .../sched_ext/select_cpu_dfl_nodispatch.bpf.c | 6 +-- .../selftests/sched_ext/select_cpu_dispatch.bpf.c | 2 +- .../sched_ext/select_cpu_dispatch_bad_dsq.bpf.c | 4 +- .../sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c | 4 +- .../selftests/sched_ext/select_cpu_vtime.bpf.c | 12 ++--- 19 files changed, 80 insertions(+), 80 deletions(-) diff --git a/tools/testing/selftests/sched_ext/create_dsq.bpf.c b/tools/testing/selftests/sched_ext/create_dsq.bpf.c index 23f79ed343f0..2cfc4ffd60e2 100644 --- a/tools/testing/selftests/sched_ext/create_dsq.bpf.c +++ b/tools/testing/selftests/sched_ext/create_dsq.bpf.c @@ -51,8 +51,8 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init) SEC(".struct_ops.link") struct sched_ext_ops create_dsq_ops = { - .init_task = create_dsq_init_task, - .exit_task = create_dsq_exit_task, - .init = create_dsq_init, + .init_task = (void *) create_dsq_init_task, + .exit_task = (void *) create_dsq_exit_task, + .init = (void *) create_dsq_init, .name = "create_dsq", }; diff --git a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c index e97ad41d354a..37d9bf6fb745 100644 --- 
a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c +++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c @@ -35,8 +35,8 @@ void BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_exit, struct scx_exit_info *ei) SEC(".struct_ops.link") struct sched_ext_ops ddsp_bogus_dsq_fail_ops = { - .select_cpu = ddsp_bogus_dsq_fail_select_cpu, - .exit = ddsp_bogus_dsq_fail_exit, + .select_cpu = (void *) ddsp_bogus_dsq_fail_select_cpu, + .exit = (void *) ddsp_bogus_dsq_fail_exit, .name = "ddsp_bogus_dsq_fail", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c index dde7e7dafbfb..dffc97d9cdf1 100644 --- a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c +++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c @@ -32,8 +32,8 @@ void BPF_STRUCT_OPS(ddsp_vtimelocal_fail_exit, struct scx_exit_info *ei) SEC(".struct_ops.link") struct sched_ext_ops ddsp_vtimelocal_fail_ops = { - .select_cpu = ddsp_vtimelocal_fail_select_cpu, - .exit = ddsp_vtimelocal_fail_exit, + .select_cpu = (void *) ddsp_vtimelocal_fail_select_cpu, + .exit = (void *) ddsp_vtimelocal_fail_exit, .name = "ddsp_vtimelocal_fail", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c index efb4672decb4..6a7db1502c29 100644 --- a/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c +++ b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c @@ -56,10 +56,10 @@ void BPF_STRUCT_OPS(dsp_local_on_exit, struct scx_exit_info *ei) SEC(".struct_ops.link") struct sched_ext_ops dsp_local_on_ops = { - .select_cpu = dsp_local_on_select_cpu, - .enqueue = dsp_local_on_enqueue, - .dispatch = dsp_local_on_dispatch, - .exit = dsp_local_on_exit, + .select_cpu = (void *) dsp_local_on_select_cpu, + .enqueue = (void *) dsp_local_on_enqueue, + .dispatch = (void *) dsp_local_on_dispatch, + .exit = (void *) dsp_local_on_exit, .name = "dsp_local_on", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c index b3dfc1033cd6..1efb50d61040 100644 --- a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c +++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c @@ -36,8 +36,8 @@ void BPF_STRUCT_OPS(enq_select_cpu_fails_enqueue, struct task_struct *p, SEC(".struct_ops.link") struct sched_ext_ops enq_select_cpu_fails_ops = { - .select_cpu = enq_select_cpu_fails_select_cpu, - .enqueue = enq_select_cpu_fails_enqueue, + .select_cpu = (void *) enq_select_cpu_fails_select_cpu, + .enqueue = (void *) enq_select_cpu_fails_enqueue, .name = "enq_select_cpu_fails", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/exit.bpf.c b/tools/testing/selftests/sched_ext/exit.bpf.c index ae12ddaac921..bf79ccd55f8f 100644 --- a/tools/testing/selftests/sched_ext/exit.bpf.c +++ b/tools/testing/selftests/sched_ext/exit.bpf.c @@ -72,13 +72,13 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(exit_init) SEC(".struct_ops.link") struct sched_ext_ops exit_ops = { - .select_cpu = exit_select_cpu, - .enqueue = exit_enqueue, - .dispatch = exit_dispatch, - .init_task = exit_init_task, - .enable = exit_enable, - .exit = exit_exit, - .init = exit_init, + .select_cpu = (void *) exit_select_cpu, + .enqueue = (void *) exit_enqueue, + .dispatch = (void *) exit_dispatch, + .init_task = (void *) exit_init_task, + .enable = (void *) exit_enable, + .exit = (void *) 
exit_exit, + .init = (void *) exit_init, .name = "exit", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/hotplug.bpf.c b/tools/testing/selftests/sched_ext/hotplug.bpf.c index 8f2601db39f3..6c9f25c9bf53 100644 --- a/tools/testing/selftests/sched_ext/hotplug.bpf.c +++ b/tools/testing/selftests/sched_ext/hotplug.bpf.c @@ -46,16 +46,16 @@ void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_offline, s32 cpu) SEC(".struct_ops.link") struct sched_ext_ops hotplug_cb_ops = { - .cpu_online = hotplug_cpu_online, - .cpu_offline = hotplug_cpu_offline, - .exit = hotplug_exit, + .cpu_online = (void *) hotplug_cpu_online, + .cpu_offline = (void *) hotplug_cpu_offline, + .exit = (void *) hotplug_exit, .name = "hotplug_cbs", .timeout_ms = 1000U, }; SEC(".struct_ops.link") struct sched_ext_ops hotplug_nocb_ops = { - .exit = hotplug_exit, + .exit = (void *) hotplug_exit, .name = "hotplug_nocbs", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/init_enable_count.bpf.c b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c index 47ea89a626c3..5eb9edb1837d 100644 --- a/tools/testing/selftests/sched_ext/init_enable_count.bpf.c +++ b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c @@ -45,9 +45,9 @@ void BPF_STRUCT_OPS(cnt_disable, struct task_struct *p) SEC(".struct_ops.link") struct sched_ext_ops init_enable_count_ops = { - .init_task = cnt_init_task, - .exit_task = cnt_exit_task, - .enable = cnt_enable, - .disable = cnt_disable, + .init_task = (void *) cnt_init_task, + .exit_task = (void *) cnt_exit_task, + .enable = (void *) cnt_enable, + .disable = (void *) cnt_disable, .name = "init_enable_count", }; diff --git a/tools/testing/selftests/sched_ext/maximal.bpf.c b/tools/testing/selftests/sched_ext/maximal.bpf.c index 00bfa9cb95d3..4d4cd8d966db 100644 --- a/tools/testing/selftests/sched_ext/maximal.bpf.c +++ b/tools/testing/selftests/sched_ext/maximal.bpf.c @@ -131,34 +131,34 @@ void BPF_STRUCT_OPS(maximal_exit, struct scx_exit_info *info) SEC(".struct_ops.link") struct sched_ext_ops maximal_ops = { - .select_cpu = maximal_select_cpu, - .enqueue = maximal_enqueue, - .dequeue = maximal_dequeue, - .dispatch = maximal_dispatch, - .runnable = maximal_runnable, - .running = maximal_running, - .stopping = maximal_stopping, - .quiescent = maximal_quiescent, - .yield = maximal_yield, - .core_sched_before = maximal_core_sched_before, - .set_weight = maximal_set_weight, - .set_cpumask = maximal_set_cpumask, - .update_idle = maximal_update_idle, - .cpu_acquire = maximal_cpu_acquire, - .cpu_release = maximal_cpu_release, - .cpu_online = maximal_cpu_online, - .cpu_offline = maximal_cpu_offline, - .init_task = maximal_init_task, - .enable = maximal_enable, - .exit_task = maximal_exit_task, - .disable = maximal_disable, - .cgroup_init = maximal_cgroup_init, - .cgroup_exit = maximal_cgroup_exit, - .cgroup_prep_move = maximal_cgroup_prep_move, - .cgroup_move = maximal_cgroup_move, - .cgroup_cancel_move = maximal_cgroup_cancel_move, - .cgroup_set_weight = maximal_cgroup_set_weight, - .init = maximal_init, - .exit = maximal_exit, + .select_cpu = (void *) maximal_select_cpu, + .enqueue = (void *) maximal_enqueue, + .dequeue = (void *) maximal_dequeue, + .dispatch = (void *) maximal_dispatch, + .runnable = (void *) maximal_runnable, + .running = (void *) maximal_running, + .stopping = (void *) maximal_stopping, + .quiescent = (void *) maximal_quiescent, + .yield = (void *) maximal_yield, + .core_sched_before = (void *) maximal_core_sched_before, + .set_weight = (void *) 
maximal_set_weight, + .set_cpumask = (void *) maximal_set_cpumask, + .update_idle = (void *) maximal_update_idle, + .cpu_acquire = (void *) maximal_cpu_acquire, + .cpu_release = (void *) maximal_cpu_release, + .cpu_online = (void *) maximal_cpu_online, + .cpu_offline = (void *) maximal_cpu_offline, + .init_task = (void *) maximal_init_task, + .enable = (void *) maximal_enable, + .exit_task = (void *) maximal_exit_task, + .disable = (void *) maximal_disable, + .cgroup_init = (void *) maximal_cgroup_init, + .cgroup_exit = (void *) maximal_cgroup_exit, + .cgroup_prep_move = (void *) maximal_cgroup_prep_move, + .cgroup_move = (void *) maximal_cgroup_move, + .cgroup_cancel_move = (void *) maximal_cgroup_cancel_move, + .cgroup_set_weight = (void *) maximal_cgroup_set_weight, + .init = (void *) maximal_init, + .exit = (void *) maximal_exit, .name = "maximal", }; diff --git a/tools/testing/selftests/sched_ext/maybe_null.bpf.c b/tools/testing/selftests/sched_ext/maybe_null.bpf.c index 27d0f386acfb..cf4ae870cd4e 100644 --- a/tools/testing/selftests/sched_ext/maybe_null.bpf.c +++ b/tools/testing/selftests/sched_ext/maybe_null.bpf.c @@ -29,8 +29,8 @@ bool BPF_STRUCT_OPS(maybe_null_success_yield, struct task_struct *from, SEC(".struct_ops.link") struct sched_ext_ops maybe_null_success = { - .dispatch = maybe_null_success_dispatch, - .yield = maybe_null_success_yield, - .enable = maybe_null_running, + .dispatch = (void *) maybe_null_success_dispatch, + .yield = (void *) maybe_null_success_yield, + .enable = (void *) maybe_null_running, .name = "minimal", }; diff --git a/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c b/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c index c0641050271d..ec724d7b33d1 100644 --- a/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c +++ b/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c @@ -19,7 +19,7 @@ void BPF_STRUCT_OPS(maybe_null_fail_dispatch, s32 cpu, struct task_struct *p) SEC(".struct_ops.link") struct sched_ext_ops maybe_null_fail = { - .dispatch = maybe_null_fail_dispatch, - .enable = maybe_null_running, + .dispatch = (void *) maybe_null_fail_dispatch, + .enable = (void *) maybe_null_running, .name = "maybe_null_fail_dispatch", }; diff --git a/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c b/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c index 3c1740028e3b..e6552cace020 100644 --- a/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c +++ b/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c @@ -22,7 +22,7 @@ bool BPF_STRUCT_OPS(maybe_null_fail_yield, struct task_struct *from, SEC(".struct_ops.link") struct sched_ext_ops maybe_null_fail = { - .yield = maybe_null_fail_yield, - .enable = maybe_null_running, + .yield = (void *) maybe_null_fail_yield, + .enable = (void *) maybe_null_running, .name = "maybe_null_fail_yield", }; diff --git a/tools/testing/selftests/sched_ext/prog_run.bpf.c b/tools/testing/selftests/sched_ext/prog_run.bpf.c index 6a4d7c48e3f2..00c267626a68 100644 --- a/tools/testing/selftests/sched_ext/prog_run.bpf.c +++ b/tools/testing/selftests/sched_ext/prog_run.bpf.c @@ -28,6 +28,6 @@ void BPF_STRUCT_OPS(prog_run_exit, struct scx_exit_info *ei) SEC(".struct_ops.link") struct sched_ext_ops prog_run_ops = { - .exit = prog_run_exit, + .exit = (void *) prog_run_exit, .name = "prog_run", }; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c index 2ed2991afafe..f171ac470970 100644 --- 
a/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c @@ -35,6 +35,6 @@ void BPF_STRUCT_OPS(select_cpu_dfl_enqueue, struct task_struct *p, SEC(".struct_ops.link") struct sched_ext_ops select_cpu_dfl_ops = { - .enqueue = select_cpu_dfl_enqueue, + .enqueue = (void *) select_cpu_dfl_enqueue, .name = "select_cpu_dfl", }; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c index 4bb5abb2d369..9efdbb7da928 100644 --- a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c @@ -82,8 +82,8 @@ s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_init_task, SEC(".struct_ops.link") struct sched_ext_ops select_cpu_dfl_nodispatch_ops = { - .select_cpu = select_cpu_dfl_nodispatch_select_cpu, - .enqueue = select_cpu_dfl_nodispatch_enqueue, - .init_task = select_cpu_dfl_nodispatch_init_task, + .select_cpu = (void *) select_cpu_dfl_nodispatch_select_cpu, + .enqueue = (void *) select_cpu_dfl_nodispatch_enqueue, + .init_task = (void *) select_cpu_dfl_nodispatch_init_task, .name = "select_cpu_dfl_nodispatch", }; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c index f0b96a4a04b2..59bfc4f36167 100644 --- a/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c @@ -35,7 +35,7 @@ dispatch: SEC(".struct_ops.link") struct sched_ext_ops select_cpu_dispatch_ops = { - .select_cpu = select_cpu_dispatch_select_cpu, + .select_cpu = (void *) select_cpu_dispatch_select_cpu, .name = "select_cpu_dispatch", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c index 7b42ddce0f56..3bbd5fcdfb18 100644 --- a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c @@ -30,8 +30,8 @@ void BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_exit, struct scx_exit_info *ei) SEC(".struct_ops.link") struct sched_ext_ops select_cpu_dispatch_bad_dsq_ops = { - .select_cpu = select_cpu_dispatch_bad_dsq_select_cpu, - .exit = select_cpu_dispatch_bad_dsq_exit, + .select_cpu = (void *) select_cpu_dispatch_bad_dsq_select_cpu, + .exit = (void *) select_cpu_dispatch_bad_dsq_exit, .name = "select_cpu_dispatch_bad_dsq", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c index 653e3dc0b4dc..0fda57fe0ecf 100644 --- a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c @@ -31,8 +31,8 @@ void BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_exit, struct scx_exit_info *ei) SEC(".struct_ops.link") struct sched_ext_ops select_cpu_dispatch_dbl_dsp_ops = { - .select_cpu = select_cpu_dispatch_dbl_dsp_select_cpu, - .exit = select_cpu_dispatch_dbl_dsp_exit, + .select_cpu = (void *) select_cpu_dispatch_dbl_dsp_select_cpu, + .exit = (void *) select_cpu_dispatch_dbl_dsp_exit, .name = "select_cpu_dispatch_dbl_dsp", .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c index 
7f3ebf4fc2ea..e6c67bcf5e6e 100644 --- a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c +++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c @@ -81,12 +81,12 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init) SEC(".struct_ops.link") struct sched_ext_ops select_cpu_vtime_ops = { - .select_cpu = select_cpu_vtime_select_cpu, - .dispatch = select_cpu_vtime_dispatch, - .running = select_cpu_vtime_running, - .stopping = select_cpu_vtime_stopping, - .enable = select_cpu_vtime_enable, - .init = select_cpu_vtime_init, + .select_cpu = (void *) select_cpu_vtime_select_cpu, + .dispatch = (void *) select_cpu_vtime_dispatch, + .running = (void *) select_cpu_vtime_running, + .stopping = (void *) select_cpu_vtime_stopping, + .enable = (void *) select_cpu_vtime_enable, + .init = (void *) select_cpu_vtime_init, .name = "select_cpu_vtime", .timeout_ms = 1000U, }; -- cgit From 63dd163cd61dda6f38343776b42331cc6b7e56e0 Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Wed, 16 Oct 2024 19:04:31 +0200 Subject: iio: light: veml6030: fix microlux value calculation The raw value conversion to obtain a measurement in lux as INT_PLUS_MICRO does not calculate the decimal part properly to display it as micro (in this case microlux). It only calculates the modulo to obtain the decimal part from a resolution that is 10000 times the one provided in the datasheet (0.5376 lux/cnt for the veml6030). The resulting value must still be multiplied by 100 to make it micro. This bug was introduced with the original implementation of the driver. Only the illuminance channel is fixed because the scale is nonsensical for the intensity channels anyway. Cc: stable@vger.kernel.org Fixes: 7b779f573c48 ("iio: light: add driver for veml6030 ambient light sensor") Signed-off-by: Javier Carrasco Link: https://patch.msgid.link/20241016-veml6030-fix-processed-micro-v1-1-4a5644796437@gmail.com Signed-off-by: Jonathan Cameron --- drivers/iio/light/veml6030.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iio/light/veml6030.c b/drivers/iio/light/veml6030.c index 9630de1c578e..621428885455 100644 --- a/drivers/iio/light/veml6030.c +++ b/drivers/iio/light/veml6030.c @@ -522,7 +522,7 @@ static int veml6030_read_raw(struct iio_dev *indio_dev, } if (mask == IIO_CHAN_INFO_PROCESSED) { *val = (reg * data->cur_resolution) / 10000; - *val2 = (reg * data->cur_resolution) % 10000; + *val2 = (reg * data->cur_resolution) % 10000 * 100; return IIO_VAL_INT_PLUS_MICRO; } *val = reg; -- cgit From fbe5956e8809f04e9121923db0b6d1b94f2b93ba Mon Sep 17 00:00:00 2001 From: Julien Stephan Date: Tue, 22 Oct 2024 15:22:36 +0200 Subject: dt-bindings: iio: adc: ad7380: fix ad7380-4 reference supply ad7380-4 is the only device from the ad738x family that doesn't have an internal reference. Moreover, its external reference is called REFIN in the datasheet, while all others use REFIO as an optional external reference. If refio-supply is omitted, the internal reference is used. Fix the binding by adding refin-supply and making it required for ad7380-4 only.
Fixes: 1a291cc8ee17 ("dt-bindings: iio: adc: ad7380: add support for ad738x-4 4 channels variants") Acked-by: Conor Dooley Reviewed-by: David Lechner Signed-off-by: Julien Stephan Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-1-f0cefe1b7fa6@baylibre.com Cc: Signed-off-by: Jonathan Cameron --- .../devicetree/bindings/iio/adc/adi,ad7380.yaml | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml b/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml index bd19abb867d9..0065d6508824 100644 --- a/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml +++ b/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml @@ -67,6 +67,10 @@ properties: A 2.5V to 3.3V supply for the external reference voltage. When omitted, the internal 2.5V reference is used. + refin-supply: + description: + A 2.5V to 3.3V supply for external reference voltage, for ad7380-4 only. + aina-supply: description: The common mode voltage supply for the AINA- pin on pseudo-differential @@ -135,6 +139,23 @@ allOf: ainc-supply: false aind-supply: false + # ad7380-4 uses refin-supply as external reference. + # All other chips from ad738x family use refio as optional external reference. + # When refio-supply is omitted, internal reference is used. + - if: + properties: + compatible: + enum: + - adi,ad7380-4 + then: + properties: + refio-supply: false + required: + - refin-supply + else: + properties: + refin-supply: false + examples: - | #include -- cgit From 2ac6b2e823b52f7f4abf1d3a97d11889e22d0d16 Mon Sep 17 00:00:00 2001 From: Julien Stephan Date: Tue, 22 Oct 2024 15:22:37 +0200 Subject: iio: adc: ad7380: use devm_regulator_get_enable_read_voltage() Use devm_regulator_get_enable_read_voltage() to simplify the code. Reviewed-by: Nuno Sa Reviewed-by: David Lechner Signed-off-by: Julien Stephan Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-2-f0cefe1b7fa6@baylibre.com Signed-off-by: Jonathan Cameron --- drivers/iio/adc/ad7380.c | 81 +++++++++++++----------------------------------- 1 file changed, 21 insertions(+), 60 deletions(-) diff --git a/drivers/iio/adc/ad7380.c b/drivers/iio/adc/ad7380.c index e8bddfb0d07d..e033c7341911 100644 --- a/drivers/iio/adc/ad7380.c +++ b/drivers/iio/adc/ad7380.c @@ -956,7 +956,7 @@ static const struct iio_info ad7380_info = { .debugfs_reg_access = &ad7380_debugfs_reg_access, }; -static int ad7380_init(struct ad7380_state *st, struct regulator *vref) +static int ad7380_init(struct ad7380_state *st, bool external_ref_en) { int ret; @@ -968,13 +968,13 @@ static int ad7380_init(struct ad7380_state *st, struct regulator *vref) if (ret < 0) return ret; - /* select internal or external reference voltage */ - ret = regmap_update_bits(st->regmap, AD7380_REG_ADDR_CONFIG1, - AD7380_CONFIG1_REFSEL, - FIELD_PREP(AD7380_CONFIG1_REFSEL, - vref ? 1 : 0)); - if (ret < 0) - return ret; + if (external_ref_en) { + /* select external reference voltage */ + ret = regmap_set_bits(st->regmap, AD7380_REG_ADDR_CONFIG1, + AD7380_CONFIG1_REFSEL); + if (ret < 0) + return ret; + } /* This is the default value after reset. 
*/ st->oversampling_ratio = 1; @@ -987,16 +987,11 @@ static int ad7380_init(struct ad7380_state *st, struct regulator *vref) FIELD_PREP(AD7380_CONFIG2_SDO, 1)); } -static void ad7380_regulator_disable(void *p) -{ - regulator_disable(p); -} - static int ad7380_probe(struct spi_device *spi) { struct iio_dev *indio_dev; struct ad7380_state *st; - struct regulator *vref; + bool external_ref_en; int ret, i; indio_dev = devm_iio_device_alloc(&spi->dev, sizeof(*st)); @@ -1009,37 +1004,17 @@ static int ad7380_probe(struct spi_device *spi) if (!st->chip_info) return dev_err_probe(&spi->dev, -EINVAL, "missing match data\n"); - vref = devm_regulator_get_optional(&spi->dev, "refio"); - if (IS_ERR(vref)) { - if (PTR_ERR(vref) != -ENODEV) - return dev_err_probe(&spi->dev, PTR_ERR(vref), - "Failed to get refio regulator\n"); - - vref = NULL; - } - /* * If there is no REFIO supply, then it means that we are using * the internal 2.5V reference, otherwise REFIO is reference voltage. */ - if (vref) { - ret = regulator_enable(vref); - if (ret) - return ret; + ret = devm_regulator_get_enable_read_voltage(&spi->dev, "refio"); + if (ret < 0 && ret != -ENODEV) + return dev_err_probe(&spi->dev, ret, + "Failed to get refio regulator\n"); - ret = devm_add_action_or_reset(&spi->dev, - ad7380_regulator_disable, vref); - if (ret) - return ret; - - ret = regulator_get_voltage(vref); - if (ret < 0) - return ret; - - st->vref_mv = ret / 1000; - } else { - st->vref_mv = AD7380_INTERNAL_REF_MV; - } + external_ref_en = ret != -ENODEV; + st->vref_mv = external_ref_en ? ret / 1000 : AD7380_INTERNAL_REF_MV; if (st->chip_info->num_vcm_supplies > ARRAY_SIZE(st->vcm_mv)) return dev_err_probe(&spi->dev, -EINVAL, @@ -1050,27 +1025,13 @@ static int ad7380_probe(struct spi_device *spi) * input pin. */ for (i = 0; i < st->chip_info->num_vcm_supplies; i++) { - struct regulator *vcm; - - vcm = devm_regulator_get(&spi->dev, - st->chip_info->vcm_supplies[i]); - if (IS_ERR(vcm)) - return dev_err_probe(&spi->dev, PTR_ERR(vcm), - "Failed to get %s regulator\n", - st->chip_info->vcm_supplies[i]); + const char *vcm = st->chip_info->vcm_supplies[i]; - ret = regulator_enable(vcm); - if (ret) - return ret; - - ret = devm_add_action_or_reset(&spi->dev, - ad7380_regulator_disable, vcm); - if (ret) - return ret; - - ret = regulator_get_voltage(vcm); + ret = devm_regulator_get_enable_read_voltage(&spi->dev, vcm); if (ret < 0) - return ret; + return dev_err_probe(&spi->dev, ret, + "Failed to get %s regulator\n", + vcm); st->vcm_mv[i] = ret / 1000; } @@ -1135,7 +1096,7 @@ static int ad7380_probe(struct spi_device *spi) if (ret) return ret; - ret = ad7380_init(st, vref); + ret = ad7380_init(st, external_ref_en); if (ret) return ret; -- cgit From 7ddbc2728728f9b832ade7b4c180efdc2f22e8b9 Mon Sep 17 00:00:00 2001 From: Julien Stephan Date: Tue, 22 Oct 2024 15:22:38 +0200 Subject: iio: adc: ad7380: add missing supplies vcc and vlogic are required but are not retrieved and enabled in the probe. Add them. 
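A compact sketch of the resulting probe-time pattern (consumer names from the binding; 5000 us stands in for the driver's T_POWERUP_US): the supplies are fetched and enabled in one devm-managed call, then the part is given time to power up.

  #include <linux/delay.h>
  #include <linux/device.h>
  #include <linux/kernel.h>
  #include <linux/regulator/consumer.h>

  static const char * const example_supplies[] = { "vcc", "vlogic" };

  static int example_enable_supplies(struct device *dev)
  {
          int ret;

          /* Get and enable all supplies at once, with managed cleanup. */
          ret = devm_regulator_bulk_get_enable(dev, ARRAY_SIZE(example_supplies),
                                               example_supplies);
          if (ret)
                  return dev_err_probe(dev, ret,
                                       "Failed to enable power supplies\n");

          fsleep(5000);   /* allow the ADC to power up */
          return 0;
  }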
In order to prepare support for additional parts requiring different supplies, add vcc and vlogic to the platform specific structures Reviewed-by: Nuno Sa Reviewed-by: David Lechner Signed-off-by: Julien Stephan Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-3-f0cefe1b7fa6@baylibre.com Signed-off-by: Jonathan Cameron --- drivers/iio/adc/ad7380.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/drivers/iio/adc/ad7380.c b/drivers/iio/adc/ad7380.c index e033c7341911..b107d8e97ab3 100644 --- a/drivers/iio/adc/ad7380.c +++ b/drivers/iio/adc/ad7380.c @@ -75,6 +75,7 @@ #define T_CONVERT_NS 190 /* conversion time */ #define T_CONVERT_0_NS 10 /* 1st conversion start time (oversampling) */ #define T_CONVERT_X_NS 500 /* xth conversion start time (oversampling) */ +#define T_POWERUP_US 5000 /* Power up */ struct ad7380_timing_specs { const unsigned int t_csh_ns; /* CS minimum high time */ @@ -86,6 +87,8 @@ struct ad7380_chip_info { unsigned int num_channels; unsigned int num_simult_channels; bool has_mux; + const char * const *supplies; + unsigned int num_supplies; const char * const *vcm_supplies; unsigned int num_vcm_supplies; const unsigned long *available_scan_masks; @@ -243,6 +246,10 @@ DEFINE_AD7380_8_CHANNEL(ad7386_4_channels, 16, 0, u); DEFINE_AD7380_8_CHANNEL(ad7387_4_channels, 14, 0, u); DEFINE_AD7380_8_CHANNEL(ad7388_4_channels, 12, 0, u); +static const char * const ad7380_supplies[] = { + "vcc", "vlogic", +}; + static const char * const ad7380_2_channel_vcm_supplies[] = { "aina", "ainb", }; @@ -338,6 +345,8 @@ static const struct ad7380_chip_info ad7380_chip_info = { .channels = ad7380_channels, .num_channels = ARRAY_SIZE(ad7380_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .available_scan_masks = ad7380_2_channel_scan_masks, .timing_specs = &ad7380_timing, }; @@ -347,6 +356,8 @@ static const struct ad7380_chip_info ad7381_chip_info = { .channels = ad7381_channels, .num_channels = ARRAY_SIZE(ad7381_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .available_scan_masks = ad7380_2_channel_scan_masks, .timing_specs = &ad7380_timing, }; @@ -356,6 +367,8 @@ static const struct ad7380_chip_info ad7383_chip_info = { .channels = ad7383_channels, .num_channels = ARRAY_SIZE(ad7383_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .vcm_supplies = ad7380_2_channel_vcm_supplies, .num_vcm_supplies = ARRAY_SIZE(ad7380_2_channel_vcm_supplies), .available_scan_masks = ad7380_2_channel_scan_masks, @@ -367,6 +380,8 @@ static const struct ad7380_chip_info ad7384_chip_info = { .channels = ad7384_channels, .num_channels = ARRAY_SIZE(ad7384_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .vcm_supplies = ad7380_2_channel_vcm_supplies, .num_vcm_supplies = ARRAY_SIZE(ad7380_2_channel_vcm_supplies), .available_scan_masks = ad7380_2_channel_scan_masks, @@ -378,6 +393,8 @@ static const struct ad7380_chip_info ad7386_chip_info = { .channels = ad7386_channels, .num_channels = ARRAY_SIZE(ad7386_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .has_mux = true, .available_scan_masks = ad7380_2x2_channel_scan_masks, .timing_specs = &ad7380_timing, @@ -388,6 +405,8 @@ static const struct ad7380_chip_info ad7387_chip_info = { .channels = 
ad7387_channels, .num_channels = ARRAY_SIZE(ad7387_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .has_mux = true, .available_scan_masks = ad7380_2x2_channel_scan_masks, .timing_specs = &ad7380_timing, @@ -398,6 +417,8 @@ static const struct ad7380_chip_info ad7388_chip_info = { .channels = ad7388_channels, .num_channels = ARRAY_SIZE(ad7388_channels), .num_simult_channels = 2, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .has_mux = true, .available_scan_masks = ad7380_2x2_channel_scan_masks, .timing_specs = &ad7380_timing, @@ -408,6 +429,8 @@ static const struct ad7380_chip_info ad7380_4_chip_info = { .channels = ad7380_4_channels, .num_channels = ARRAY_SIZE(ad7380_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .available_scan_masks = ad7380_4_channel_scan_masks, .timing_specs = &ad7380_4_timing, }; @@ -417,6 +440,8 @@ static const struct ad7380_chip_info ad7381_4_chip_info = { .channels = ad7381_4_channels, .num_channels = ARRAY_SIZE(ad7381_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .available_scan_masks = ad7380_4_channel_scan_masks, .timing_specs = &ad7380_4_timing, }; @@ -426,6 +451,8 @@ static const struct ad7380_chip_info ad7383_4_chip_info = { .channels = ad7383_4_channels, .num_channels = ARRAY_SIZE(ad7383_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .vcm_supplies = ad7380_4_channel_vcm_supplies, .num_vcm_supplies = ARRAY_SIZE(ad7380_4_channel_vcm_supplies), .available_scan_masks = ad7380_4_channel_scan_masks, @@ -437,6 +464,8 @@ static const struct ad7380_chip_info ad7384_4_chip_info = { .channels = ad7384_4_channels, .num_channels = ARRAY_SIZE(ad7384_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .vcm_supplies = ad7380_4_channel_vcm_supplies, .num_vcm_supplies = ARRAY_SIZE(ad7380_4_channel_vcm_supplies), .available_scan_masks = ad7380_4_channel_scan_masks, @@ -448,6 +477,8 @@ static const struct ad7380_chip_info ad7386_4_chip_info = { .channels = ad7386_4_channels, .num_channels = ARRAY_SIZE(ad7386_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .has_mux = true, .available_scan_masks = ad7380_2x4_channel_scan_masks, .timing_specs = &ad7380_4_timing, @@ -458,6 +489,8 @@ static const struct ad7380_chip_info ad7387_4_chip_info = { .channels = ad7387_4_channels, .num_channels = ARRAY_SIZE(ad7387_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .has_mux = true, .available_scan_masks = ad7380_2x4_channel_scan_masks, .timing_specs = &ad7380_4_timing, @@ -468,6 +501,8 @@ static const struct ad7380_chip_info ad7388_4_chip_info = { .channels = ad7388_4_channels, .num_channels = ARRAY_SIZE(ad7388_4_channels), .num_simult_channels = 4, + .supplies = ad7380_supplies, + .num_supplies = ARRAY_SIZE(ad7380_supplies), .has_mux = true, .available_scan_masks = ad7380_2x4_channel_scan_masks, .timing_specs = &ad7380_4_timing, @@ -1004,6 +1039,14 @@ static int ad7380_probe(struct spi_device *spi) if (!st->chip_info) return dev_err_probe(&spi->dev, -EINVAL, "missing match data\n"); + ret = devm_regulator_bulk_get_enable(&spi->dev, st->chip_info->num_supplies, + st->chip_info->supplies); + 
+ if (ret) + return dev_err_probe(&spi->dev, ret, + "Failed to enable power supplies\n"); + fsleep(T_POWERUP_US); + /* * If there is no REFIO supply, then it means that we are using * the internal 2.5V reference, otherwise REFIO is reference voltage. -- cgit From 05f9c67179c9a8d66dee175fb4b17f380908a26f Mon Sep 17 00:00:00 2001 From: Julien Stephan Date: Tue, 22 Oct 2024 15:22:39 +0200 Subject: iio: adc: ad7380: fix supplies for ad7380-4 ad7380-4 is the only device in the family that does not have an internal reference. It uses "refin" as a required external reference. All other devices in the family use "refio"" as an optional external reference. Fixes: 737413da8704 ("iio: adc: ad7380: add support for ad738x-4 4 channels variants") Reviewed-by: Nuno Sa Reviewed-by: David Lechner Signed-off-by: Julien Stephan Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-4-f0cefe1b7fa6@baylibre.com Cc: Signed-off-by: Jonathan Cameron --- drivers/iio/adc/ad7380.c | 36 ++++++++++++++++++++++++++---------- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/drivers/iio/adc/ad7380.c b/drivers/iio/adc/ad7380.c index b107d8e97ab3..fb728570debe 100644 --- a/drivers/iio/adc/ad7380.c +++ b/drivers/iio/adc/ad7380.c @@ -89,6 +89,7 @@ struct ad7380_chip_info { bool has_mux; const char * const *supplies; unsigned int num_supplies; + bool external_ref_only; const char * const *vcm_supplies; unsigned int num_vcm_supplies; const unsigned long *available_scan_masks; @@ -431,6 +432,7 @@ static const struct ad7380_chip_info ad7380_4_chip_info = { .num_simult_channels = 4, .supplies = ad7380_supplies, .num_supplies = ARRAY_SIZE(ad7380_supplies), + .external_ref_only = true, .available_scan_masks = ad7380_4_channel_scan_masks, .timing_specs = &ad7380_4_timing, }; @@ -1047,17 +1049,31 @@ static int ad7380_probe(struct spi_device *spi) "Failed to enable power supplies\n"); fsleep(T_POWERUP_US); - /* - * If there is no REFIO supply, then it means that we are using - * the internal 2.5V reference, otherwise REFIO is reference voltage. - */ - ret = devm_regulator_get_enable_read_voltage(&spi->dev, "refio"); - if (ret < 0 && ret != -ENODEV) - return dev_err_probe(&spi->dev, ret, - "Failed to get refio regulator\n"); + if (st->chip_info->external_ref_only) { + ret = devm_regulator_get_enable_read_voltage(&spi->dev, + "refin"); + if (ret < 0) + return dev_err_probe(&spi->dev, ret, + "Failed to get refin regulator\n"); + + st->vref_mv = ret / 1000; - external_ref_en = ret != -ENODEV; - st->vref_mv = external_ref_en ? ret / 1000 : AD7380_INTERNAL_REF_MV; + /* these chips don't have a register bit for this */ + external_ref_en = false; + } else { + /* + * If there is no REFIO supply, then it means that we are using + * the internal reference, otherwise REFIO is reference voltage. + */ + ret = devm_regulator_get_enable_read_voltage(&spi->dev, + "refio"); + if (ret < 0 && ret != -ENODEV) + return dev_err_probe(&spi->dev, ret, + "Failed to get refio regulator\n"); + + external_ref_en = ret != -ENODEV; + st->vref_mv = external_ref_en ? ret / 1000 : AD7380_INTERNAL_REF_MV; + } if (st->chip_info->num_vcm_supplies > ARRAY_SIZE(st->vcm_mv)) return dev_err_probe(&spi->dev, -EINVAL, -- cgit From 795114e849ddfd48150eb0135d04748a8c81cec5 Mon Sep 17 00:00:00 2001 From: Julien Stephan Date: Tue, 22 Oct 2024 15:22:40 +0200 Subject: docs: iio: ad7380: fix supply for ad7380-4 ad7380-4 is the only device from ad738x family that doesn't have an internal reference. 
Moreover it's external reference is called REFIN in the datasheet while all other use REFIO as an optional external reference. Update documentation to highlight this. Fixes: 3e82dfc82f38 ("docs: iio: new docs for ad7380 driver") Reviewed-by: David Lechner Signed-off-by: Julien Stephan Link: https://patch.msgid.link/20241022-ad7380-fix-supplies-v3-5-f0cefe1b7fa6@baylibre.com Cc: Signed-off-by: Jonathan Cameron --- Documentation/iio/ad7380.rst | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/Documentation/iio/ad7380.rst b/Documentation/iio/ad7380.rst index 9c784c1e652e..6f70b49b9ef2 100644 --- a/Documentation/iio/ad7380.rst +++ b/Documentation/iio/ad7380.rst @@ -41,13 +41,22 @@ supports only 1 SDO line. Reference voltage ----------------- -2 possible reference voltage sources are supported: +ad7380-4 +~~~~~~~~ + +ad7380-4 supports only an external reference voltage (2.5V to 3.3V). It must be +declared in the device tree as ``refin-supply``. + +All other devices from ad738x family +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All other devices from ad738x support 2 possible reference voltage sources: - Internal reference (2.5V) - External reference (2.5V to 3.3V) The source is determined by the device tree. If ``refio-supply`` is present, -then the external reference is used, else the internal reference is used. +then it is used as external reference, else the internal reference is used. Oversampling and resolution boost --------------------------------- -- cgit From 6bd301819f8f69331a55ae2336c8b111fc933f3d Mon Sep 17 00:00:00 2001 From: Zicheng Qu Date: Tue, 22 Oct 2024 13:43:54 +0000 Subject: staging: iio: frequency: ad9832: fix division by zero in ad9832_calc_freqreg() In the ad9832_write_frequency() function, clk_get_rate() might return 0. This can lead to a division by zero when calling ad9832_calc_freqreg(). The check if (fout > (clk_get_rate(st->mclk) / 2)) does not protect against the case when fout is 0. The ad9832_write_frequency() function is called from ad9832_write(), and fout is derived from a text buffer, which can contain any value. 
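The hazard is easiest to see in isolation: the clock framework may legitimately report a rate of 0, and the old bound check cannot reject that before the value reaches the divide. Below is a minimal sketch of the guard pattern the fix applies; the helper name is illustrative, not taken from the driver.

#include <linux/clk.h>
#include <linux/errno.h>

/* Illustrative only: read the rate once and reject 0 before any division. */
static int example_validate_fout(struct clk *mclk, unsigned long fout)
{
        unsigned long clk_freq = clk_get_rate(mclk);    /* may be 0 */

        if (!clk_freq || fout > clk_freq / 2)
                return -EINVAL;

        return 0;       /* clk_freq is now known to be a safe divisor */
}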
Link: https://lore.kernel.org/all/2024100904-CVE-2024-47663-9bdc@gregkh/ Fixes: ea707584bac1 ("Staging: IIO: DDS: AD9832 / AD9835 driver") Cc: stable@vger.kernel.org Signed-off-by: Zicheng Qu Reviewed-by: Nuno Sa Reviewed-by: Dan Carpenter Link: https://patch.msgid.link/20241022134354.574614-1-quzicheng@huawei.com Signed-off-by: Jonathan Cameron --- drivers/staging/iio/frequency/ad9832.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/staging/iio/frequency/ad9832.c b/drivers/staging/iio/frequency/ad9832.c index 6c390c4eb26d..492612e8f8ba 100644 --- a/drivers/staging/iio/frequency/ad9832.c +++ b/drivers/staging/iio/frequency/ad9832.c @@ -129,12 +129,15 @@ static unsigned long ad9832_calc_freqreg(unsigned long mclk, unsigned long fout) static int ad9832_write_frequency(struct ad9832_state *st, unsigned int addr, unsigned long fout) { + unsigned long clk_freq; unsigned long regval; - if (fout > (clk_get_rate(st->mclk) / 2)) + clk_freq = clk_get_rate(st->mclk); + + if (!clk_freq || fout > (clk_freq / 2)) return -EINVAL; - regval = ad9832_calc_freqreg(clk_get_rate(st->mclk), fout); + regval = ad9832_calc_freqreg(clk_freq, fout); st->freq_data[0] = cpu_to_be16((AD9832_CMD_FRE8BITSW << CMD_SHIFT) | (addr << ADD_SHIFT) | -- cgit From efa353ae1b0541981bc96dbf2e586387d0392baa Mon Sep 17 00:00:00 2001 From: Zicheng Qu Date: Tue, 22 Oct 2024 13:43:30 +0000 Subject: iio: adc: ad7124: fix division by zero in ad7124_set_channel_odr() In the ad7124_write_raw() function, parameter val can potentially be zero. This may lead to a division by zero when DIV_ROUND_CLOSEST() is called within ad7124_set_channel_odr(). The ad7124_write_raw() function is invoked through the sequence: iio_write_channel_raw() -> iio_write_channel_attribute() -> iio_channel_write(), with no checks in place to ensure val is non-zero. Cc: stable@vger.kernel.org Fixes: 7b8d045e497a ("iio: adc: ad7124: allow more than 8 channels") Signed-off-by: Zicheng Qu Reviewed-by: Nuno Sa Link: https://patch.msgid.link/20241022134330.574601-1-quzicheng@huawei.com Signed-off-by: Jonathan Cameron --- drivers/iio/adc/ad7124.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iio/adc/ad7124.c b/drivers/iio/adc/ad7124.c index a5d91933f505..b79c48d46ccc 100644 --- a/drivers/iio/adc/ad7124.c +++ b/drivers/iio/adc/ad7124.c @@ -637,7 +637,7 @@ static int ad7124_write_raw(struct iio_dev *indio_dev, switch (info) { case IIO_CHAN_INFO_SAMP_FREQ: - if (val2 != 0) { + if (val2 != 0 || val == 0) { ret = -EINVAL; break; } -- cgit From 7bd4923940c8d67d9f3f3fde8d7c067e9e804fc6 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Thu, 24 Oct 2024 09:55:53 +0800 Subject: iio: dac: Kconfig: Fix build error for ltc2664 If REGMAP_SPI is n and LTC2664 is y, the following build error occurs: riscv64-unknown-linux-gnu-ld: drivers/iio/dac/ltc2664.o: in function `ltc2664_probe': ltc2664.c:(.text+0x714): undefined reference to `__devm_regmap_init_spi' Select REGMAP_SPI instead of REGMAP for LTC2664 to fix it. 
Fixes: 4cc2fc445d2e ("iio: dac: ltc2664: Add driver for LTC2664 and LTC2672") Reviewed-by: Nuno Sa Signed-off-by: Jinjie Ruan Link: https://patch.msgid.link/20241024015553.1111253-1-ruanjinjie@huawei.com Signed-off-by: Jonathan Cameron --- drivers/iio/dac/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iio/dac/Kconfig b/drivers/iio/dac/Kconfig index 45e337c6d256..9f5d5ebb8653 100644 --- a/drivers/iio/dac/Kconfig +++ b/drivers/iio/dac/Kconfig @@ -380,7 +380,7 @@ config LTC2632 config LTC2664 tristate "Analog Devices LTC2664 and LTC2672 DAC SPI driver" depends on SPI - select REGMAP + select REGMAP_SPI help Say yes here to build support for Analog Devices LTC2664 and LTC2672 converters (DAC). -- cgit From bf40167d54d55d4b54d0103713d86a8638fb9290 Mon Sep 17 00:00:00 2001 From: Alexandre Ghiti Date: Wed, 16 Oct 2024 10:36:24 +0200 Subject: riscv: vdso: Prevent the compiler from inserting calls to memset() The compiler is smart enough to insert a call to memset() in riscv_vdso_get_cpus(), which generates a dynamic relocation. So prevent this by using -fno-builtin option. Fixes: e2c0cdfba7f6 ("RISC-V: User-facing API") Cc: stable@vger.kernel.org Signed-off-by: Alexandre Ghiti Reviewed-by: Guo Ren Link: https://lore.kernel.org/r/20241016083625.136311-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/vdso/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/riscv/kernel/vdso/Makefile b/arch/riscv/kernel/vdso/Makefile index 960feb1526ca..3f1c4b2d0b06 100644 --- a/arch/riscv/kernel/vdso/Makefile +++ b/arch/riscv/kernel/vdso/Makefile @@ -18,6 +18,7 @@ obj-vdso = $(patsubst %, %.o, $(vdso-syms)) note.o ccflags-y := -fno-stack-protector ccflags-y += -DDISABLE_BRANCH_PROFILING +ccflags-y += -fno-builtin ifneq ($(c-gettimeofday-y),) CFLAGS_vgettimeofday.o += -fPIC -include $(c-gettimeofday-y) -- cgit From cce3cd647721dc30273f0546852b5c26820eb715 Mon Sep 17 00:00:00 2001 From: Li Zhijian Date: Tue, 22 Oct 2024 11:00:54 +0800 Subject: cxl/core: Return error when cxl_endpoint_gather_bandwidth() handles a non-PCI device The function cxl_endpoint_gather_bandwidth() invokes pci_bus_read/write_XXX(), however, not all CXL devices are presently implemented via PCI. It is recognized that the cxl_test has realized a CXL device using a platform device. Calling pci_bus_read/write_XXX() in cxl_test will cause kernel panic: platform cxl_host_bridge.3: host supports CXL (restricted) Oops: general protection fault, probably for non-canonical address 0x3ef17856fcae4fbd: 0000 [#1] PREEMPT SMP PTI Call Trace: ? __die_body.cold+0x19/0x27 ? die_addr+0x38/0x60 ? exc_general_protection+0x1f5/0x4b0 ? asm_exc_general_protection+0x22/0x30 ? pci_bus_read_config_word+0x1c/0x60 pcie_capability_read_word+0x93/0xb0 pcie_link_speed_mbps+0x18/0x50 cxl_pci_get_bandwidth+0x18/0x60 [cxl_core] cxl_endpoint_gather_bandwidth.constprop.0+0xf4/0x230 [cxl_core] ? xas_store+0x54/0x660 ? preempt_count_add+0x69/0xa0 ? _raw_spin_lock+0x13/0x40 ? __kmalloc_cache_noprof+0xe7/0x270 cxl_region_shared_upstream_bandwidth_update+0x9c/0x790 [cxl_core] cxl_region_attach+0x520/0x7e0 [cxl_core] store_targetN+0xf2/0x120 [cxl_core] kernfs_fop_write_iter+0x13a/0x1f0 vfs_write+0x23b/0x410 ksys_write+0x53/0xd0 do_syscall_64+0x62/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e And Ying also reported a KASAN error with similar calltrace. 
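Because a CXL endpoint is not guaranteed to sit on a PCI bus, any path that ends up in PCI config space has to verify the device type first. Below is a hedged sketch of that guard, with an illustrative function name and register choice rather than the patch's exact code.

#include <linux/pci.h>

static int example_read_link_status(struct device *dev, u16 *lnksta)
{
        struct pci_dev *pdev;

        if (!dev_is_pci(dev))
                return -ENODEV; /* e.g. a cxl_test platform device */

        pdev = to_pci_dev(dev);
        return pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, lnksta);
}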
Reported-by: Huang, Ying Closes: http://lore.kernel.org/87y12w9vp5.fsf@yhuang6-desk2.ccr.corp.intel.com Fixes: a5ab0de0ebaa ("cxl: Calculate region bandwidth of targets with shared upstream link") Signed-off-by: Li Zhijian Reviewed-by: Dan Williams Tested-by: Huang, Ying Link: https://patch.msgid.link/20241022030054.258942-1-lizhijian@fujitsu.com Signed-off-by: Ira Weiny --- drivers/cxl/core/cdat.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c index ef1621d40f05..e9cd7939c407 100644 --- a/drivers/cxl/core/cdat.c +++ b/drivers/cxl/core/cdat.c @@ -641,6 +641,9 @@ static int cxl_endpoint_gather_bandwidth(struct cxl_region *cxlr, void *ptr; int rc; + if (!dev_is_pci(cxlds->dev)) + return -ENODEV; + if (cxlds->rcd) return -ENODEV; -- cgit From 2045fc4295c427d420aa1ff551b4de8179b6e5d5 Mon Sep 17 00:00:00 2001 From: Gianfranco Trad Date: Wed, 23 Oct 2024 23:30:44 +0200 Subject: bcachefs: Fix invalid shift in validate_sb_layout() Add check on layout->sb_max_size_bits against BCH_SB_LAYOUT_SIZE_BITS_MAX to prevent UBSAN shift-out-of-bounds in validate_sb_layout(). Reported-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=089fad5a3a5e77825426 Fixes: 03ef80b469d5 ("bcachefs: Ignore unknown mount options") Tested-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com Signed-off-by: Gianfranco Trad Signed-off-by: Kent Overstreet --- fs/bcachefs/errcode.h | 1 + fs/bcachefs/super-io.c | 5 +++++ 2 files changed, 6 insertions(+) diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h index 649263516ab1..b6cbd716000b 100644 --- a/fs/bcachefs/errcode.h +++ b/fs/bcachefs/errcode.h @@ -222,6 +222,7 @@ x(BCH_ERR_invalid_sb_layout, invalid_sb_layout_type) \ x(BCH_ERR_invalid_sb_layout, invalid_sb_layout_nr_superblocks) \ x(BCH_ERR_invalid_sb_layout, invalid_sb_layout_superblocks_overlap) \ + x(BCH_ERR_invalid_sb_layout, invalid_sb_layout_sb_max_size_bits) \ x(BCH_ERR_invalid_sb, invalid_sb_members_missing) \ x(BCH_ERR_invalid_sb, invalid_sb_members) \ x(BCH_ERR_invalid_sb, invalid_sb_disk_groups) \ diff --git a/fs/bcachefs/super-io.c b/fs/bcachefs/super-io.c index ce7410d72089..7c71594f6a8b 100644 --- a/fs/bcachefs/super-io.c +++ b/fs/bcachefs/super-io.c @@ -287,6 +287,11 @@ static int validate_sb_layout(struct bch_sb_layout *layout, struct printbuf *out return -BCH_ERR_invalid_sb_layout_nr_superblocks; } + if (layout->sb_max_size_bits > BCH_SB_LAYOUT_SIZE_BITS_MAX) { + prt_printf(out, "Invalid superblock layout: max_size_bits too high"); + return -BCH_ERR_invalid_sb_layout_sb_max_size_bits; + } + max_sectors = 1 << layout->sb_max_size_bits; prev_offset = le64_to_cpu(layout->sb_offset[0]); -- cgit From 5c41f75d1b921b9eaf79588cdd3b22b00fb4ec52 Mon Sep 17 00:00:00 2001 From: Jeongjun Park Date: Tue, 22 Oct 2024 00:43:56 +0900 Subject: bcachefs: fix shift oob in alloc_lru_idx_fragmentation The size of a.data_type is set abnormally large, causing shift-out-of-bounds. To fix this, we need to add validation on a.data_type in alloc_lru_idx_fragmentation(). 
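The underlying rule is that an enum value read from disk is untrusted input and must be range-checked before it is used as a shift amount or array index. A minimal sketch of that pattern follows, using hypothetical names rather than bcachefs types.

#include <linux/types.h>

enum example_data_type { EX_DATA_FREE, EX_DATA_USER, EX_DATA_NR };

static u64 example_lru_idx(u8 data_type)
{
        if (data_type >= EX_DATA_NR)    /* untrusted on-disk value */
                return 0;

        return 1ULL << data_type;       /* shift is now provably in bounds */
}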
Reported-by: syzbot+7f45fa9805c40db3f108@syzkaller.appspotmail.com Fixes: 260af1562ec1 ("bcachefs: Kill alloc_v4.fragmentation_lru") Signed-off-by: Jeongjun Park Signed-off-by: Kent Overstreet --- fs/bcachefs/alloc_background.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/bcachefs/alloc_background.h b/fs/bcachefs/alloc_background.h index f8e87c6721b1..163a67b97a40 100644 --- a/fs/bcachefs/alloc_background.h +++ b/fs/bcachefs/alloc_background.h @@ -168,6 +168,9 @@ static inline bool data_type_movable(enum bch_data_type type) static inline u64 alloc_lru_idx_fragmentation(struct bch_alloc_v4 a, struct bch_dev *ca) { + if (a.data_type >= BCH_DATA_NR) + return 0; + if (!data_type_movable(a.data_type) || !bch2_bucket_sectors_fragmented(ca, a)) return 0; -- cgit From 032532f91a1d06d0750f16c49a9698ef5374a68f Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Thu, 24 Oct 2024 23:56:12 +0200 Subject: ASoC: codecs: rt5640: Always disable IRQs from rt5640_cancel_work() Disable IRQs from rt5640_cancel_work(), this fixes a crash caused by the IRQ never getting freed when the driver is unbound from the i2c_client with jack-detection active: [ 193.138780] rt5640 i2c-rt5640: ASoC: unknown pin LDO2 [ 193.138830] rt5640 i2c-rt5640: ASoC: unknown pin MICBIAS1 [ 193.671218] BUG: kernel NULL pointer dereference, address: 0000000000000078 [ 193.671239] #PF: supervisor read access in kernel mode [ 193.671248] #PF: error_code(0x0000) - not-present page ... [ 193.671531] ? asm_exc_page_fault+0x22/0x30 [ 193.671551] ? rt5640_jack_inserted+0x10/0x80 [snd_soc_rt5640] [ 193.671574] rt5640_detect_headset+0x93/0x130 [snd_soc_rt5640] [ 193.671596] rt5640_jack_work+0x93/0x355 [snd_soc_rt5640] Signed-off-by: Hans de Goede Link: https://patch.msgid.link/20241024215612.92147-1-hdegoede@redhat.com Signed-off-by: Mark Brown --- sound/soc/codecs/rt5640.c | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/sound/soc/codecs/rt5640.c b/sound/soc/codecs/rt5640.c index 16f3425a3e35..855139348edb 100644 --- a/sound/soc/codecs/rt5640.c +++ b/sound/soc/codecs/rt5640.c @@ -2419,10 +2419,20 @@ static irqreturn_t rt5640_jd_gpio_irq(int irq, void *data) return IRQ_HANDLED; } -static void rt5640_cancel_work(void *data) +static void rt5640_disable_irq_and_cancel_work(void *data) { struct rt5640_priv *rt5640 = data; + if (rt5640->jd_gpio_irq_requested) { + free_irq(rt5640->jd_gpio_irq, rt5640); + rt5640->jd_gpio_irq_requested = false; + } + + if (rt5640->irq_requested) { + free_irq(rt5640->irq, rt5640); + rt5640->irq_requested = false; + } + cancel_delayed_work_sync(&rt5640->jack_work); cancel_delayed_work_sync(&rt5640->bp_work); } @@ -2463,13 +2473,7 @@ static void rt5640_disable_jack_detect(struct snd_soc_component *component) if (!rt5640->jack) return; - if (rt5640->jd_gpio_irq_requested) - free_irq(rt5640->jd_gpio_irq, rt5640); - - if (rt5640->irq_requested) - free_irq(rt5640->irq, rt5640); - - rt5640_cancel_work(rt5640); + rt5640_disable_irq_and_cancel_work(rt5640); if (rt5640->jack->status & SND_JACK_MICROPHONE) { rt5640_disable_micbias1_ovcd_irq(component); @@ -2477,8 +2481,6 @@ static void rt5640_disable_jack_detect(struct snd_soc_component *component) snd_soc_jack_report(rt5640->jack, 0, SND_JACK_BTN_0); } - rt5640->jd_gpio_irq_requested = false; - rt5640->irq_requested = false; rt5640->jd_gpio = NULL; rt5640->jack = NULL; } @@ -2798,7 +2800,8 @@ static int rt5640_suspend(struct snd_soc_component *component) if (rt5640->jack) { /* disable jack interrupts during system suspend */ 
disable_irq(rt5640->irq); - rt5640_cancel_work(rt5640); + cancel_delayed_work_sync(&rt5640->jack_work); + cancel_delayed_work_sync(&rt5640->bp_work); } snd_soc_component_force_bias_level(component, SND_SOC_BIAS_OFF); @@ -3032,7 +3035,7 @@ static int rt5640_i2c_probe(struct i2c_client *i2c) INIT_DELAYED_WORK(&rt5640->jack_work, rt5640_jack_work); /* Make sure work is stopped on probe-error / remove */ - ret = devm_add_action_or_reset(&i2c->dev, rt5640_cancel_work, rt5640); + ret = devm_add_action_or_reset(&i2c->dev, rt5640_disable_irq_and_cancel_work, rt5640); if (ret) return ret; -- cgit From d48696b915527b5bcdd207a299aec03fb037eb17 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Thu, 24 Oct 2024 23:16:14 +0200 Subject: ASoC: Intel: bytcr_rt5640: Add support for non ACPI instantiated codec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On some x86 Bay Trail tablets which shipped with Android as factory OS, the DSDT is so broken that the codec needs to be manually instantatiated by the special x86-android-tablets.ko "fixup" driver for cases like this. This means that the codec-dev cannot be retrieved through its ACPI fwnode, add support to the bytcr_rt5640 machine driver for such manually instantiated rt5640 i2c_clients. An example of a tablet which needs this is the Vexia EDU ATLA 10 tablet, which has been distributed to schools in the Spanish Andalucía region. Signed-off-by: Hans de Goede Link: https://patch.msgid.link/20241024211615.79518-1-hdegoede@redhat.com Signed-off-by: Mark Brown --- sound/soc/intel/boards/bytcr_rt5640.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/sound/soc/intel/boards/bytcr_rt5640.c b/sound/soc/intel/boards/bytcr_rt5640.c index 2ed49acb4e36..8dfd91cc3668 100644 --- a/sound/soc/intel/boards/bytcr_rt5640.c +++ b/sound/soc/intel/boards/bytcr_rt5640.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -32,6 +33,8 @@ #include "../atom/sst-atom-controls.h" #include "../common/soc-intel-quirks.h" +#define BYT_RT5640_FALLBACK_CODEC_DEV_NAME "i2c-rt5640" + enum { BYT_RT5640_DMIC1_MAP, BYT_RT5640_DMIC2_MAP, @@ -1698,9 +1701,33 @@ static int snd_byt_rt5640_mc_probe(struct platform_device *pdev) codec_dev = acpi_get_first_physical_node(adev); acpi_dev_put(adev); - if (!codec_dev) - return -EPROBE_DEFER; - priv->codec_dev = get_device(codec_dev); + + if (codec_dev) { + priv->codec_dev = get_device(codec_dev); + } else { + /* + * Special case for Android tablets where the codec i2c_client + * has been manually instantiated by x86_android_tablets.ko due + * to a broken DSDT. 
+ */ + codec_dev = bus_find_device_by_name(&i2c_bus_type, NULL, + BYT_RT5640_FALLBACK_CODEC_DEV_NAME); + if (!codec_dev) + return -EPROBE_DEFER; + + if (!i2c_verify_client(codec_dev)) { + dev_err(dev, "Error '%s' is not an i2c_client\n", + BYT_RT5640_FALLBACK_CODEC_DEV_NAME); + put_device(codec_dev); + } + + /* fixup codec name */ + strscpy(byt_rt5640_codec_name, BYT_RT5640_FALLBACK_CODEC_DEV_NAME, + sizeof(byt_rt5640_codec_name)); + + /* bus_find_device() returns a reference no need to get() */ + priv->codec_dev = codec_dev; + } /* * swap SSP0 if bytcr is detected -- cgit From 0107f28f135231da22a9ad5756bb16bd5cada4d5 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Thu, 24 Oct 2024 23:16:15 +0200 Subject: ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet The Vexia Edu Atla 10 tablet mostly uses the BYTCR tablet defaults, but as happens on more models it is using IN1 instead of IN3 for its internal mic and JD_SRC_JD2_IN4N instead of JD_SRC_JD1_IN4P for jack-detection. Add a DMI quirk for this to fix the internal-mic and jack-detection. Signed-off-by: Hans de Goede Link: https://patch.msgid.link/20241024211615.79518-2-hdegoede@redhat.com Signed-off-by: Mark Brown --- sound/soc/intel/boards/bytcr_rt5640.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/sound/soc/intel/boards/bytcr_rt5640.c b/sound/soc/intel/boards/bytcr_rt5640.c index 8dfd91cc3668..54f77f57ec8e 100644 --- a/sound/soc/intel/boards/bytcr_rt5640.c +++ b/sound/soc/intel/boards/bytcr_rt5640.c @@ -1132,6 +1132,21 @@ static const struct dmi_system_id byt_rt5640_quirk_table[] = { BYT_RT5640_SSP0_AIF2 | BYT_RT5640_MCLK_EN), }, + { /* Vexia Edu Atla 10 tablet */ + .matches = { + DMI_MATCH(DMI_BOARD_VENDOR, "AMI Corporation"), + DMI_MATCH(DMI_BOARD_NAME, "Aptio CRB"), + /* Above strings are too generic, also match on BIOS date */ + DMI_MATCH(DMI_BIOS_DATE, "08/25/2014"), + }, + .driver_data = (void *)(BYT_RT5640_IN1_MAP | + BYT_RT5640_JD_SRC_JD2_IN4N | + BYT_RT5640_OVCD_TH_2000UA | + BYT_RT5640_OVCD_SF_0P75 | + BYT_RT5640_DIFF_MIC | + BYT_RT5640_SSP0_AIF2 | + BYT_RT5640_MCLK_EN), + }, { /* Voyo Winpad A15 */ .matches = { DMI_MATCH(DMI_BOARD_VENDOR, "AMI Corporation"), -- cgit From bffdf9d7e51a7be8eeaac2ccf9e54a5fde01ff65 Mon Sep 17 00:00:00 2001 From: Dmitry Torokhov Date: Fri, 18 Oct 2024 17:17:48 -0700 Subject: Input: edt-ft5x06 - fix regmap leak when probe fails The driver neglects to free the instance of I2C regmap constructed at the beginning of the edt_ft5x06_ts_probe() method when probe fails. Additionally edt_ft5x06_ts_remove() is freeing the regmap too early, before the rest of the device resources that are managed by devm are released. Fix this by installing a custom devm action that will ensure that the regmap is released at the right time during normal teardown as well as in case of probe failure. Note that devm_regmap_init_i2c() could not be used because the driver may replace the original regmap with a regmap specific for M06 devices in the middle of the probe, and using devm_regmap_init_i2c() would result in releasing the M06 regmap too early. 
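The pattern used here is a custom devm action: devm releases resources in reverse registration order, so an action registered right after the regmap is created runs only after every later devm resource has been torn down, on both probe failure and removal. Below is a simplified sketch; the real fix passes the driver data rather than the regmap pointer, since the regmap may be swapped for the M06 variant mid-probe.

#include <linux/device.h>
#include <linux/regmap.h>

static void example_regmap_exit(void *data)
{
        regmap_exit(data);
}

static int example_register_regmap_cleanup(struct device *dev,
                                           struct regmap *map)
{
        /* Runs on probe failure and on remove, after later devm resources. */
        return devm_add_action_or_reset(dev, example_regmap_exit, map);
}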
Reported-by: Li Zetao Fixes: 9dfd9708ffba ("Input: edt-ft5x06 - convert to use regmap API") Cc: stable@vger.kernel.org Reviewed-by: Oliver Graute Link: https://lore.kernel.org/r/ZxL6rIlVlgsAu-Jv@google.com Signed-off-by: Dmitry Torokhov --- drivers/input/touchscreen/edt-ft5x06.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/input/touchscreen/edt-ft5x06.c b/drivers/input/touchscreen/edt-ft5x06.c index e70415f189a5..126b0ed85aa5 100644 --- a/drivers/input/touchscreen/edt-ft5x06.c +++ b/drivers/input/touchscreen/edt-ft5x06.c @@ -1121,6 +1121,14 @@ static void edt_ft5x06_ts_set_regs(struct edt_ft5x06_ts_data *tsdata) } } +static void edt_ft5x06_exit_regmap(void *arg) +{ + struct edt_ft5x06_ts_data *data = arg; + + if (!IS_ERR_OR_NULL(data->regmap)) + regmap_exit(data->regmap); +} + static void edt_ft5x06_disable_regulators(void *arg) { struct edt_ft5x06_ts_data *data = arg; @@ -1154,6 +1162,16 @@ static int edt_ft5x06_ts_probe(struct i2c_client *client) return PTR_ERR(tsdata->regmap); } + /* + * We are not using devm_regmap_init_i2c() and instead install a + * custom action because we may replace regmap with M06-specific one + * and we need to make sure that it will not be released too early. + */ + error = devm_add_action_or_reset(&client->dev, edt_ft5x06_exit_regmap, + tsdata); + if (error) + return error; + chip_data = device_get_match_data(&client->dev); if (!chip_data) chip_data = (const struct edt_i2c_chip_data *)id->driver_data; @@ -1347,7 +1365,6 @@ static void edt_ft5x06_ts_remove(struct i2c_client *client) struct edt_ft5x06_ts_data *tsdata = i2c_get_clientdata(client); edt_ft5x06_ts_teardown_debugfs(tsdata); - regmap_exit(tsdata->regmap); } static int edt_ft5x06_ts_suspend(struct device *dev) -- cgit From 78e7be018784934081afec77f96d49a2483f9188 Mon Sep 17 00:00:00 2001 From: Kailang Yang Date: Fri, 18 Oct 2024 13:53:24 +0800 Subject: ALSA: hda/realtek: Limit internal Mic boost on Dell platform Dell want to limit internal Mic boost on all Dell platform. 
Signed-off-by: Kailang Yang Cc: Link: https://lore.kernel.org/561fc5f5eff04b6cbd79ed173cd1c1db@realtek.com Signed-off-by: Takashi Iwai --- sound/pci/hda/patch_realtek.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c index 3567b14b52b7..784ac058418f 100644 --- a/sound/pci/hda/patch_realtek.c +++ b/sound/pci/hda/patch_realtek.c @@ -7521,6 +7521,7 @@ enum { ALC286_FIXUP_SONY_MIC_NO_PRESENCE, ALC269_FIXUP_PINCFG_NO_HP_TO_LINEOUT, ALC269_FIXUP_DELL1_MIC_NO_PRESENCE, + ALC269_FIXUP_DELL1_LIMIT_INT_MIC_BOOST, ALC269_FIXUP_DELL2_MIC_NO_PRESENCE, ALC269_FIXUP_DELL3_MIC_NO_PRESENCE, ALC269_FIXUP_DELL4_MIC_NO_PRESENCE, @@ -7555,6 +7556,7 @@ enum { ALC255_FIXUP_ACER_MIC_NO_PRESENCE, ALC255_FIXUP_ASUS_MIC_NO_PRESENCE, ALC255_FIXUP_DELL1_MIC_NO_PRESENCE, + ALC255_FIXUP_DELL1_LIMIT_INT_MIC_BOOST, ALC255_FIXUP_DELL2_MIC_NO_PRESENCE, ALC255_FIXUP_HEADSET_MODE, ALC255_FIXUP_HEADSET_MODE_NO_HP_MIC, @@ -8114,6 +8116,12 @@ static const struct hda_fixup alc269_fixups[] = { .chained = true, .chain_id = ALC269_FIXUP_HEADSET_MODE }, + [ALC269_FIXUP_DELL1_LIMIT_INT_MIC_BOOST] = { + .type = HDA_FIXUP_FUNC, + .v.func = alc269_fixup_limit_int_mic_boost, + .chained = true, + .chain_id = ALC269_FIXUP_DELL1_MIC_NO_PRESENCE + }, [ALC269_FIXUP_DELL2_MIC_NO_PRESENCE] = { .type = HDA_FIXUP_PINS, .v.pins = (const struct hda_pintbl[]) { @@ -8394,6 +8402,12 @@ static const struct hda_fixup alc269_fixups[] = { .chained = true, .chain_id = ALC255_FIXUP_HEADSET_MODE }, + [ALC255_FIXUP_DELL1_LIMIT_INT_MIC_BOOST] = { + .type = HDA_FIXUP_FUNC, + .v.func = alc269_fixup_limit_int_mic_boost, + .chained = true, + .chain_id = ALC255_FIXUP_DELL1_MIC_NO_PRESENCE + }, [ALC255_FIXUP_DELL2_MIC_NO_PRESENCE] = { .type = HDA_FIXUP_PINS, .v.pins = (const struct hda_pintbl[]) { @@ -11076,6 +11090,7 @@ static const struct hda_model_fixup alc269_fixup_models[] = { {.id = ALC269_FIXUP_DELL2_MIC_NO_PRESENCE, .name = "dell-headset-dock"}, {.id = ALC269_FIXUP_DELL3_MIC_NO_PRESENCE, .name = "dell-headset3"}, {.id = ALC269_FIXUP_DELL4_MIC_NO_PRESENCE, .name = "dell-headset4"}, + {.id = ALC269_FIXUP_DELL4_MIC_NO_PRESENCE_QUIET, .name = "dell-headset4-quiet"}, {.id = ALC283_FIXUP_CHROME_BOOK, .name = "alc283-dac-wcaps"}, {.id = ALC283_FIXUP_SENSE_COMBO_JACK, .name = "alc283-sense-combo"}, {.id = ALC292_FIXUP_TPT440_DOCK, .name = "tpt440-dock"}, @@ -11630,16 +11645,16 @@ static const struct snd_hda_pin_quirk alc269_fallback_pin_fixup_tbl[] = { SND_HDA_PIN_QUIRK(0x10ec0289, 0x1028, "Dell", ALC269_FIXUP_DELL4_MIC_NO_PRESENCE, {0x19, 0x40000000}, {0x1b, 0x40000000}), - SND_HDA_PIN_QUIRK(0x10ec0295, 0x1028, "Dell", ALC269_FIXUP_DELL4_MIC_NO_PRESENCE, + SND_HDA_PIN_QUIRK(0x10ec0295, 0x1028, "Dell", ALC269_FIXUP_DELL4_MIC_NO_PRESENCE_QUIET, {0x19, 0x40000000}, {0x1b, 0x40000000}), SND_HDA_PIN_QUIRK(0x10ec0256, 0x1028, "Dell", ALC255_FIXUP_DELL1_MIC_NO_PRESENCE, {0x19, 0x40000000}, {0x1a, 0x40000000}), - SND_HDA_PIN_QUIRK(0x10ec0236, 0x1028, "Dell", ALC255_FIXUP_DELL1_MIC_NO_PRESENCE, + SND_HDA_PIN_QUIRK(0x10ec0236, 0x1028, "Dell", ALC255_FIXUP_DELL1_LIMIT_INT_MIC_BOOST, {0x19, 0x40000000}, {0x1a, 0x40000000}), - SND_HDA_PIN_QUIRK(0x10ec0274, 0x1028, "Dell", ALC274_FIXUP_DELL_AIO_LINEOUT_VERB, + SND_HDA_PIN_QUIRK(0x10ec0274, 0x1028, "Dell", ALC269_FIXUP_DELL1_LIMIT_INT_MIC_BOOST, {0x19, 0x40000000}, {0x1a, 0x40000000}), SND_HDA_PIN_QUIRK(0x10ec0256, 0x1043, "ASUS", ALC2XX_FIXUP_HEADSET_MIC, -- cgit From 6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9 Mon Sep 17 00:00:00 
2001 From: Hans de Goede Date: Fri, 25 Oct 2024 11:02:21 +0200 Subject: ASoC: Intel: sst: Support LPE0F28 ACPI HID MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Some old Bay Trail tablets which shipped with Android as factory OS have the SST/LPE audio engine described by an ACPI device with a HID (Hardware-ID) of LPE0F28 instead of 80860F28. Add support for this. Note this uses a new sst_res_info for just the LPE0F28 case because it has a different layout for the IO-mem ACPI resources then the 80860F28. An example of a tablet which needs this is the Vexia EDU ATLA 10 tablet, which has been distributed to schools in the Spanish Andalucía region. Signed-off-by: Hans de Goede Link: https://patch.msgid.link/20241025090221.52198-1-hdegoede@redhat.com Signed-off-by: Mark Brown --- sound/hda/intel-dsp-config.c | 4 +++ sound/soc/intel/atom/sst/sst_acpi.c | 64 +++++++++++++++++++++++++++++++------ 2 files changed, 59 insertions(+), 9 deletions(-) diff --git a/sound/hda/intel-dsp-config.c b/sound/hda/intel-dsp-config.c index f018bd779862..9f849e05ce79 100644 --- a/sound/hda/intel-dsp-config.c +++ b/sound/hda/intel-dsp-config.c @@ -721,6 +721,10 @@ static const struct config_entry acpi_config_table[] = { #if IS_ENABLED(CONFIG_SND_SST_ATOM_HIFI2_PLATFORM_ACPI) || \ IS_ENABLED(CONFIG_SND_SOC_SOF_BAYTRAIL) /* BayTrail */ + { + .flags = FLAG_SST_OR_SOF_BYT, + .acpi_hid = "LPE0F28", + }, { .flags = FLAG_SST_OR_SOF_BYT, .acpi_hid = "80860F28", diff --git a/sound/soc/intel/atom/sst/sst_acpi.c b/sound/soc/intel/atom/sst/sst_acpi.c index 9956dc63db74..f4c4774249ee 100644 --- a/sound/soc/intel/atom/sst/sst_acpi.c +++ b/sound/soc/intel/atom/sst/sst_acpi.c @@ -125,6 +125,28 @@ static const struct sst_res_info bytcr_res_info = { .acpi_ipc_irq_index = 0 }; +/* For "LPE0F28" ACPI device found on some Android factory OS models */ +static const struct sst_res_info lpe8086_res_info = { + .shim_offset = 0x140000, + .shim_size = 0x000100, + .shim_phy_addr = SST_BYT_SHIM_PHY_ADDR, + .ssp0_offset = 0xa0000, + .ssp0_size = 0x1000, + .dma0_offset = 0x98000, + .dma0_size = 0x4000, + .dma1_offset = 0x9c000, + .dma1_size = 0x4000, + .iram_offset = 0x0c0000, + .iram_size = 0x14000, + .dram_offset = 0x100000, + .dram_size = 0x28000, + .mbox_offset = 0x144000, + .mbox_size = 0x1000, + .acpi_lpe_res_index = 1, + .acpi_ddr_index = 0, + .acpi_ipc_irq_index = 0 +}; + static struct sst_platform_info byt_rvp_platform_data = { .probe_data = &byt_fwparse_info, .ipc_info = &byt_ipc_info, @@ -268,10 +290,38 @@ static int sst_acpi_probe(struct platform_device *pdev) mach->pdata = &chv_platform_data; pdata = mach->pdata; - ret = kstrtouint(id->id, 16, &dev_id); - if (ret < 0) { - dev_err(dev, "Unique device id conversion error: %d\n", ret); - return ret; + if (!strcmp(id->id, "LPE0F28")) { + struct resource *rsrc; + + /* Use regular BYT SST PCI VID:PID */ + dev_id = 0x80860F28; + byt_rvp_platform_data.res_info = &lpe8086_res_info; + + /* + * The "LPE0F28" ACPI device has separate IO-mem resources for: + * DDR, SHIM, MBOX, IRAM, DRAM, CFG + * None of which covers the entire LPE base address range. + * lpe8086_res_info.acpi_lpe_res_index points to the SHIM. + * Patch this to cover the entire base address range as expected + * by sst_platform_get_resources(). 
+ */ + rsrc = platform_get_resource(pdev, IORESOURCE_MEM, + pdata->res_info->acpi_lpe_res_index); + if (!rsrc) { + dev_err(ctx->dev, "Invalid SHIM base\n"); + return -EIO; + } + rsrc->start -= pdata->res_info->shim_offset; + rsrc->end = rsrc->start + 0x200000 - 1; + } else { + ret = kstrtouint(id->id, 16, &dev_id); + if (ret < 0) { + dev_err(dev, "Unique device id conversion error: %d\n", ret); + return ret; + } + + if (soc_intel_is_byt_cr(pdev)) + byt_rvp_platform_data.res_info = &bytcr_res_info; } dev_dbg(dev, "ACPI device id: %x\n", dev_id); @@ -280,11 +330,6 @@ static int sst_acpi_probe(struct platform_device *pdev) if (ret < 0) return ret; - if (soc_intel_is_byt_cr(pdev)) { - /* override resource info */ - byt_rvp_platform_data.res_info = &bytcr_res_info; - } - /* update machine parameters */ mach->mach_params.acpi_ipc_irq_index = pdata->res_info->acpi_ipc_irq_index; @@ -344,6 +389,7 @@ static void sst_acpi_remove(struct platform_device *pdev) } static const struct acpi_device_id sst_acpi_ids[] = { + { "LPE0F28", (unsigned long)&snd_soc_acpi_intel_baytrail_machines}, { "80860F28", (unsigned long)&snd_soc_acpi_intel_baytrail_machines}, { "808622A8", (unsigned long)&snd_soc_acpi_intel_cherrytrail_machines}, { }, -- cgit From 1966db682f064172891275cb951aa8c98a0a809b Mon Sep 17 00:00:00 2001 From: Yunhui Cui Date: Mon, 14 Oct 2024 21:01:41 +0800 Subject: RISC-V: ACPI: fix early_ioremap to early_memremap When SVPBMT is enabled, __acpi_map_table() will directly access the data in DDR through the IO attribute, rather than through hardware cache consistency, resulting in incorrect data in the obtained ACPI table. The log: ACPI: [ACPI:0x18] Invalid zero length. We do not assume whether the bootloader flushes or not. We should access in a cacheable way instead of maintaining cache consistency by software. Fixes: 3b426d4b5b14 ("RISC-V: ACPI : Fix for usage of pointers in different address space") Cc: stable@vger.kernel.org Reviewed-by: Alexandre Ghiti Signed-off-by: Yunhui Cui Reviewed-by: Sunil V L Link: https://lore.kernel.org/r/20241014130141.86426-1-cuiyunhui@bytedance.com Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/acpi.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/riscv/kernel/acpi.c b/arch/riscv/kernel/acpi.c index 6e0d333f57e5..2fd29695a788 100644 --- a/arch/riscv/kernel/acpi.c +++ b/arch/riscv/kernel/acpi.c @@ -210,7 +210,7 @@ void __init __iomem *__acpi_map_table(unsigned long phys, unsigned long size) if (!size) return NULL; - return early_ioremap(phys, size); + return early_memremap(phys, size); } void __init __acpi_unmap_table(void __iomem *map, unsigned long size) @@ -218,7 +218,7 @@ void __init __acpi_unmap_table(void __iomem *map, unsigned long size) if (!map || !size) return; - early_iounmap(map, size); + early_memunmap(map, size); } void __iomem *acpi_os_ioremap(acpi_physical_address phys, acpi_size size) -- cgit From afedc3126e11ff1404b32e538657b68022e933ca Mon Sep 17 00:00:00 2001 From: Alexandre Ghiti Date: Wed, 9 Oct 2024 09:27:49 +0200 Subject: riscv: Do not use fortify in early code Early code designates the code executed when the MMU is not yet enabled, and this comes with some limitations (see Documentation/arch/riscv/boot.rst, section "Pre-MMU execution"). FORTIFY_SOURCE must be disabled then since it can trigger kernel panics as reported in [1]. 
Reported-by: Jason Montleon Closes: https://lore.kernel.org/linux-riscv/CAJD_bPJes4QhmXY5f63GHV9B9HFkSCoaZjk-qCT2NGS7Q9HODg@mail.gmail.com/ [1] Fixes: a35707c3d850 ("riscv: add memory-type errata for T-Head") Fixes: 26e7aacb83df ("riscv: Allow to downgrade paging mode from the command line") Cc: stable@vger.kernel.org Signed-off-by: Alexandre Ghiti Link: https://lore.kernel.org/r/20241009072749.45006-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt --- arch/riscv/errata/Makefile | 6 ++++++ arch/riscv/kernel/Makefile | 5 +++++ arch/riscv/kernel/pi/Makefile | 6 +++++- 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/riscv/errata/Makefile b/arch/riscv/errata/Makefile index 8a2739485123..f0da9d7b39c3 100644 --- a/arch/riscv/errata/Makefile +++ b/arch/riscv/errata/Makefile @@ -2,6 +2,12 @@ ifdef CONFIG_RELOCATABLE KBUILD_CFLAGS += -fno-pie endif +ifdef CONFIG_RISCV_ALTERNATIVE_EARLY +ifdef CONFIG_FORTIFY_SOURCE +KBUILD_CFLAGS += -D__NO_FORTIFY +endif +endif + obj-$(CONFIG_ERRATA_ANDES) += andes/ obj-$(CONFIG_ERRATA_SIFIVE) += sifive/ obj-$(CONFIG_ERRATA_THEAD) += thead/ diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 7f88cc4931f5..69dc8aaab3fb 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -36,6 +36,11 @@ KASAN_SANITIZE_alternative.o := n KASAN_SANITIZE_cpufeature.o := n KASAN_SANITIZE_sbi_ecall.o := n endif +ifdef CONFIG_FORTIFY_SOURCE +CFLAGS_alternative.o += -D__NO_FORTIFY +CFLAGS_cpufeature.o += -D__NO_FORTIFY +CFLAGS_sbi_ecall.o += -D__NO_FORTIFY +endif endif extra-y += vmlinux.lds diff --git a/arch/riscv/kernel/pi/Makefile b/arch/riscv/kernel/pi/Makefile index d5bf1bc7de62..81d69d45c06c 100644 --- a/arch/riscv/kernel/pi/Makefile +++ b/arch/riscv/kernel/pi/Makefile @@ -16,8 +16,12 @@ KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_LTO), $(KBUILD_CFLAGS)) KBUILD_CFLAGS += -mcmodel=medany CFLAGS_cmdline_early.o += -D__NO_FORTIFY -CFLAGS_lib-fdt_ro.o += -D__NO_FORTIFY CFLAGS_fdt_early.o += -D__NO_FORTIFY +# lib/string.c already defines __NO_FORTIFY +CFLAGS_ctype.o += -D__NO_FORTIFY +CFLAGS_lib-fdt.o += -D__NO_FORTIFY +CFLAGS_lib-fdt_ro.o += -D__NO_FORTIFY +CFLAGS_archrandom_early.o += -D__NO_FORTIFY $(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_ \ --remove-section=.note.gnu.property \ -- cgit From 33549fcf37ec461f398f0a41e1c9948be2e5aca4 Mon Sep 17 00:00:00 2001 From: Conor Dooley Date: Tue, 1 Oct 2024 12:28:13 +0100 Subject: RISC-V: disallow gcc + rust builds During the discussion before supporting rust on riscv, it was decided not to support gcc yet, due to differences in extension handling compared to llvm (only the version of libclang matching the c compiler is supported). Recently Jason Montleon reported [1] that building with gcc caused build issues, due to unsupported arguments being passed to libclang. After some discussion between myself and Miguel, it is better to disable gcc + rust builds to match the original intent, and subsequently support it when an appropriate set of extensions can be deduced from the version of libclang. 
Closes: https://lore.kernel.org/all/20240917000848.720765-2-jmontleo@redhat.com/ [1] Link: https://lore.kernel.org/all/20240926-battering-revolt-6c6a7827413e@spud/ [2] Fixes: 70a57b247251a ("RISC-V: enable building 64-bit kernels with rust support") Cc: stable@vger.kernel.org Reported-by: Jason Montleon Signed-off-by: Conor Dooley Acked-by: Miguel Ojeda Reviewed-by: Nathan Chancellor Link: https://lore.kernel.org/r/20241001-playlist-deceiving-16ece2f440f5@spud Signed-off-by: Palmer Dabbelt --- Documentation/rust/arch-support.rst | 2 +- arch/riscv/Kconfig | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/rust/arch-support.rst b/Documentation/rust/arch-support.rst index 750ff371570a..54be7ddf3e57 100644 --- a/Documentation/rust/arch-support.rst +++ b/Documentation/rust/arch-support.rst @@ -17,7 +17,7 @@ Architecture Level of support Constraints ============= ================ ============================================== ``arm64`` Maintained Little Endian only. ``loongarch`` Maintained \- -``riscv`` Maintained ``riscv64`` only. +``riscv`` Maintained ``riscv64`` and LLVM/Clang only. ``um`` Maintained \- ``x86`` Maintained ``x86_64`` only. ============= ================ ============================================== diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 62545946ecf4..f4c570538d55 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -177,7 +177,7 @@ config RISCV select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_RETHOOK if !XIP_KERNEL select HAVE_RSEQ - select HAVE_RUST if RUSTC_SUPPORTS_RISCV + select HAVE_RUST if RUSTC_SUPPORTS_RISCV && CC_IS_CLANG select HAVE_SAMPLE_FTRACE_DIRECT select HAVE_SAMPLE_FTRACE_DIRECT_MULTI select HAVE_STACKPROTECTOR -- cgit From d41373a4b910961df5a5e3527d7bde6ad45ca438 Mon Sep 17 00:00:00 2001 From: Heinrich Schuchardt Date: Sun, 29 Sep 2024 16:02:33 +0200 Subject: riscv: efi: Set NX compat flag in PE/COFF header The IMAGE_DLLCHARACTERISTICS_NX_COMPAT informs the firmware that the EFI binary does not rely on pages that are both executable and writable. The flag is used by some distro versions of GRUB to decide if the EFI binary may be executed. As the Linux kernel neither has RWX sections nor needs RWX pages for relocation we should set the flag. 
Cc: Ard Biesheuvel Cc: Signed-off-by: Heinrich Schuchardt Reviewed-by: Emil Renner Berthing Fixes: cb7d2dd5612a ("RISC-V: Add PE/COFF header for EFI stub") Acked-by: Ard Biesheuvel Link: https://lore.kernel.org/r/20240929140233.211800-1-heinrich.schuchardt@canonical.com Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/efi-header.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/riscv/kernel/efi-header.S b/arch/riscv/kernel/efi-header.S index 515b2dfbca75..c5f17c2710b5 100644 --- a/arch/riscv/kernel/efi-header.S +++ b/arch/riscv/kernel/efi-header.S @@ -64,7 +64,7 @@ extra_header_fields: .long efi_header_end - _start // SizeOfHeaders .long 0 // CheckSum .short IMAGE_SUBSYSTEM_EFI_APPLICATION // Subsystem - .short 0 // DllCharacteristics + .short IMAGE_DLL_CHARACTERISTICS_NX_COMPAT // DllCharacteristics .quad 0 // SizeOfStackReserve .quad 0 // SizeOfStackCommit .quad 0 // SizeOfHeapReserve -- cgit From 37233169a6ea912020c572f870075a63293b786a Mon Sep 17 00:00:00 2001 From: Miquel Sabaté Solà Date: Fri, 13 Sep 2024 10:00:52 +0200 Subject: riscv: Prevent a bad reference count on CPU nodes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When populating cache leaves we previously fetched the CPU device node at the very beginning. But when ACPI is enabled we go through a specific branch which returns early and does not call 'of_node_put' for the node that was acquired. Since we are not using a CPU device node for the ACPI code anyways, we can simply move the initialization of it just passed the ACPI block, and we are guaranteed to have an 'of_node_put' call for the acquired node. This prevents a bad reference count of the CPU device node. Moreover, the previous function did not check for errors when acquiring the device node, so a return -ENOENT has been added for that case. Signed-off-by: Miquel Sabaté Solà Reviewed-by: Sudeep Holla Reviewed-by: Sunil V L Reviewed-by: Alexandre Ghiti Fixes: 604f32ea6909 ("riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240913080053.36636-1-mikisabate@gmail.com Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/cacheinfo.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/riscv/kernel/cacheinfo.c b/arch/riscv/kernel/cacheinfo.c index b320b1d9aa01..2d40736fc37c 100644 --- a/arch/riscv/kernel/cacheinfo.c +++ b/arch/riscv/kernel/cacheinfo.c @@ -80,8 +80,7 @@ int populate_cache_leaves(unsigned int cpu) { struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu); struct cacheinfo *this_leaf = this_cpu_ci->info_list; - struct device_node *np = of_cpu_device_node_get(cpu); - struct device_node *prev = NULL; + struct device_node *np, *prev; int levels = 1, level = 1; if (!acpi_disabled) { @@ -105,6 +104,10 @@ int populate_cache_leaves(unsigned int cpu) return 0; } + np = of_cpu_device_node_get(cpu); + if (!np) + return -ENOENT; + if (of_property_read_bool(np, "cache-size")) ci_leaf_init(this_leaf++, CACHE_TYPE_UNIFIED, level); if (of_property_read_bool(np, "i-cache-size")) -- cgit From e0872ab72630dada3ae055bfa410bf463ff1d1e0 Mon Sep 17 00:00:00 2001 From: WangYuli Date: Thu, 17 Oct 2024 11:20:10 +0800 Subject: riscv: Use '%u' to format the output of 'cpu' 'cpu' is an unsigned integer, so its conversion specifier should be %u, not %d. Suggested-by: Wentao Guan Suggested-by: Maciej W. 
Rozycki Link: https://lore.kernel.org/all/alpine.DEB.2.21.2409122309090.40372@angie.orcam.me.uk/ Signed-off-by: WangYuli Reviewed-by: Charlie Jenkins Tested-by: Charlie Jenkins Fixes: f1e58583b9c7 ("RISC-V: Support cpu hotplug") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/4C127DEECDA287C8+20241017032010.96772-1-wangyuli@uniontech.com Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/cpu-hotplug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/riscv/kernel/cpu-hotplug.c b/arch/riscv/kernel/cpu-hotplug.c index 28b58fc5ad19..a1e38ecfc8be 100644 --- a/arch/riscv/kernel/cpu-hotplug.c +++ b/arch/riscv/kernel/cpu-hotplug.c @@ -58,7 +58,7 @@ void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu) if (cpu_ops->cpu_is_stopped) ret = cpu_ops->cpu_is_stopped(cpu); if (ret) - pr_warn("CPU%d may not have stopped: %d\n", cpu, ret); + pr_warn("CPU%u may not have stopped: %d\n", cpu, ret); } /* -- cgit From 46d4e5ac6f2f801f97bcd0ec82365969197dc9b1 Mon Sep 17 00:00:00 2001 From: Chunyan Zhang Date: Tue, 8 Oct 2024 17:41:38 +0800 Subject: riscv: Remove unused GENERATING_ASM_OFFSETS The macro is not used in the current version of kernel, it looks like can be removed to avoid a build warning: ../arch/riscv/kernel/asm-offsets.c: At top level: ../arch/riscv/kernel/asm-offsets.c:7: warning: macro "GENERATING_ASM_OFFSETS" is not used [-Wunused-macros] 7 | #define GENERATING_ASM_OFFSETS Fixes: 9639a44394b9 ("RISC-V: Provide a cleaner raw_smp_processor_id()") Cc: stable@vger.kernel.org Reviewed-by: Alexandre Ghiti Tested-by: Alexandre Ghiti Signed-off-by: Chunyan Zhang Link: https://lore.kernel.org/r/20241008094141.549248-2-zhangchunyan@iscas.ac.cn Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/asm-offsets.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c index e94180ba432f..c2f3129a8e5c 100644 --- a/arch/riscv/kernel/asm-offsets.c +++ b/arch/riscv/kernel/asm-offsets.c @@ -4,8 +4,6 @@ * Copyright (C) 2017 SiFive */ -#define GENERATING_ASM_OFFSETS - #include #include #include -- cgit From 164f66de6bb6ef454893f193c898dc8f1da6d18b Mon Sep 17 00:00:00 2001 From: Chunyan Zhang Date: Tue, 8 Oct 2024 17:41:39 +0800 Subject: riscv: Remove duplicated GET_RM The macro GET_RM defined twice in this file, one can be removed. Reviewed-by: Alexandre Ghiti Signed-off-by: Chunyan Zhang Fixes: 956d705dd279 ("riscv: Unaligned load/store handling for M_MODE") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241008094141.549248-3-zhangchunyan@iscas.ac.cn Signed-off-by: Palmer Dabbelt --- arch/riscv/kernel/traps_misaligned.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/riscv/kernel/traps_misaligned.c b/arch/riscv/kernel/traps_misaligned.c index d4fd8af7aaf5..1b9867136b61 100644 --- a/arch/riscv/kernel/traps_misaligned.c +++ b/arch/riscv/kernel/traps_misaligned.c @@ -136,8 +136,6 @@ #define REG_PTR(insn, pos, regs) \ (ulong *)((ulong)(regs) + REG_OFFSET(insn, pos)) -#define GET_RM(insn) (((insn) >> 12) & 7) - #define GET_RS1(insn, regs) (*REG_PTR(insn, SH_RS1, regs)) #define GET_RS2(insn, regs) (*REG_PTR(insn, SH_RS2, regs)) #define GET_RS1S(insn, regs) (*REG_PTR(RVC_RS1S(insn), 0, regs)) -- cgit From 3ed092997a004d68a3a5b0eeb94e71b69839d0f7 Mon Sep 17 00:00:00 2001 From: Emmanuel Grumbach Date: Thu, 10 Oct 2024 14:04:59 +0300 Subject: wifi: iwlwifi: mvm: don't leak a link on AP removal Release the link mapping resource in AP removal. This impacted devices that do not support the MLD API (9260 and down). 
On those devices, we couldn't start the AP again after the AP has been already started and stopped. Fixes: a8b5d4809b50 ("wifi: iwlwifi: mvm: Configure the link mapping for non-MLD FW") Signed-off-by: Emmanuel Grumbach Signed-off-by: Miri Korenblit Link: https://patch.msgid.link/20241010140328.c54c42779882.Ied79e0d6244dc5a372e8b6ffa8ee9c6e1379ec1d@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c b/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c index a327893c6dce..39b815904501 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c @@ -1970,7 +1970,6 @@ static void iwl_mvm_mac_remove_interface(struct ieee80211_hw *hw, mvm->p2p_device_vif = NULL; } - iwl_mvm_unset_link_mapping(mvm, vif, &vif->bss_conf); iwl_mvm_mac_ctxt_remove(mvm, vif); RCU_INIT_POINTER(mvm->vif_id_to_mac[mvmvif->id], NULL); @@ -1979,6 +1978,7 @@ static void iwl_mvm_mac_remove_interface(struct ieee80211_hw *hw, mvm->monitor_on = false; out: + iwl_mvm_unset_link_mapping(mvm, vif, &vif->bss_conf); if (vif->type == NL80211_IFTYPE_AP || vif->type == NL80211_IFTYPE_ADHOC) { iwl_mvm_dealloc_int_sta(mvm, &mvmvif->deflink.mcast_sta); -- cgit From cbe84e9ad5e28ef083beff7f6edf2e623fac09e4 Mon Sep 17 00:00:00 2001 From: Miri Korenblit Date: Thu, 10 Oct 2024 14:05:01 +0300 Subject: wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd iwl_mvm_send_ap_tx_power_constraint_cmd is a no-op if the link is not active (we need to know the band etc.) However, for the station case it will be called just before we set the link to active (by calling iwl_mvm_link_changed with the LINK_CONTEXT_MODIFY_ACTIVE bit set in the 'changed' flags and active = true), so it will end up doing nothing. Fix this by calling iwl_mvm_send_ap_tx_power_constraint_cmd before iwl_mvm_link_changed. Fixes: 6b82f4e119d1 ("wifi: iwlwifi: mvm: handle TPE advertised by AP") Signed-off-by: Miri Korenblit Link: https://patch.msgid.link/20241010140328.5c235fccd3f1.I2d40dea21e5547eba458565edcb4c354d094d82a@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c b/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c index f2378e0fb2fb..bd043db906db 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c @@ -350,11 +350,6 @@ __iwl_mvm_mld_assign_vif_chanctx(struct iwl_mvm *mvm, rcu_read_unlock(); } - if (vif->type == NL80211_IFTYPE_STATION) - iwl_mvm_send_ap_tx_power_constraint_cmd(mvm, vif, - link_conf, - false); - /* then activate */ ret = iwl_mvm_link_changed(mvm, vif, link_conf, LINK_CONTEXT_MODIFY_ACTIVE | @@ -363,6 +358,11 @@ __iwl_mvm_mld_assign_vif_chanctx(struct iwl_mvm *mvm, if (ret) goto out; + if (vif->type == NL80211_IFTYPE_STATION) + iwl_mvm_send_ap_tx_power_constraint_cmd(mvm, vif, + link_conf, + false); + /* * Power state must be updated before quotas, * otherwise fw will complain. -- cgit From 9715246ca0bfc9feaec1b4ff5b3d38de65a7025d Mon Sep 17 00:00:00 2001 From: Daniel Gabay Date: Thu, 10 Oct 2024 14:05:03 +0300 Subject: wifi: iwlwifi: mvm: Use the sync timepoint API in suspend When starting the suspend flow, HOST_D3_START triggers an _async_ firmware dump collection for debugging purposes. 
The async worker may race with suspend flow and fail to get NIC access, resulting in the following warning: "Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)" Fix this by switching to the sync version to ensure the dump completes before proceeding with the suspend flow, avoiding potential race issues. Signed-off-by: Daniel Gabay Signed-off-by: Miri Korenblit Link: https://patch.msgid.link/20241010140328.9aae318cd593.I4b322009f39489c0b1d8893495c887870f73ed9c@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/fw/init.c | 4 +++- drivers/net/wireless/intel/iwlwifi/mvm/d3.c | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/net/wireless/intel/iwlwifi/fw/init.c b/drivers/net/wireless/intel/iwlwifi/fw/init.c index d8b083be5b6b..de87e0e3e072 100644 --- a/drivers/net/wireless/intel/iwlwifi/fw/init.c +++ b/drivers/net/wireless/intel/iwlwifi/fw/init.c @@ -39,10 +39,12 @@ void iwl_fw_runtime_init(struct iwl_fw_runtime *fwrt, struct iwl_trans *trans, } IWL_EXPORT_SYMBOL(iwl_fw_runtime_init); +/* Assumes the appropriate lock is held by the caller */ void iwl_fw_runtime_suspend(struct iwl_fw_runtime *fwrt) { iwl_fw_suspend_timestamp(fwrt); - iwl_dbg_tlv_time_point(fwrt, IWL_FW_INI_TIME_POINT_HOST_D3_START, NULL); + iwl_dbg_tlv_time_point_sync(fwrt, IWL_FW_INI_TIME_POINT_HOST_D3_START, + NULL); } IWL_EXPORT_SYMBOL(iwl_fw_runtime_suspend); diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/d3.c b/drivers/net/wireless/intel/iwlwifi/mvm/d3.c index 49a6aff42376..244ca8cab9d1 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/d3.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/d3.c @@ -1398,7 +1398,9 @@ int iwl_mvm_suspend(struct ieee80211_hw *hw, struct cfg80211_wowlan *wowlan) iwl_mvm_pause_tcm(mvm, true); + mutex_lock(&mvm->mutex); iwl_fw_runtime_suspend(&mvm->fwrt); + mutex_unlock(&mvm->mutex); return __iwl_mvm_suspend(hw, wowlan, false); } -- cgit From 32d95ab330069f9c551b8e99770bb4e799730b55 Mon Sep 17 00:00:00 2001 From: Anjaneyulu Date: Thu, 10 Oct 2024 14:05:04 +0300 Subject: wifi: iwlwifi: mvm: SAR table alignment SAR table format in ACPI and local data base are different, So modified code to read data properly. Signed-off-by: Anjaneyulu Signed-off-by: Miri Korenblit Link: https://patch.msgid.link/20241010140328.f077aced4dee.I4dc618f12d01f7ad19f9f8881f6e09eea77e9a14@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/fw/acpi.c | 96 +++++++++++++++++----------- 1 file changed, 58 insertions(+), 38 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/fw/acpi.c b/drivers/net/wireless/intel/iwlwifi/fw/acpi.c index a7cea0a55b35..0bc32291815e 100644 --- a/drivers/net/wireless/intel/iwlwifi/fw/acpi.c +++ b/drivers/net/wireless/intel/iwlwifi/fw/acpi.c @@ -429,38 +429,28 @@ out_free: return ret; } -static int iwl_acpi_sar_set_profile(union acpi_object *table, - struct iwl_sar_profile *profile, - bool enabled, u8 num_chains, - u8 num_sub_bands) +static int +iwl_acpi_parse_chains_table(union acpi_object *table, + struct iwl_sar_profile_chain *chains, + u8 num_chains, u8 num_sub_bands) { - int i, j, idx = 0; - - /* - * The table from ACPI is flat, but we store it in a - * structured array. 
- */ - for (i = 0; i < BIOS_SAR_MAX_CHAINS_PER_PROFILE; i++) { - for (j = 0; j < BIOS_SAR_MAX_SUB_BANDS_NUM; j++) { + for (u8 chain = 0; chain < num_chains; chain++) { + for (u8 subband = 0; subband < BIOS_SAR_MAX_SUB_BANDS_NUM; + subband++) { /* if we don't have the values, use the default */ - if (i >= num_chains || j >= num_sub_bands) { - profile->chains[i].subbands[j] = 0; + if (subband >= num_sub_bands) { + chains[chain].subbands[subband] = 0; + } else if (table->type != ACPI_TYPE_INTEGER || + table->integer.value > U8_MAX) { + return -EINVAL; } else { - if (table[idx].type != ACPI_TYPE_INTEGER || - table[idx].integer.value > U8_MAX) - return -EINVAL; - - profile->chains[i].subbands[j] = - table[idx].integer.value; - - idx++; + chains[chain].subbands[subband] = + table->integer.value; + table++; } } } - /* Only if all values were valid can the profile be enabled */ - profile->enabled = enabled; - return 0; } @@ -543,9 +533,11 @@ read_table: /* The profile from WRDS is officially profile 1, but goes * into sar_profiles[0] (because we don't have a profile 0). */ - ret = iwl_acpi_sar_set_profile(table, &fwrt->sar_profiles[0], - flags & IWL_SAR_ENABLE_MSK, - num_chains, num_sub_bands); + ret = iwl_acpi_parse_chains_table(table, fwrt->sar_profiles[0].chains, + num_chains, num_sub_bands); + if (!ret && flags & IWL_SAR_ENABLE_MSK) + fwrt->sar_profiles[0].enabled = true; + out_free: kfree(data); return ret; @@ -557,7 +549,7 @@ int iwl_acpi_get_ewrd_table(struct iwl_fw_runtime *fwrt) bool enabled; int i, n_profiles, tbl_rev, pos; int ret = 0; - u8 num_chains, num_sub_bands; + u8 num_sub_bands; data = iwl_acpi_get_object(fwrt->dev, ACPI_EWRD_METHOD); if (IS_ERR(data)) @@ -573,7 +565,6 @@ int iwl_acpi_get_ewrd_table(struct iwl_fw_runtime *fwrt) goto out_free; } - num_chains = ACPI_SAR_NUM_CHAINS_REV2; num_sub_bands = ACPI_SAR_NUM_SUB_BANDS_REV2; goto read_table; @@ -589,7 +580,6 @@ int iwl_acpi_get_ewrd_table(struct iwl_fw_runtime *fwrt) goto out_free; } - num_chains = ACPI_SAR_NUM_CHAINS_REV1; num_sub_bands = ACPI_SAR_NUM_SUB_BANDS_REV1; goto read_table; @@ -605,7 +595,6 @@ int iwl_acpi_get_ewrd_table(struct iwl_fw_runtime *fwrt) goto out_free; } - num_chains = ACPI_SAR_NUM_CHAINS_REV0; num_sub_bands = ACPI_SAR_NUM_SUB_BANDS_REV0; goto read_table; @@ -637,23 +626,54 @@ read_table: /* the tables start at element 3 */ pos = 3; + BUILD_BUG_ON(ACPI_SAR_NUM_CHAINS_REV0 != ACPI_SAR_NUM_CHAINS_REV1); + BUILD_BUG_ON(ACPI_SAR_NUM_CHAINS_REV2 != 2 * ACPI_SAR_NUM_CHAINS_REV0); + + /* parse non-cdb chains for all profiles */ for (i = 0; i < n_profiles; i++) { union acpi_object *table = &wifi_pkg->package.elements[pos]; + /* The EWRD profiles officially go from 2 to 4, but we * save them in sar_profiles[1-3] (because we don't * have profile 0). So in the array we start from 1. 
*/ - ret = iwl_acpi_sar_set_profile(table, - &fwrt->sar_profiles[i + 1], - enabled, num_chains, - num_sub_bands); + ret = iwl_acpi_parse_chains_table(table, + fwrt->sar_profiles[i + 1].chains, + ACPI_SAR_NUM_CHAINS_REV0, + num_sub_bands); if (ret < 0) - break; + goto out_free; /* go to the next table */ - pos += num_chains * num_sub_bands; + pos += ACPI_SAR_NUM_CHAINS_REV0 * num_sub_bands; } + /* non-cdb table revisions */ + if (tbl_rev < 2) + goto set_enabled; + + /* parse cdb chains for all profiles */ + for (i = 0; i < n_profiles; i++) { + struct iwl_sar_profile_chain *chains; + union acpi_object *table; + + table = &wifi_pkg->package.elements[pos]; + chains = &fwrt->sar_profiles[i + 1].chains[ACPI_SAR_NUM_CHAINS_REV0]; + ret = iwl_acpi_parse_chains_table(table, + chains, + ACPI_SAR_NUM_CHAINS_REV0, + num_sub_bands); + if (ret < 0) + goto out_free; + + /* go to the next table */ + pos += ACPI_SAR_NUM_CHAINS_REV0 * num_sub_bands; + } + +set_enabled: + for (i = 0; i < n_profiles; i++) + fwrt->sar_profiles[i + 1].enabled = enabled; + out_free: kfree(data); return ret; -- cgit From 07a6e3b78a65f4b2796a8d0d4adb1a15a81edead Mon Sep 17 00:00:00 2001 From: Daniel Gabay Date: Thu, 10 Oct 2024 14:05:05 +0300 Subject: wifi: iwlwifi: mvm: Fix response handling in iwl_mvm_send_recovery_cmd() 1. The size of the response packet is not validated. 2. The response buffer is not freed. Resolve these issues by switching to iwl_mvm_send_cmd_status(), which handles both size validation and frees the buffer. Fixes: f130bb75d881 ("iwlwifi: add FW recovery flow") Signed-off-by: Daniel Gabay Signed-off-by: Miri Korenblit Link: https://patch.msgid.link/20241010140328.76c73185951e.Id3b6ca82ced2081f5ee4f33c997491d0ebda83f7@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/mvm/fw.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/fw.c b/drivers/net/wireless/intel/iwlwifi/mvm/fw.c index 08546e673cf5..f30b0fc8eca9 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/fw.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/fw.c @@ -1307,8 +1307,8 @@ static void iwl_mvm_disconnect_iterator(void *data, u8 *mac, void iwl_mvm_send_recovery_cmd(struct iwl_mvm *mvm, u32 flags) { u32 error_log_size = mvm->fw->ucode_capa.error_log_size; + u32 status = 0; int ret; - u32 resp; struct iwl_fw_error_recovery_cmd recovery_cmd = { .flags = cpu_to_le32(flags), @@ -1316,7 +1316,6 @@ void iwl_mvm_send_recovery_cmd(struct iwl_mvm *mvm, u32 flags) }; struct iwl_host_cmd host_cmd = { .id = WIDE_ID(SYSTEM_GROUP, FW_ERROR_RECOVERY_CMD), - .flags = CMD_WANT_SKB, .data = {&recovery_cmd, }, .len = {sizeof(recovery_cmd), }, }; @@ -1336,7 +1335,7 @@ void iwl_mvm_send_recovery_cmd(struct iwl_mvm *mvm, u32 flags) recovery_cmd.buf_size = cpu_to_le32(error_log_size); } - ret = iwl_mvm_send_cmd(mvm, &host_cmd); + ret = iwl_mvm_send_cmd_status(mvm, &host_cmd, &status); kfree(mvm->error_recovery_buf); mvm->error_recovery_buf = NULL; @@ -1347,11 +1346,10 @@ void iwl_mvm_send_recovery_cmd(struct iwl_mvm *mvm, u32 flags) /* skb respond is only relevant in ERROR_RECOVERY_UPDATE_DB */ if (flags & ERROR_RECOVERY_UPDATE_DB) { - resp = le32_to_cpu(*(__le32 *)host_cmd.resp_pkt->data); - if (resp) { + if (status) { IWL_ERR(mvm, "Failed to send recovery cmd blob was invalid %d\n", - resp); + status); ieee80211_iterate_interfaces(mvm->hw, 0, iwl_mvm_disconnect_iterator, -- cgit From 734a377e1eacc5153bae0ccd4423365726876e93 Mon Sep 17 00:00:00 2001 From: Emmanuel Grumbach Date: 
Thu, 10 Oct 2024 14:05:06 +0300 Subject: wifi: iwlwifi: mvm: don't add default link in fw restart flow When we add the vif (and its default link) in fw restart we may override the link that already exists. We take care of this but if link 0 is a valid MLO link, then we will re-create a default link on mvmvif->link[0] and we'll loose the real link we had there. In non-MLO, we need to re-create the default link upon the interface creation, this is fine. In MLO, we'll just wait for change_vif_links() to re-build the links. Fixes: bf976c814c86 ("wifi: iwlwifi: mvm: implement link change ops") Signed-off-by: Emmanuel Grumbach Signed-off-by: Miri Korenblit Link: https://patch.msgid.link/20241010140328.385bfea1b2e9.I4a127312285ccb529cc95cc4edf6fbe1e0a136ad@changeid Signed-off-by: Johannes Berg --- .../net/wireless/intel/iwlwifi/mvm/mld-mac80211.c | 24 ++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c b/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c index bd043db906db..e252f0dcea20 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c @@ -41,8 +41,6 @@ static int iwl_mvm_mld_mac_add_interface(struct ieee80211_hw *hw, /* reset deflink MLO parameters */ mvmvif->deflink.fw_link_id = IWL_MVM_FW_LINK_ID_INVALID; mvmvif->deflink.active = 0; - /* the first link always points to the default one */ - mvmvif->link[0] = &mvmvif->deflink; ret = iwl_mvm_mld_mac_ctxt_add(mvm, vif); if (ret) @@ -60,9 +58,19 @@ static int iwl_mvm_mld_mac_add_interface(struct ieee80211_hw *hw, IEEE80211_VIF_SUPPORTS_CQM_RSSI; } - ret = iwl_mvm_add_link(mvm, vif, &vif->bss_conf); - if (ret) - goto out_free_bf; + /* We want link[0] to point to the default link, unless we have MLO and + * in this case this will be modified later by .change_vif_links() + * If we are in the restart flow with an MLD connection, we will wait + * to .change_vif_links() to setup the links. + */ + if (!test_bit(IWL_MVM_STATUS_IN_HW_RESTART, &mvm->status) || + !ieee80211_vif_is_mld(vif)) { + mvmvif->link[0] = &mvmvif->deflink; + + ret = iwl_mvm_add_link(mvm, vif, &vif->bss_conf); + if (ret) + goto out_free_bf; + } /* Save a pointer to p2p device vif, so it can later be used to * update the p2p device MAC when a GO is started/stopped @@ -1194,7 +1202,11 @@ iwl_mvm_mld_change_vif_links(struct ieee80211_hw *hw, mutex_lock(&mvm->mutex); - if (old_links == 0) { + /* If we're in RESTART flow, the default link wasn't added in + * drv_add_interface(), and link[0] doesn't point to it. + */ + if (old_links == 0 && !test_bit(IWL_MVM_STATUS_IN_HW_RESTART, + &mvm->status)) { err = iwl_mvm_disable_link(mvm, vif, &vif->bss_conf); if (err) goto out_err; -- cgit From bfc0ed73e095cc3858d35731f191fa6e3d813262 Mon Sep 17 00:00:00 2001 From: Emmanuel Grumbach Date: Tue, 22 Oct 2024 09:22:11 +0200 Subject: Revert "wifi: iwlwifi: remove retry loops in start" Revert commit dfdfe4be183b ("wifi: iwlwifi: remove retry loops in start"), it turns out that there's an issue with the PNVM load notification from firmware not getting processed, that this patch has been somewhat successfully papering over. Since this is being reported, revert the loop removal for now. We will later at least clean this up to only attempt to retry if there was a timeout, but currently we don't even bubble up the failure reason to the correct layer, only returning NULL. 
Fixes: dfdfe4be183b ("wifi: iwlwifi: remove retry loops in start") Signed-off-by: Emmanuel Grumbach Link: https://patch.msgid.link/20241022092212.4aa82a558a00.Ibdeff9c8f0d608bc97fc42024392ae763b6937b7@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/iwl-drv.c | 28 +++++++++++++++-------- drivers/net/wireless/intel/iwlwifi/iwl-drv.h | 3 +++ drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c | 10 +++++++- 3 files changed, 31 insertions(+), 10 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-drv.c b/drivers/net/wireless/intel/iwlwifi/iwl-drv.c index 2abfc986701f..c620911a1193 100644 --- a/drivers/net/wireless/intel/iwlwifi/iwl-drv.c +++ b/drivers/net/wireless/intel/iwlwifi/iwl-drv.c @@ -1413,25 +1413,35 @@ _iwl_op_mode_start(struct iwl_drv *drv, struct iwlwifi_opmode_table *op) const struct iwl_op_mode_ops *ops = op->ops; struct dentry *dbgfs_dir = NULL; struct iwl_op_mode *op_mode = NULL; + int retry, max_retry = !!iwlwifi_mod_params.fw_restart * IWL_MAX_INIT_RETRY; /* also protects start/stop from racing against each other */ lockdep_assert_held(&iwlwifi_opmode_table_mtx); + for (retry = 0; retry <= max_retry; retry++) { + #ifdef CONFIG_IWLWIFI_DEBUGFS - drv->dbgfs_op_mode = debugfs_create_dir(op->name, - drv->dbgfs_drv); - dbgfs_dir = drv->dbgfs_op_mode; + drv->dbgfs_op_mode = debugfs_create_dir(op->name, + drv->dbgfs_drv); + dbgfs_dir = drv->dbgfs_op_mode; #endif - op_mode = ops->start(drv->trans, drv->trans->cfg, - &drv->fw, dbgfs_dir); - if (op_mode) - return op_mode; + op_mode = ops->start(drv->trans, drv->trans->cfg, + &drv->fw, dbgfs_dir); + + if (op_mode) + return op_mode; + + if (test_bit(STATUS_TRANS_DEAD, &drv->trans->status)) + break; + + IWL_ERR(drv, "retry init count %d\n", retry); #ifdef CONFIG_IWLWIFI_DEBUGFS - debugfs_remove_recursive(drv->dbgfs_op_mode); - drv->dbgfs_op_mode = NULL; + debugfs_remove_recursive(drv->dbgfs_op_mode); + drv->dbgfs_op_mode = NULL; #endif + } return NULL; } diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-drv.h b/drivers/net/wireless/intel/iwlwifi/iwl-drv.h index 1549ff429549..6a1d31892417 100644 --- a/drivers/net/wireless/intel/iwlwifi/iwl-drv.h +++ b/drivers/net/wireless/intel/iwlwifi/iwl-drv.h @@ -98,6 +98,9 @@ void iwl_drv_stop(struct iwl_drv *drv); #define VISIBLE_IF_IWLWIFI_KUNIT static #endif +/* max retry for init flow */ +#define IWL_MAX_INIT_RETRY 2 + #define FW_NAME_PRE_BUFSIZE 64 struct iwl_trans; const char *iwl_drv_get_fwname_pre(struct iwl_trans *trans, char *buf); diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c b/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c index 39b815904501..80b9a115245f 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c @@ -1293,12 +1293,14 @@ int iwl_mvm_mac_start(struct ieee80211_hw *hw) { struct iwl_mvm *mvm = IWL_MAC80211_GET_MVM(hw); int ret; + int retry, max_retry = 0; mutex_lock(&mvm->mutex); /* we are starting the mac not in error flow, and restart is enabled */ if (!test_bit(IWL_MVM_STATUS_HW_RESTART_REQUESTED, &mvm->status) && iwlwifi_mod_params.fw_restart) { + max_retry = IWL_MAX_INIT_RETRY; /* * This will prevent mac80211 recovery flows to trigger during * init failures @@ -1306,7 +1308,13 @@ int iwl_mvm_mac_start(struct ieee80211_hw *hw) set_bit(IWL_MVM_STATUS_STARTING, &mvm->status); } - ret = __iwl_mvm_mac_start(mvm); + for (retry = 0; retry <= max_retry; retry++) { + ret = __iwl_mvm_mac_start(mvm); + if (!ret) + break; + + IWL_ERR(mvm, "mac start retry %d\n", 
retry); + } clear_bit(IWL_MVM_STATUS_STARTING, &mvm->status); mutex_unlock(&mvm->mutex); -- cgit From 9b15c6cf8d2e82c8427cd06f535d8de93b5b995c Mon Sep 17 00:00:00 2001 From: Ben Greear Date: Thu, 10 Oct 2024 13:39:54 -0700 Subject: mac80211: fix user-power when emulating chanctx ieee80211_calc_hw_conf_chan was ignoring the configured user_txpower. If it is set, use it to potentially decrease txpower as requested. Signed-off-by: Ben Greear Link: https://patch.msgid.link/20241010203954.1219686-1-greearb@candelatech.com Signed-off-by: Johannes Berg --- net/mac80211/main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/mac80211/main.c b/net/mac80211/main.c index 89084690350f..ee1211a213d7 100644 --- a/net/mac80211/main.c +++ b/net/mac80211/main.c @@ -167,6 +167,8 @@ static u32 ieee80211_calc_hw_conf_chan(struct ieee80211_local *local, } power = ieee80211_chandef_max_power(&chandef); + if (local->user_power_level != IEEE80211_UNSET_POWER_LEVEL) + power = min(local->user_power_level, power); rcu_read_lock(); list_for_each_entry_rcu(sdata, &local->interfaces, list) { -- cgit From d5fee261dfd9e17b08b1df8471ac5d5736070917 Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Tue, 22 Oct 2024 16:17:42 +0200 Subject: wifi: cfg80211: clear wdev->cqm_config pointer on free When we free wdev->cqm_config when unregistering, we also need to clear out the pointer since the same wdev/netdev may get re-registered in another network namespace, then destroyed later, running this code again, which results in a double-free. Reported-by: syzbot+36218cddfd84b5cc263e@syzkaller.appspotmail.com Fixes: 37c20b2effe9 ("wifi: cfg80211: fix cqm_config access race") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20241022161742.7c34b2037726.I121b9cdb7eb180802eafc90b493522950d57ee18@changeid Signed-off-by: Johannes Berg --- net/wireless/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/wireless/core.c b/net/wireless/core.c index 8331064de9dd..74ca18833df1 100644 --- a/net/wireless/core.c +++ b/net/wireless/core.c @@ -1236,6 +1236,7 @@ static void _cfg80211_unregister_wdev(struct wireless_dev *wdev, /* deleted from the list, so can't be found from nl80211 any more */ cqm_config = rcu_access_pointer(wdev->cqm_config); kfree_rcu(cqm_config, rcu_head); + RCU_INIT_POINTER(wdev->cqm_config, NULL); /* * Ensure that all events have been processed and -- cgit From 7245012f0f496162dd95d888ed2ceb5a35170f1a Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Wed, 23 Oct 2024 09:17:44 +0200 Subject: wifi: iwlwifi: mvm: fix 6 GHz scan construction If more than 255 colocated APs exist for the set of all APs found during 2.4/5 GHz scanning, then the 6 GHz scan construction will loop forever since the loop variable has type u8, which can never reach the number found when that's bigger than 255, and is stored in a u32 variable. Also move it into the loops to have a smaller scope. Using a u32 there is fine, we limit the number of APs in the scan list and each has a limit on the number of RNR entries due to the frame size. With a limit of 1000 scan results, a frame size upper bound of 4096 (really it's more like ~2300) and a TBTT entry size of at least 11, we get an upper bound for the number of ~372k, well in the bounds of a u32. 
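As a standalone illustration (plain userspace C, not driver code, with a made-up count of 300) of why the counter type matters: once the bound exceeds 255, a u8 loop variable wraps before it can ever satisfy the exit condition.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
  	uint32_t n_6ghz_params = 300;	/* more than 255 colocated APs */
  	uint8_t j = 255;		/* largest value a u8 can hold */

  	/* j++ wraps 255 -> 0, so "j < n_6ghz_params" never becomes false */
  	printf("u8 max (%u) < %u: %s\n", j, n_6ghz_params,
  	       j < n_6ghz_params ? "still true, loop cannot terminate" : "false");
  	return 0;
  }

Widening the loop variable to u32, as done below, lets it represent any bound the scan list can produce.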
Cc: stable@vger.kernel.org Fixes: eae94cf82d74 ("iwlwifi: mvm: add support for 6GHz") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219375 Link: https://patch.msgid.link/20241023091744.f4baed5c08a1.I8b417148bbc8c5d11c101e1b8f5bf372e17bf2a7@changeid Signed-off-by: Johannes Berg --- drivers/net/wireless/intel/iwlwifi/mvm/scan.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/scan.c b/drivers/net/wireless/intel/iwlwifi/mvm/scan.c index 3ce9150213a7..ddcbd80a49fb 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/scan.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/scan.c @@ -1774,7 +1774,7 @@ iwl_mvm_umac_scan_cfg_channels_v7_6g(struct iwl_mvm *mvm, &cp->channel_config[ch_cnt]; u32 s_ssid_bitmap = 0, bssid_bitmap = 0, flags = 0; - u8 j, k, n_s_ssids = 0, n_bssids = 0; + u8 k, n_s_ssids = 0, n_bssids = 0; u8 max_s_ssids, max_bssids; bool force_passive = false, found = false, allow_passive = true, unsolicited_probe_on_chan = false, psc_no_listen = false; @@ -1799,7 +1799,7 @@ iwl_mvm_umac_scan_cfg_channels_v7_6g(struct iwl_mvm *mvm, cfg->v5.iter_count = 1; cfg->v5.iter_interval = 0; - for (j = 0; j < params->n_6ghz_params; j++) { + for (u32 j = 0; j < params->n_6ghz_params; j++) { s8 tmp_psd_20; if (!(scan_6ghz_params[j].channel_idx == i)) @@ -1873,7 +1873,7 @@ iwl_mvm_umac_scan_cfg_channels_v7_6g(struct iwl_mvm *mvm, * SSID. * TODO: improve this logic */ - for (j = 0; j < params->n_6ghz_params; j++) { + for (u32 j = 0; j < params->n_6ghz_params; j++) { if (!(scan_6ghz_params[j].channel_idx == i)) continue; -- cgit From 53ab8678e7180834be29cf56cd52825fc3427c02 Mon Sep 17 00:00:00 2001 From: Shiju Jose Date: Mon, 14 Oct 2024 15:30:03 +0100 Subject: cxl/events: Fix Trace DRAM Event Record CXL spec rev 3.0 section 8.2.9.2.1.2 defines the DRAM Event Record. Fix decode memory event type field of DRAM Event Record. For e.g. if value is 0x1 it will be reported as an Invalid Address (General Media Event Record - Memory Event Type) instead of Scrub Media ECC Error (DRAM Event Record - Memory Event Type) and so on. 
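To make the mismatch concrete, here is a minimal standalone sketch (plain C with hypothetical helper names; the numeric values are the ones the two record formats use, matching the macros added below):

  #include <stdio.h>

  static const char *gmer_mem_event_type(unsigned int v)
  {
  	switch (v) {
  	case 0x00: return "ECC Error";
  	case 0x01: return "Invalid Address";
  	case 0x02: return "Data Path Error";
  	default:   return "Unknown";
  	}
  }

  static const char *dram_mem_event_type(unsigned int v)
  {
  	switch (v) {
  	case 0x00: return "ECC Error";
  	case 0x01: return "Scrub Media ECC Error";
  	case 0x02: return "Invalid Address";
  	case 0x03: return "Data Path Error";
  	default:   return "Unknown";
  	}
  }

  int main(void)
  {
  	/* decoding a DRAM record with the General Media table misreports 0x01 */
  	printf("0x01 via General Media table: %s\n", gmer_mem_event_type(0x01));
  	printf("0x01 via DRAM table:          %s\n", dram_mem_event_type(0x01));
  	return 0;
  }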
Fixes: 2d6c1e6d60ba ("cxl/mem: Trace DRAM Event Record") Signed-off-by: Shiju Jose Link: https://patch.msgid.link/20241014143003.1170-1-shiju.jose@huawei.com Signed-off-by: Ira Weiny --- drivers/cxl/core/trace.h | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h index 8672b42ee4d1..8389a94adb1a 100644 --- a/drivers/cxl/core/trace.h +++ b/drivers/cxl/core/trace.h @@ -279,7 +279,7 @@ TRACE_EVENT(cxl_generic_event, #define CXL_GMER_MEM_EVT_TYPE_ECC_ERROR 0x00 #define CXL_GMER_MEM_EVT_TYPE_INV_ADDR 0x01 #define CXL_GMER_MEM_EVT_TYPE_DATA_PATH_ERROR 0x02 -#define show_mem_event_type(type) __print_symbolic(type, \ +#define show_gmer_mem_event_type(type) __print_symbolic(type, \ { CXL_GMER_MEM_EVT_TYPE_ECC_ERROR, "ECC Error" }, \ { CXL_GMER_MEM_EVT_TYPE_INV_ADDR, "Invalid Address" }, \ { CXL_GMER_MEM_EVT_TYPE_DATA_PATH_ERROR, "Data Path Error" } \ @@ -373,7 +373,7 @@ TRACE_EVENT(cxl_general_media, "hpa=%llx region=%s region_uuid=%pUb", __entry->dpa, show_dpa_flags(__entry->dpa_flags), show_event_desc_flags(__entry->descriptor), - show_mem_event_type(__entry->type), + show_gmer_mem_event_type(__entry->type), show_trans_type(__entry->transaction_type), __entry->channel, __entry->rank, __entry->device, __print_hex(__entry->comp_id, CXL_EVENT_GEN_MED_COMP_ID_SIZE), @@ -391,6 +391,17 @@ TRACE_EVENT(cxl_general_media, * DRAM Event Record defines many fields the same as the General Media Event * Record. Reuse those definitions as appropriate. */ +#define CXL_DER_MEM_EVT_TYPE_ECC_ERROR 0x00 +#define CXL_DER_MEM_EVT_TYPE_SCRUB_MEDIA_ECC_ERROR 0x01 +#define CXL_DER_MEM_EVT_TYPE_INV_ADDR 0x02 +#define CXL_DER_MEM_EVT_TYPE_DATA_PATH_ERROR 0x03 +#define show_dram_mem_event_type(type) __print_symbolic(type, \ + { CXL_DER_MEM_EVT_TYPE_ECC_ERROR, "ECC Error" }, \ + { CXL_DER_MEM_EVT_TYPE_SCRUB_MEDIA_ECC_ERROR, "Scrub Media ECC Error" }, \ + { CXL_DER_MEM_EVT_TYPE_INV_ADDR, "Invalid Address" }, \ + { CXL_DER_MEM_EVT_TYPE_DATA_PATH_ERROR, "Data Path Error" } \ +) + #define CXL_DER_VALID_CHANNEL BIT(0) #define CXL_DER_VALID_RANK BIT(1) #define CXL_DER_VALID_NIBBLE BIT(2) @@ -477,7 +488,7 @@ TRACE_EVENT(cxl_dram, "hpa=%llx region=%s region_uuid=%pUb", __entry->dpa, show_dpa_flags(__entry->dpa_flags), show_event_desc_flags(__entry->descriptor), - show_mem_event_type(__entry->type), + show_dram_mem_event_type(__entry->type), show_trans_type(__entry->transaction_type), __entry->channel, __entry->rank, __entry->nibble_mask, __entry->bank_group, __entry->bank, -- cgit From 895669fd0d8c816572ff779979a032d0395a0194 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Fri, 25 Oct 2024 00:40:13 -0500 Subject: scx: Fix exit selftest to use custom DSQ In commit 63fb3ec80516 ("sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()"), we updated the consume path to only accept user DSQs, thus making it invalid to consume SCX_DSQ_GLOBAL. This selftest was doing that, so let's create a custom DSQ and use that instead. 
The test now passes: [root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places OUTPUT: [ 12.387229] sched_ext: BPF scheduler "exit" enabled [ 12.406064] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 12.453325] sched_ext: BPF scheduler "exit" enabled [ 12.474064] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 12.515241] sched_ext: BPF scheduler "exit" enabled [ 12.532064] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 12.592063] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 12.654063] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 12.715062] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== Signed-off-by: David Vernet Signed-off-by: Tejun Heo --- tools/testing/selftests/sched_ext/exit.bpf.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/sched_ext/exit.bpf.c b/tools/testing/selftests/sched_ext/exit.bpf.c index bf79ccd55f8f..d75d4faf07f6 100644 --- a/tools/testing/selftests/sched_ext/exit.bpf.c +++ b/tools/testing/selftests/sched_ext/exit.bpf.c @@ -15,6 +15,8 @@ UEI_DEFINE(uei); #define EXIT_CLEANLY() scx_bpf_exit(exit_point, "%d", exit_point) +#define DSQ_ID 0 + s32 BPF_STRUCT_OPS(exit_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) { @@ -31,7 +33,7 @@ void BPF_STRUCT_OPS(exit_enqueue, struct task_struct *p, u64 enq_flags) if (exit_point == EXIT_ENQUEUE) EXIT_CLEANLY(); - scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + scx_bpf_dispatch(p, DSQ_ID, SCX_SLICE_DFL, enq_flags); } void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p) @@ -39,7 +41,7 @@ void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p) if (exit_point == EXIT_DISPATCH) EXIT_CLEANLY(); - scx_bpf_consume(SCX_DSQ_GLOBAL); + scx_bpf_consume(DSQ_ID); } void BPF_STRUCT_OPS(exit_enable, struct task_struct *p) @@ -67,7 +69,7 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(exit_init) if (exit_point == EXIT_INIT) EXIT_CLEANLY(); - return 0; + return scx_bpf_create_dsq(DSQ_ID, -1); } SEC(".struct_ops.link") -- cgit From a25a83de45b435cf89e55c7fb8733f83c7826004 Mon Sep 17 00:00:00 2001 From: Jeongjun Park Date: Thu, 24 Oct 2024 01:13:45 +0900 Subject: bcachefs: fix null-ptr-deref in have_stripes() c->btree_roots_known[i].b can be NULL. In this case, a NULL pointer dereference occurs, so you need to add code to check the variable. 
Reported-by: syzbot+b468b9fef56949c3b528@syzkaller.appspotmail.com Fixes: 7773df19c35f ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Jeongjun Park Signed-off-by: Kent Overstreet --- fs/bcachefs/sb-downgrade.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c index ae715ff658e8..8767c33c2b51 100644 --- a/fs/bcachefs/sb-downgrade.c +++ b/fs/bcachefs/sb-downgrade.c @@ -143,6 +143,9 @@ UPGRADE_TABLE() static int have_stripes(struct bch_fs *c) { + if (IS_ERR_OR_NULL(c->btree_roots_known[BTREE_ID_stripes].b)) + return 0; + return !btree_node_fake(c->btree_roots_known[BTREE_ID_stripes].b); } -- cgit From 8e910ca20e112d7f06ba3bf631a06ddb5ce14657 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Fri, 25 Oct 2024 13:13:05 -0400 Subject: bcachefs: Fix UAF in bch2_reconstruct_alloc() write_super() -> sb_counters_from_cpu() may reallocate the superblock Reported-by: syzbot+9fc4dac4775d07bcfe34@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet --- fs/bcachefs/recovery.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c index 454b5a32dd7f..0ebc76dd7eb5 100644 --- a/fs/bcachefs/recovery.c +++ b/fs/bcachefs/recovery.c @@ -94,11 +94,10 @@ static void bch2_reconstruct_alloc(struct bch_fs *c) __set_bit_le64(BCH_FSCK_ERR_accounting_mismatch, ext->errors_silent); c->sb.compat &= ~(1ULL << BCH_COMPAT_alloc_info); - bch2_write_super(c); - mutex_unlock(&c->sb_lock); - c->opts.recovery_passes |= bch2_recovery_passes_from_stable(le64_to_cpu(ext->recovery_passes_required[0])); + bch2_write_super(c); + mutex_unlock(&c->sb_lock); bch2_shoot_down_journal_keys(c, BTREE_ID_alloc, 0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX); -- cgit From d28d17a845600dd9f7de241de9b1528a1b138716 Mon Sep 17 00:00:00 2001 From: John Garry Date: Fri, 18 Oct 2024 10:16:55 +0000 Subject: scsi: scsi_debug: Fix do_device_access() handling of unexpected SG copy length If the sg_copy_buffer() call returns less than sdebug_sector_size, then we drop out of the copy loop. However, we still report that we copied the full expected amount, which is not proper. Fix by keeping a running total and return that value. Fixes: 84f3a3c01d70 ("scsi: scsi_debug: Atomic write support") Reported-by: Colin Ian King Suggested-by: Dan Carpenter Signed-off-by: John Garry Link: https://lore.kernel.org/r/20241018101655.4207-1-john.g.garry@oracle.com Reviewed-by: Dan Carpenter Reviewed-by: Colin Ian King Signed-off-by: Martin K. 
Petersen --- drivers/scsi/scsi_debug.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index d95f417e24c0..9be2a6a00530 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -3651,7 +3651,7 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp, enum dma_data_direction dir; struct scsi_data_buffer *sdb = &scp->sdb; u8 *fsp; - int i; + int i, total = 0; /* * Even though reads are inherently atomic (in this driver), we expect @@ -3688,18 +3688,16 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp, fsp + (block * sdebug_sector_size), sdebug_sector_size, sg_skip, do_write); sdeb_data_sector_unlock(sip, do_write); - if (ret != sdebug_sector_size) { - ret += (i * sdebug_sector_size); + total += ret; + if (ret != sdebug_sector_size) break; - } sg_skip += sdebug_sector_size; if (++block >= sdebug_store_sectors) block = 0; } - ret = num * sdebug_sector_size; sdeb_data_unlock(sip, atomic); - return ret; + return total; } /* Returns number of bytes copied or -1 if error. */ -- cgit From cb7e509c4e0197f63717fee54fb41c4990ba8d3a Mon Sep 17 00:00:00 2001 From: Peter Wang Date: Thu, 24 Oct 2024 09:54:53 +0800 Subject: scsi: ufs: core: Fix another deadlock during RTC update If ufshcd_rtc_work calls ufshcd_rpm_put_sync() and the pm's usage_count is 0, we will enter the runtime suspend callback. However, the runtime suspend callback will wait to flush ufshcd_rtc_work, causing a deadlock. Replace ufshcd_rpm_put_sync() with ufshcd_rpm_put() to avoid the deadlock. Fixes: 6bf999e0eb41 ("scsi: ufs: core: Add UFS RTC support") Cc: stable@vger.kernel.org #6.11.x Signed-off-by: Peter Wang Link: https://lore.kernel.org/r/20241024015453.21684-1-peter.wang@mediatek.com Reviewed-by: Bart Van Assche Signed-off-by: Martin K. Petersen --- drivers/ufs/core/ufshcd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index 706dc81eb924..0e22bbb78239 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -8219,7 +8219,7 @@ static void ufshcd_update_rtc(struct ufs_hba *hba) err = ufshcd_query_attr(hba, UPIU_QUERY_OPCODE_WRITE_ATTR, QUERY_ATTR_IDN_SECONDS_PASSED, 0, 0, &val); - ufshcd_rpm_put_sync(hba); + ufshcd_rpm_put(hba); if (err) dev_err(hba->dev, "%s: Failed to update rtc %d\n", __func__, err); -- cgit From 6575b268157f37929948a8d1f3bafb3d7c055bc1 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 25 Oct 2024 12:32:55 -0700 Subject: cxl/port: Fix CXL port initialization order when the subsystem is built-in When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach. If cxl_acpi wins the race, cxl_mem will find the enabled CXL root ports it needs. If cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That flow only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port driver is registered before cxl_acpi_probe() runs. Fix up the order to prevent initialization failures. 
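As background, a minimal hypothetical sketch (not CXL code, built-in only) of the two ordering rules the fix relies on: initcalls at the same level run in object link (Makefile) order, and for built-in code module_init() becomes device_initcall(), which runs after every subsys_initcall().

  #include <linux/init.h>
  #include <linux/module.h>
  #include <linux/printk.h>

  /* linked earlier in the Makefile, so it runs first within its level */
  static int __init port_like_init(void)
  {
  	pr_info("port driver registered\n");
  	return 0;
  }
  subsys_initcall(port_like_init);

  /* same level, linked later: can rely on the port driver being ready */
  static int __init root_like_init(void)
  {
  	pr_info("platform root probes with ports enabled\n");
  	return 0;
  }
  subsys_initcall(root_like_init);

  /* device_initcall level when built-in: runs after the above */
  static int __init endpoint_like_init(void)
  {
  	pr_info("endpoint driver attaches\n");
  	return 0;
  }
  module_init(endpoint_like_init);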
Ensure that cxl_port is built-in when cxl_acpi is also built-in, arrange for Makefile order to resolve the subsys_initcall() order of cxl_port and cxl_acpi, and arrange for Makefile order to resolve the device_initcall() (module_init()) order of the remaining objects. As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case. Reported-by: Gregory Price Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net Tested-by: Gregory Price Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Cc: stable@vger.kernel.org Cc: Davidlohr Bueso Cc: Jonathan Cameron Cc: Dave Jiang Cc: Alison Schofield Cc: Vishal Verma Cc: Ira Weiny Reviewed-by: Jonathan Cameron Reviewed-by: Ira Weiny Tested-by: Alejandro Lucero Reviewed-by: Alejandro Lucero Signed-off-by: Dan Williams Link: https://patch.msgid.link/172988474904.476062.7961350937442459266.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny --- drivers/cxl/Kconfig | 1 + drivers/cxl/Makefile | 20 ++++++++++++++------ drivers/cxl/port.c | 17 ++++++++++++++++- 3 files changed, 31 insertions(+), 7 deletions(-) diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig index 29c192f20082..876469e23f7a 100644 --- a/drivers/cxl/Kconfig +++ b/drivers/cxl/Kconfig @@ -60,6 +60,7 @@ config CXL_ACPI default CXL_BUS select ACPI_TABLE_LIB select ACPI_HMAT + select CXL_PORT help Enable support for host managed device memory (HDM) resources published by a platform's ACPI CXL memory layout description. See diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile index db321f48ba52..2caa90fa4bf2 100644 --- a/drivers/cxl/Makefile +++ b/drivers/cxl/Makefile @@ -1,13 +1,21 @@ # SPDX-License-Identifier: GPL-2.0 + +# Order is important here for the built-in case: +# - 'core' first for fundamental init +# - 'port' before platform root drivers like 'acpi' so that CXL-root ports +# are immediately enabled +# - 'mem' and 'pmem' before endpoint drivers so that memdevs are +# immediately enabled +# - 'pci' last, also mirrors the hardware enumeration hierarchy obj-y += core/ -obj-$(CONFIG_CXL_PCI) += cxl_pci.o -obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PORT) += cxl_port.o obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o -obj-$(CONFIG_CXL_PORT) += cxl_port.o +obj-$(CONFIG_CXL_MEM) += cxl_mem.o +obj-$(CONFIG_CXL_PCI) += cxl_pci.o -cxl_mem-y := mem.o -cxl_pci-y := pci.o +cxl_port-y := port.o cxl_acpi-y := acpi.o cxl_pmem-y := pmem.o security.o -cxl_port-y := port.o +cxl_mem-y := mem.o +cxl_pci-y := pci.o diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c index 861dde65768f..9dc394295e1f 100644 --- a/drivers/cxl/port.c +++ b/drivers/cxl/port.c @@ -208,7 +208,22 @@ static struct cxl_driver cxl_port_driver = { }, }; -module_cxl_driver(cxl_port_driver); +static int __init cxl_port_init(void) +{ + return cxl_driver_register(&cxl_port_driver); +} +/* + * Be ready to immediately enable ports emitted by the platform CXL root + * (e.g. cxl_acpi) when CONFIG_CXL_PORT=y. 
+ */ +subsys_initcall(cxl_port_init); + +static void __exit cxl_port_exit(void) +{ + cxl_driver_unregister(&cxl_port_driver); +} +module_exit(cxl_port_exit); + MODULE_DESCRIPTION("CXL: Port enumeration and services"); MODULE_LICENSE("GPL v2"); MODULE_IMPORT_NS(CXL); -- cgit From 3d6ebf16438de5d712030fefbb4182b46373d677 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 22 Oct 2024 18:43:32 -0700 Subject: cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices() It turns out since its original introduction, pre-2.6.12, bus_rescan_devices() has skipped devices that might be in the process of attaching or detaching from their driver. For CXL this behavior is unwanted and expects that cxl_bus_rescan() is a probe barrier. That behavior is simple enough to achieve with bus_for_each_dev() paired with call to device_attach(), and it is unclear why bus_rescan_devices() took the position of lockless consumption of dev->driver which is racy. The "Fixes:" but no "Cc: stable" on this patch reflects that the issue is merely by inspection since the bug that triggered the discovery of this potential problem [1] is fixed by other means. However, a stable backport should do no harm. Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Signed-off-by: Dan Williams Tested-by: Gregory Price Reviewed-by: Jonathan Cameron Reviewed-by: Ira Weiny Link: https://patch.msgid.link/172964781104.81806.4277549800082443769.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny --- drivers/cxl/core/port.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c index e666ec6a9085..af92c67bc954 100644 --- a/drivers/cxl/core/port.c +++ b/drivers/cxl/core/port.c @@ -2084,11 +2084,18 @@ static void cxl_bus_remove(struct device *dev) static struct workqueue_struct *cxl_bus_wq; -static void cxl_bus_rescan_queue(struct work_struct *w) +static int cxl_rescan_attach(struct device *dev, void *data) { - int rc = bus_rescan_devices(&cxl_bus_type); + int rc = device_attach(dev); + + dev_vdbg(dev, "rescan: %s\n", rc ? "attach" : "detached"); - pr_debug("CXL bus rescan result: %d\n", rc); + return 0; +} + +static void cxl_bus_rescan_queue(struct work_struct *w) +{ + bus_for_each_dev(&cxl_bus_type, NULL, NULL, cxl_rescan_attach); } void cxl_bus_rescan(void) -- cgit From 48f62d38a07d464a499fa834638afcfd2b68f852 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 22 Oct 2024 18:43:40 -0700 Subject: cxl/acpi: Ensure ports ready at cxl_acpi_probe() return In order to ensure root CXL ports are enabled upon cxl_acpi_probe() when the 'cxl_port' driver is built as a module, arrange for the module to be pre-loaded or built-in. The "Fixes:" but no "Cc: stable" on this patch reflects that the issue is merely by inspection since the bug that triggered the discovery of this potential problem [1] is fixed by other means. However, a stable backport should do no harm. 
Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Signed-off-by: Dan Williams Tested-by: Gregory Price Reviewed-by: Jonathan Cameron Reviewed-by: Ira Weiny Link: https://patch.msgid.link/172964781969.81806.17276352414854540808.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny --- drivers/cxl/acpi.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c index 82b78e331d8e..432b7cfd12a8 100644 --- a/drivers/cxl/acpi.c +++ b/drivers/cxl/acpi.c @@ -924,6 +924,13 @@ static void __exit cxl_acpi_exit(void) /* load before dax_hmem sees 'Soft Reserved' CXL ranges */ subsys_initcall(cxl_acpi_init); + +/* + * Arrange for host-bridge ports to be active synchronous with + * cxl_acpi_probe() exit. + */ +MODULE_SOFTDEP("pre: cxl_port"); + module_exit(cxl_acpi_exit); MODULE_DESCRIPTION("CXL ACPI: Platform Support"); MODULE_LICENSE("GPL v2"); -- cgit From 101c268bd2f37e965a5468353e62d154db38838e Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 22 Oct 2024 18:43:49 -0700 Subject: cxl/port: Fix use-after-free, permit out-of-order decoder shutdown In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature: cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0: 3) cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core] [..] Call Trace: cxl_region_decode_reset+0x69/0x190 [cxl_core] cxl_region_detach+0xe8/0x210 [cxl_core] cxl_decoder_kill_region+0x27/0x40 [cxl_core] cxld_unregister+0x5d/0x60 [cxl_core] At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted. The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them. 
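A standalone toy model (plain C, not driver code) of the commit_end bookkeeping this implies: disabling the highest committed decoder walks commit_end back past any lower decoders that were already disabled out of order, so nothing is left dangling.

  #include <stdbool.h>
  #include <stdio.h>

  #define NR_DECODERS 3

  static bool enabled[NR_DECODERS];
  static int commit_end = -1;	/* highest committed decoder id */

  static void disable_decoder(int id)
  {
  	enabled[id] = false;
  	if (id != commit_end)	/* out of order: just disable it */
  		return;
  	/* reap: also step past earlier out-of-order disables */
  	while (commit_end >= 0 && !enabled[commit_end])
  		commit_end--;
  }

  int main(void)
  {
  	enabled[0] = enabled[1] = enabled[2] = true;
  	commit_end = 2;

  	disable_decoder(1);			/* out of order */
  	printf("commit_end=%d\n", commit_end);	/* still 2 */
  	disable_decoder(2);
  	printf("commit_end=%d\n", commit_end);	/* reaped down to 0 */
  	return 0;
  }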
In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction. A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously. Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Cc: stable@vger.kernel.org Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware") Reviewed-by: Jonathan Cameron Cc: Greg Kroah-Hartman Cc: Davidlohr Bueso Cc: Dave Jiang Cc: Alison Schofield Cc: Ira Weiny Cc: Zijun Hu Signed-off-by: Dan Williams Reviewed-by: Ira Weiny Link: https://patch.msgid.link/172964782781.81806.17902885593105284330.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny --- drivers/base/core.c | 35 +++++++++++++++++++++++++++++++ drivers/cxl/core/hdm.c | 50 +++++++++++++++++++++++++++++++++++++------- drivers/cxl/core/region.c | 48 ++++++++++++------------------------------ drivers/cxl/cxl.h | 3 ++- include/linux/device.h | 3 +++ tools/testing/cxl/test/cxl.c | 14 +++++-------- 6 files changed, 100 insertions(+), 53 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index a4c853411a6b..e42f1ad73078 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data, } EXPORT_SYMBOL_GPL(device_for_each_child_reverse); +/** + * device_for_each_child_reverse_from - device child iterator in reversed order. + * @parent: parent struct device. + * @from: optional starting point in child list + * @fn: function to be called for each device. + * @data: data for the callback. + * + * Iterate over @parent's child devices, starting at @from, and call @fn + * for each, passing it @data. This helper is identical to + * device_for_each_child_reverse() when @from is NULL. + * + * @fn is checked each iteration. If it returns anything other than 0, + * iteration stop and that value is returned to the caller of + * device_for_each_child_reverse_from(); + */ +int device_for_each_child_reverse_from(struct device *parent, + struct device *from, const void *data, + int (*fn)(struct device *, const void *)) +{ + struct klist_iter i; + struct device *child; + int error = 0; + + if (!parent->p) + return 0; + + klist_iter_init_node(&parent->p->klist_children, &i, + (from ? &from->p->knode_parent : NULL)); + while ((child = prev_device(&i)) && !error) + error = fn(child, data); + klist_iter_exit(&i); + return error; +} +EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from); + /** * device_find_child - device iterator for locating a particular device. 
* @parent: parent struct device diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c index 3df10517a327..223c273c0cd1 100644 --- a/drivers/cxl/core/hdm.c +++ b/drivers/cxl/core/hdm.c @@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld) return 0; } -static int cxl_decoder_reset(struct cxl_decoder *cxld) +static int commit_reap(struct device *dev, const void *data) +{ + struct cxl_port *port = to_cxl_port(dev->parent); + struct cxl_decoder *cxld; + + if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev)) + return 0; + + cxld = to_cxl_decoder(dev); + if (port->commit_end == cxld->id && + ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) { + port->commit_end--; + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", + dev_name(&cxld->dev), port->commit_end); + } + + return 0; +} + +void cxl_port_commit_reap(struct cxl_decoder *cxld) +{ + struct cxl_port *port = to_cxl_port(cxld->dev.parent); + + lockdep_assert_held_write(&cxl_region_rwsem); + + /* + * Once the highest committed decoder is disabled, free any other + * decoders that were pinned allocated by out-of-order release. + */ + port->commit_end--; + dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev), + port->commit_end); + device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL, + commit_reap); +} +EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL); + +static void cxl_decoder_reset(struct cxl_decoder *cxld) { struct cxl_port *port = to_cxl_port(cxld->dev.parent); struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev); @@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) u32 ctrl; if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) - return 0; + return; - if (port->commit_end != id) { + if (port->commit_end == id) + cxl_port_commit_reap(cxld); + else dev_dbg(&port->dev, "%s: out of order reset, expected decoder%d.%d\n", dev_name(&cxld->dev), port->id, port->commit_end); - return -EBUSY; - } down_read(&cxl_dpa_rwsem); ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id)); @@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id)); up_read(&cxl_dpa_rwsem); - port->commit_end--; cxld->flags &= ~CXL_DECODER_F_ENABLE; /* Userspace is now responsible for reconfiguring this decoder */ @@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld) cxled = to_cxl_endpoint_decoder(&cxld->dev); cxled->state = CXL_DECODER_STATE_MANUAL; } - - return 0; } static int cxl_setup_hdm_decoder_from_dvsec( diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index e701e4b04032..3478d2058303 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) "Bypassing cpu_cache_invalidate_memregion() for testing!\n"); return 0; } else { - dev_err(&cxlr->dev, - "Failed to synchronize CPU cache state\n"); + dev_WARN(&cxlr->dev, + "Failed to synchronize CPU cache state\n"); return -ENXIO; } } @@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr) return 0; } -static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) +static void cxl_region_decode_reset(struct cxl_region *cxlr, int count) { struct cxl_region_params *p = &cxlr->params; - int i, rc = 0; + int i; /* - * Before region teardown attempt to flush, and if the flush - * fails cancel the region teardown for data consistency - * concerns + * Before region teardown attempt to flush, evict any data cached for + * 
this region, or scream loudly about missing arch / platform support + * for CXL teardown. */ - rc = cxl_region_invalidate_memregion(cxlr); - if (rc) - return rc; + cxl_region_invalidate_memregion(cxlr); for (i = count - 1; i >= 0; i--) { struct cxl_endpoint_decoder *cxled = p->targets[i]; @@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count) cxl_rr = cxl_rr_load(iter, cxlr); cxld = cxl_rr->decoder; if (cxld->reset) - rc = cxld->reset(cxld); - if (rc) - return rc; + cxld->reset(cxld); set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); } endpoint_reset: - rc = cxled->cxld.reset(&cxled->cxld); - if (rc) - return rc; + cxled->cxld.reset(&cxled->cxld); set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); } /* all decoders associated with this region have been torn down */ clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags); - - return 0; } static int commit_decoder(struct cxl_decoder *cxld) @@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr, * still pending. */ if (p->state == CXL_CONFIG_RESET_PENDING) { - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); - /* - * Revert to committed since there may still be active - * decoders associated with this region, or move forward - * to active to mark the reset successful - */ - if (rc) - p->state = CXL_CONFIG_COMMIT; - else - p->state = CXL_CONFIG_ACTIVE; + cxl_region_decode_reset(cxlr, p->interleave_ways); + p->state = CXL_CONFIG_ACTIVE; } } @@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled) get_device(&cxlr->dev); if (p->state > CXL_CONFIG_ACTIVE) { - /* - * TODO: tear down all impacted regions if a device is - * removed out of order - */ - rc = cxl_region_decode_reset(cxlr, p->interleave_ways); - if (rc) - goto out; + cxl_region_decode_reset(cxlr, p->interleave_ways); p->state = CXL_CONFIG_ACTIVE; } diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 0d8b810a51f0..5406e3ab3d4a 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -359,7 +359,7 @@ struct cxl_decoder { struct cxl_region *region; unsigned long flags; int (*commit)(struct cxl_decoder *cxld); - int (*reset)(struct cxl_decoder *cxld); + void (*reset)(struct cxl_decoder *cxld); }; /* @@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port) int cxl_num_decoders_committed(struct cxl_port *port); bool is_cxl_port(const struct device *dev); struct cxl_port *to_cxl_port(const struct device *dev); +void cxl_port_commit_reap(struct cxl_decoder *cxld); struct pci_bus; int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev, struct pci_bus *bus); diff --git a/include/linux/device.h b/include/linux/device.h index b4bde8d22697..667cb6db9019 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); int device_for_each_child_reverse(struct device *dev, void *data, int (*fn)(struct device *dev, void *data)); +int device_for_each_child_reverse_from(struct device *parent, + struct device *from, const void *data, + int (*fn)(struct device *, const void *)); struct device *device_find_child(struct device *dev, void *data, int (*match)(struct device *dev, void *data)); struct device *device_find_child_by_name(struct device *parent, diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c index 90d5afd52dd0..c5bbd89b3192 100644 --- a/tools/testing/cxl/test/cxl.c +++ b/tools/testing/cxl/test/cxl.c @@ 
-693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld) return 0; } -static int mock_decoder_reset(struct cxl_decoder *cxld) +static void mock_decoder_reset(struct cxl_decoder *cxld) { struct cxl_port *port = to_cxl_port(cxld->dev.parent); int id = cxld->id; if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0) - return 0; + return; dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev)); - if (port->commit_end != id) { + if (port->commit_end == id) + cxl_port_commit_reap(cxld); + else dev_dbg(&port->dev, "%s: out of order reset, expected decoder%d.%d\n", dev_name(&cxld->dev), port->id, port->commit_end); - return -EBUSY; - } - - port->commit_end--; cxld->flags &= ~CXL_DECODER_F_ENABLE; - - return 0; } static void default_mock_decoder(struct cxl_decoder *cxld) -- cgit From 105b6235ad0f24f271aef17f8865186c4546cb3a Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 22 Oct 2024 18:43:57 -0700 Subject: cxl/port: Prevent out-of-order decoder allocation With the recent change to allow out-of-order decoder de-commit it highlights a need to strengthen the in-order decoder commit guarantees. As it stands match_free_decoder() ensures that if 2 regions are racing decoder allocations the one that wins the race will get the lower id decoder, but that still leaves the race to *commit* the decoder. Rather than have this complicated case of "reserved in-order, but may still commit out-of-order", just arrange for the reservation order to match the commit-order. In other words, prevent subsequent allocations until the last reservation is committed. This precludes overlapping region creation events and requires the previous regionN to either move forward to the decoder commit stage or drop its reservation before regionN+1 can move forward. That is, provided that regionN and regionN+1 decode through the same switch port. As a side effect this allows match_free_decoder() to drop its dependency on needing write access to the device_find_child() @data parameter [1]. 
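A short standalone sketch (plain C, toy names) of the resulting allocation rule: the only decoder that may be reserved is commit_end + 1, and only while no lower-numbered decoder has been disabled out of order.

  #include <stdbool.h>
  #include <stdio.h>

  #define NR_DECODERS 4

  struct toy_port {
  	int commit_end;			/* highest committed decoder id */
  	bool enabled[NR_DECODERS];
  };

  static int pick_free_decoder(const struct toy_port *p)
  {
  	int next = p->commit_end + 1;

  	if (next >= NR_DECODERS)
  		return -1;
  	for (int id = 0; id < next; id++)
  		if (!p->enabled[id])
  			return -1;	/* out-of-order shutdown: block */
  	return next;
  }

  int main(void)
  {
  	struct toy_port p = { .commit_end = 1, .enabled = { true, true } };

  	printf("next: %d\n", pick_free_decoder(&p));	/* 2 */
  	p.enabled[0] = false;	/* decoder0 went away out of order */
  	printf("next: %d\n", pick_free_decoder(&p));	/* -1, blocked */
  	return 0;
  }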
Reported-by: Zijun Hu Closes: http://lore.kernel.org/20240905-const_dfc_prepare-v4-0-4180e1d5a244@quicinc.com Cc: Davidlohr Bueso Cc: Vishal Verma Cc: Alison Schofield Cc: Jonathan Cameron Signed-off-by: Dan Williams Reviewed-by: Jonathan Cameron Reviewed-by: Ira Weiny Link: https://patch.msgid.link/172964783668.81806.14962699553881333486.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny --- drivers/cxl/core/region.c | 43 +++++++++++++++++++++++++++++++++---------- 1 file changed, 33 insertions(+), 10 deletions(-) diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c index 3478d2058303..dff618c708dc 100644 --- a/drivers/cxl/core/region.c +++ b/drivers/cxl/core/region.c @@ -778,26 +778,50 @@ out: return rc; } +static int check_commit_order(struct device *dev, const void *data) +{ + struct cxl_decoder *cxld = to_cxl_decoder(dev); + + /* + * if port->commit_end is not the only free decoder, then out of + * order shutdown has occurred, block further allocations until + * that is resolved + */ + if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) + return -EBUSY; + return 0; +} + static int match_free_decoder(struct device *dev, void *data) { + struct cxl_port *port = to_cxl_port(dev->parent); struct cxl_decoder *cxld; - int *id = data; + int rc; if (!is_switch_decoder(dev)) return 0; cxld = to_cxl_decoder(dev); - /* enforce ordered allocation */ - if (cxld->id != *id) + if (cxld->id != port->commit_end + 1) return 0; - if (!cxld->region) - return 1; - - (*id)++; + if (cxld->region) { + dev_dbg(dev->parent, + "next decoder to commit (%s) is already reserved (%s)\n", + dev_name(dev), dev_name(&cxld->region->dev)); + return 0; + } - return 0; + rc = device_for_each_child_reverse_from(dev->parent, dev, NULL, + check_commit_order); + if (rc) { + dev_dbg(dev->parent, + "unable to allocate %s due to out of order shutdown\n", + dev_name(dev)); + return 0; + } + return 1; } static int match_auto_decoder(struct device *dev, void *data) @@ -824,7 +848,6 @@ cxl_region_find_decoder(struct cxl_port *port, struct cxl_region *cxlr) { struct device *dev; - int id = 0; if (port == cxled_to_port(cxled)) return &cxled->cxld; @@ -833,7 +856,7 @@ cxl_region_find_decoder(struct cxl_port *port, dev = device_find_child(&port->dev, &cxlr->params, match_auto_decoder); else - dev = device_find_child(&port->dev, &id, match_free_decoder); + dev = device_find_child(&port->dev, NULL, match_free_decoder); if (!dev) return NULL; /* -- cgit From 3a2b97b3210bd5758f66fad04c5171f85a016a04 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 22 Oct 2024 18:44:06 -0700 Subject: cxl/test: Improve init-order fidelity relative to real-world systems The investigation of an initialization failure [1] highlighted that cxl_test does not reflect the init-order of real world systems. The expected order is root/bus first then async probing of the memory devices. Fix up cxl_test to reflect that order. While it did not reproduce the initial bug report (since that is dependent on built-in vs modular builds), it did reveal a separate latent bug in the subsystem's decoder shutdown flow. Fix for that sent separately. 
Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Cc: Davidlohr Bueso Cc: Jonathan Cameron Cc: Dave Jiang Cc: Alison Schofield Cc: Vishal Verma Cc: Ira Weiny Signed-off-by: Dan Williams Reviewed-by: Jonathan Cameron Link: https://patch.msgid.link/172964784521.81806.15791069994065969243.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny --- tools/testing/cxl/test/cxl.c | 186 ++++++++++++++++++++++++------------------- tools/testing/cxl/test/mem.c | 1 + 2 files changed, 104 insertions(+), 83 deletions(-) diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c index c5bbd89b3192..050725afa45d 100644 --- a/tools/testing/cxl/test/cxl.c +++ b/tools/testing/cxl/test/cxl.c @@ -1058,7 +1058,7 @@ static void mock_companion(struct acpi_device *adev, struct device *dev) #define SZ_64G (SZ_32G * 2) #endif -static __init int cxl_rch_init(void) +static __init int cxl_rch_topo_init(void) { int rc, i; @@ -1086,30 +1086,8 @@ static __init int cxl_rch_init(void) goto err_bridge; } - for (i = 0; i < ARRAY_SIZE(cxl_rcd); i++) { - int idx = NR_MEM_MULTI + NR_MEM_SINGLE + i; - struct platform_device *rch = cxl_rch[i]; - struct platform_device *pdev; - - pdev = platform_device_alloc("cxl_rcd", idx); - if (!pdev) - goto err_mem; - pdev->dev.parent = &rch->dev; - set_dev_node(&pdev->dev, i % 2); - - rc = platform_device_add(pdev); - if (rc) { - platform_device_put(pdev); - goto err_mem; - } - cxl_rcd[i] = pdev; - } - return 0; -err_mem: - for (i = ARRAY_SIZE(cxl_rcd) - 1; i >= 0; i--) - platform_device_unregister(cxl_rcd[i]); err_bridge: for (i = ARRAY_SIZE(cxl_rch) - 1; i >= 0; i--) { struct platform_device *pdev = cxl_rch[i]; @@ -1123,12 +1101,10 @@ err_bridge: return rc; } -static void cxl_rch_exit(void) +static void cxl_rch_topo_exit(void) { int i; - for (i = ARRAY_SIZE(cxl_rcd) - 1; i >= 0; i--) - platform_device_unregister(cxl_rcd[i]); for (i = ARRAY_SIZE(cxl_rch) - 1; i >= 0; i--) { struct platform_device *pdev = cxl_rch[i]; @@ -1139,7 +1115,7 @@ static void cxl_rch_exit(void) } } -static __init int cxl_single_init(void) +static __init int cxl_single_topo_init(void) { int i, rc; @@ -1224,29 +1200,8 @@ static __init int cxl_single_init(void) cxl_swd_single[i] = pdev; } - for (i = 0; i < ARRAY_SIZE(cxl_mem_single); i++) { - struct platform_device *dport = cxl_swd_single[i]; - struct platform_device *pdev; - - pdev = platform_device_alloc("cxl_mem", NR_MEM_MULTI + i); - if (!pdev) - goto err_mem; - pdev->dev.parent = &dport->dev; - set_dev_node(&pdev->dev, i % 2); - - rc = platform_device_add(pdev); - if (rc) { - platform_device_put(pdev); - goto err_mem; - } - cxl_mem_single[i] = pdev; - } - return 0; -err_mem: - for (i = ARRAY_SIZE(cxl_mem_single) - 1; i >= 0; i--) - platform_device_unregister(cxl_mem_single[i]); err_dport: for (i = ARRAY_SIZE(cxl_swd_single) - 1; i >= 0; i--) platform_device_unregister(cxl_swd_single[i]); @@ -1269,12 +1224,10 @@ err_bridge: return rc; } -static void cxl_single_exit(void) +static void cxl_single_topo_exit(void) { int i; - for (i = ARRAY_SIZE(cxl_mem_single) - 1; i >= 0; i--) - platform_device_unregister(cxl_mem_single[i]); for (i = ARRAY_SIZE(cxl_swd_single) - 1; i >= 0; i--) platform_device_unregister(cxl_swd_single[i]); for (i = ARRAY_SIZE(cxl_swu_single) - 1; i >= 0; i--) @@ -1291,6 +1244,91 @@ static void cxl_single_exit(void) } } +static void cxl_mem_exit(void) +{ + int i; + + for (i = ARRAY_SIZE(cxl_rcd) - 1; i >= 0; i--) + platform_device_unregister(cxl_rcd[i]); + for (i = ARRAY_SIZE(cxl_mem_single) - 1; i >= 0; i--) + 
platform_device_unregister(cxl_mem_single[i]); + for (i = ARRAY_SIZE(cxl_mem) - 1; i >= 0; i--) + platform_device_unregister(cxl_mem[i]); +} + +static int cxl_mem_init(void) +{ + int i, rc; + + for (i = 0; i < ARRAY_SIZE(cxl_mem); i++) { + struct platform_device *dport = cxl_switch_dport[i]; + struct platform_device *pdev; + + pdev = platform_device_alloc("cxl_mem", i); + if (!pdev) + goto err_mem; + pdev->dev.parent = &dport->dev; + set_dev_node(&pdev->dev, i % 2); + + rc = platform_device_add(pdev); + if (rc) { + platform_device_put(pdev); + goto err_mem; + } + cxl_mem[i] = pdev; + } + + for (i = 0; i < ARRAY_SIZE(cxl_mem_single); i++) { + struct platform_device *dport = cxl_swd_single[i]; + struct platform_device *pdev; + + pdev = platform_device_alloc("cxl_mem", NR_MEM_MULTI + i); + if (!pdev) + goto err_single; + pdev->dev.parent = &dport->dev; + set_dev_node(&pdev->dev, i % 2); + + rc = platform_device_add(pdev); + if (rc) { + platform_device_put(pdev); + goto err_single; + } + cxl_mem_single[i] = pdev; + } + + for (i = 0; i < ARRAY_SIZE(cxl_rcd); i++) { + int idx = NR_MEM_MULTI + NR_MEM_SINGLE + i; + struct platform_device *rch = cxl_rch[i]; + struct platform_device *pdev; + + pdev = platform_device_alloc("cxl_rcd", idx); + if (!pdev) + goto err_rcd; + pdev->dev.parent = &rch->dev; + set_dev_node(&pdev->dev, i % 2); + + rc = platform_device_add(pdev); + if (rc) { + platform_device_put(pdev); + goto err_rcd; + } + cxl_rcd[i] = pdev; + } + + return 0; + +err_rcd: + for (i = ARRAY_SIZE(cxl_rcd) - 1; i >= 0; i--) + platform_device_unregister(cxl_rcd[i]); +err_single: + for (i = ARRAY_SIZE(cxl_mem_single) - 1; i >= 0; i--) + platform_device_unregister(cxl_mem_single[i]); +err_mem: + for (i = ARRAY_SIZE(cxl_mem) - 1; i >= 0; i--) + platform_device_unregister(cxl_mem[i]); + return rc; +} + static __init int cxl_test_init(void) { int rc, i; @@ -1403,29 +1441,11 @@ static __init int cxl_test_init(void) cxl_switch_dport[i] = pdev; } - for (i = 0; i < ARRAY_SIZE(cxl_mem); i++) { - struct platform_device *dport = cxl_switch_dport[i]; - struct platform_device *pdev; - - pdev = platform_device_alloc("cxl_mem", i); - if (!pdev) - goto err_mem; - pdev->dev.parent = &dport->dev; - set_dev_node(&pdev->dev, i % 2); - - rc = platform_device_add(pdev); - if (rc) { - platform_device_put(pdev); - goto err_mem; - } - cxl_mem[i] = pdev; - } - - rc = cxl_single_init(); + rc = cxl_single_topo_init(); if (rc) - goto err_mem; + goto err_dport; - rc = cxl_rch_init(); + rc = cxl_rch_topo_init(); if (rc) goto err_single; @@ -1438,19 +1458,20 @@ static __init int cxl_test_init(void) rc = platform_device_add(cxl_acpi); if (rc) - goto err_add; + goto err_root; + + rc = cxl_mem_init(); + if (rc) + goto err_root; return 0; -err_add: +err_root: platform_device_put(cxl_acpi); err_rch: - cxl_rch_exit(); + cxl_rch_topo_exit(); err_single: - cxl_single_exit(); -err_mem: - for (i = ARRAY_SIZE(cxl_mem) - 1; i >= 0; i--) - platform_device_unregister(cxl_mem[i]); + cxl_single_topo_exit(); err_dport: for (i = ARRAY_SIZE(cxl_switch_dport) - 1; i >= 0; i--) platform_device_unregister(cxl_switch_dport[i]); @@ -1482,11 +1503,10 @@ static __exit void cxl_test_exit(void) { int i; + cxl_mem_exit(); platform_device_unregister(cxl_acpi); - cxl_rch_exit(); - cxl_single_exit(); - for (i = ARRAY_SIZE(cxl_mem) - 1; i >= 0; i--) - platform_device_unregister(cxl_mem[i]); + cxl_rch_topo_exit(); + cxl_single_topo_exit(); for (i = ARRAY_SIZE(cxl_switch_dport) - 1; i >= 0; i--) platform_device_unregister(cxl_switch_dport[i]); for (i = 
ARRAY_SIZE(cxl_switch_uport) - 1; i >= 0; i--) diff --git a/tools/testing/cxl/test/mem.c b/tools/testing/cxl/test/mem.c index ad5c4c18c5c6..71916e0e1546 100644 --- a/tools/testing/cxl/test/mem.c +++ b/tools/testing/cxl/test/mem.c @@ -1673,6 +1673,7 @@ static struct platform_driver cxl_mock_mem_driver = { .name = KBUILD_MODNAME, .dev_groups = cxl_mock_mem_groups, .groups = cxl_mock_mem_core_groups, + .probe_type = PROBE_PREFER_ASYNCHRONOUS, }, }; -- cgit From 0e7ffff1b8117b05635c87d3c9099f6aa9c9b689 Mon Sep 17 00:00:00 2001 From: David Vernet Date: Fri, 25 Oct 2024 15:54:08 -0500 Subject: scx: Fix raciness in scx_ops_bypass() scx_ops_bypass() can currently race on the ops enable / disable path as follows: 1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1 2. An op on the init path exits, which schedules scx_ops_disable_workfn() 3. scx_ops_bypass(false) is called on the disable path, and bypass depth is decremented to 0 4. kthread is scheduled to execute scx_ops_disable_workfn() 5. scx_ops_bypass(true) called, bypass depth set to 1 6. scx_ops_bypass() races when iterating over CPUs While it's not safe to take any blocking locks on the bypass path, it is safe to take a raw spinlock which cannot be preempted. This patch therefore updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes scx_ops_bypass_depth to be a regular int. Without this change, we observe the following warnings when running the 'exit' sched_ext selftest (sometimes requires a couple of runs): .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit ... [ 14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280 [ 14.935126] Modules linked in: [ 14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24 [ 14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.935242] Sched_ext: exit (enabling+all) [ 14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280 [ 14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24 [ 14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002 [ 14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a [ 14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080 [ 14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a [ 14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.935642] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.935684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.935765] PKRU: 55555554 [ 14.935782] Call Trace: [ 14.935802] [ 14.935823] ? __warn+0xce/0x220 [ 14.935850] ? scx_ops_bypass+0x1ca/0x280 [ 14.935881] ? report_bug+0xc1/0x160 [ 14.935909] ? handle_bug+0x61/0x90 [ 14.935934] ? exc_invalid_op+0x1a/0x50 [ 14.935959] ? asm_exc_invalid_op+0x1a/0x20 [ 14.935984] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.936019] ? scx_ops_bypass+0x1ca/0x280 [ 14.936046] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936081] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936111] scx_ops_disable_workfn+0x146/0xac0 [ 14.936142] ? finish_task_switch+0xa9/0x2c0 [ 14.936172] ? 
srso_alias_return_thunk+0x5/0xfbef5 [ 14.936211] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936244] kthread_worker_fn+0x101/0x2c0 [ 14.936268] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.936299] kthread+0xec/0x110 [ 14.936327] ? __pfx_kthread+0x10/0x10 [ 14.936351] ret_from_fork+0x37/0x50 [ 14.936374] ? __pfx_kthread+0x10/0x10 [ 14.936400] ret_from_fork_asm+0x1a/0x30 [ 14.936427] [ 14.936443] irq event stamp: 21002 [ 14.936467] hardirqs last enabled at (21001): [] resched_cpu+0x9f/0xd0 [ 14.936521] hardirqs last disabled at (21002): [] scx_ops_bypass+0x11a/0x280 [ 14.936571] softirqs last enabled at (20642): [] __irq_exit_rcu+0x67/0xd0 [ 14.936622] softirqs last disabled at (20637): [] __irq_exit_rcu+0x67/0xd0 [ 14.936672] ---[ end trace 0000000000000000 ]--- [ 14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 14.953352] ------------[ cut here ]------------ [ 14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280 [ 14.953428] Modules linked in: [ 14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G W 6.11.0-virtme #24 [ 14.953505] Tainted: [W]=WARN [ 14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280 [ 14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0 [ 14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046 [ 14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318 [ 14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.953934] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.953974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.954052] PKRU: 55555554 [ 14.954068] Call Trace: [ 14.954085] [ 14.954102] ? __warn+0xce/0x220 [ 14.954126] ? scx_ops_bypass+0x1d8/0x280 [ 14.954150] ? report_bug+0xc1/0x160 [ 14.954178] ? handle_bug+0x61/0x90 [ 14.954203] ? exc_invalid_op+0x1a/0x50 [ 14.954226] ? asm_exc_invalid_op+0x1a/0x20 [ 14.954250] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.954285] ? scx_ops_bypass+0x1d8/0x280 [ 14.954311] ? __mutex_unlock_slowpath+0x3a/0x260 [ 14.954343] scx_ops_disable_workfn+0xa3e/0xac0 [ 14.954381] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.954413] kthread_worker_fn+0x101/0x2c0 [ 14.954442] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.954479] kthread+0xec/0x110 [ 14.954507] ? __pfx_kthread+0x10/0x10 [ 14.954530] ret_from_fork+0x37/0x50 [ 14.954553] ? 
__pfx_kthread+0x10/0x10 [ 14.954576] ret_from_fork_asm+0x1a/0x30 [ 14.954603] [ 14.954621] irq event stamp: 21002 [ 14.954644] hardirqs last enabled at (21001): [] resched_cpu+0x9f/0xd0 [ 14.954686] hardirqs last disabled at (21002): [] scx_ops_bypass+0x11a/0x280 [ 14.954735] softirqs last enabled at (20642): [] __irq_exit_rcu+0x67/0xd0 [ 14.954782] softirqs last disabled at (20637): [] __irq_exit_rcu+0x67/0xd0 [ 14.954829] ---[ end trace 0000000000000000 ]--- [ 15.022283] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 15.092282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 15.149282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== And with it, the test passes without issue after 1000s of runs: .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places OUTPUT: [ 7.412856] sched_ext: BPF scheduler "exit" enabled [ 7.427924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.466677] sched_ext: BPF scheduler "exit" enabled [ 7.475923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.512803] sched_ext: BPF scheduler "exit" enabled [ 7.532924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.586809] sched_ext: BPF scheduler "exit" enabled [ 7.595926] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.661923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 7.723923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) ok 1 exit # ===== END ===== ============================= RESULTS: PASSED: 1 SKIPPED: 0 FAILED: 0 Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Signed-off-by: David Vernet Signed-off-by: Tejun Heo --- kernel/sched/ext.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6eae3b69bf6e..74344a43ccf1 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -862,7 +862,8 @@ static DEFINE_MUTEX(scx_ops_enable_mutex); DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled); DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED); -static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0); +static int scx_ops_bypass_depth; +static DEFINE_RAW_SPINLOCK(__scx_ops_bypass_lock); static bool scx_ops_init_task_enabled; static bool scx_switching_all; DEFINE_STATIC_KEY_FALSE(__scx_switched_all); @@ -4298,18 +4299,20 @@ bool task_should_scx(struct task_struct *p) */ static void scx_ops_bypass(bool bypass) { - int depth, cpu; + int cpu; + unsigned long flags; + raw_spin_lock_irqsave(&__scx_ops_bypass_lock, flags); if (bypass) { - depth = atomic_inc_return(&scx_ops_bypass_depth); - WARN_ON_ONCE(depth <= 0); - if (depth != 1) - return; + scx_ops_bypass_depth++; + WARN_ON_ONCE(scx_ops_bypass_depth <= 0); + if (scx_ops_bypass_depth != 1) + goto unlock; } else { - depth = atomic_dec_return(&scx_ops_bypass_depth); - WARN_ON_ONCE(depth < 0); - if (depth != 0) - return; + scx_ops_bypass_depth--; + WARN_ON_ONCE(scx_ops_bypass_depth < 0); + if (scx_ops_bypass_depth != 0) + goto unlock; } /* @@ -4326,7 +4329,7 @@ static void scx_ops_bypass(bool bypass) struct rq_flags rf; struct task_struct *p, *n; - rq_lock_irqsave(rq, &rf); + rq_lock(rq, &rf); if (bypass) { WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING); @@ -4362,11 +4365,13 @@ static void scx_ops_bypass(bool bypass) 
sched_enq_and_set_task(&ctx); } - rq_unlock_irqrestore(rq, &rf); + rq_unlock(rq, &rf); /* resched to restore ticks and idle state */ resched_cpu(cpu); } +unlock: + raw_spin_unlock_irqrestore(&__scx_ops_bypass_lock, flags); } static void free_exit_info(struct scx_exit_info *ei) -- cgit From 7724abf0ca77460cb06ac3d5e4352a5c2289c3ae Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 25 Oct 2024 12:11:14 -1000 Subject: sched_ext: Make cast_mask() inline cast_mask() doesn't do any actual work and is defined in a header file. Force it to be inline. When it is not inlined and the function is not used, it can cause verificaiton failures like the following: # tools/testing/selftests/sched_ext/runner -t minimal ===== START ===== TEST: minimal DESCRIPTION: Verify we can load a fully minimal scheduler OUTPUT: libbpf: prog 'cast_mask': missing BPF prog type, check ELF section name '.text' libbpf: prog 'cast_mask': failed to load: -22 libbpf: failed to load object 'minimal' libbpf: failed to load BPF skeleton 'minimal': -22 ERR: minimal.c:20 Failed to open and load skel not ok 1 minimal # ===== END ===== Signed-off-by: Tejun Heo Fixes: a748db0c8c6a ("tools/sched_ext: Receive misc updates from SCX repo") --- tools/sched_ext/include/scx/common.bpf.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 27749c51c3ec..248ab790d143 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -320,7 +320,7 @@ u32 bpf_cpumask_weight(const struct cpumask *cpumask) __ksym; /* * Access a cpumask in read-only mode (typically to check bits). */ -const struct cpumask *cast_mask(struct bpf_cpumask *mask) +static __always_inline const struct cpumask *cast_mask(struct bpf_cpumask *mask) { return (const struct cpumask *)mask; } -- cgit From c31f2ee5cd7da3086eb4fbeef9f3afdc8e01d36b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 25 Oct 2024 12:19:06 -1000 Subject: sched_ext: Fix enq_last_no_enq_fails selftest cc9877fb7677 ("sched_ext: Improve error reporting during loading") changed how load failures are reported so that more error context can be communicated. This breaks the enq_last_no_enq_fails test as attach no longer fails. The scheduler is guaranteed to be ejected on attach completion with full error information. Update enq_last_no_enq_fails so that it checks that the scheduler is ejected using ops.exit(). 
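The pattern the updated test relies on, register successfully and then observe the failure through an exit callback rather than through the attach return value, can be sketched in plain C. This is only an illustrative analogue; fake_ops, record_exit and exit_kind are invented names, not part of sched_ext or the selftest harness.

  #include <assert.h>
  #include <stdio.h>

  /* Illustrative only: registration "succeeds" and errors are reported
   * asynchronously through an optional exit hook in the ops table. */
  struct fake_ops {
          void (*exit)(int kind);
  };

  static int exit_kind;           /* 0 means "still loaded" */

  static void record_exit(int kind)
  {
          exit_kind = kind;       /* analogous to the BPF global in the test */
  }

  static void fake_register(struct fake_ops *ops)
  {
          /* the framework ejects the client and reports why */
          if (ops->exit)
                  ops->exit(42);  /* arbitrary non-zero "unregistered" kind */
  }

  int main(void)
  {
          struct fake_ops ops = { .exit = record_exit };

          fake_register(&ops);            /* attach itself no longer fails... */
          assert(exit_kind != 0);         /* ...so assert on the ejection instead */
          printf("ejected with kind %d\n", exit_kind);
          return 0;
  }
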
Signed-off-by: Tejun Heo Reported-by: Vishal Chourasia Link: http://lkml.kernel.org/r/Zxknp7RAVNjmdJSc@linux.ibm.com Fixes: cc9877fb7677 ("sched_ext: Improve error reporting during loading") --- tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c | 8 ++++++++ tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c | 10 +++++++--- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c index b0b99531d5d5..e1bd13e48889 100644 --- a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c +++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c @@ -12,10 +12,18 @@ char _license[] SEC("license") = "GPL"; +u32 exit_kind; + +void BPF_STRUCT_OPS_SLEEPABLE(enq_last_no_enq_fails_exit, struct scx_exit_info *info) +{ + exit_kind = info->kind; +} + SEC(".struct_ops.link") struct sched_ext_ops enq_last_no_enq_fails_ops = { .name = "enq_last_no_enq_fails", /* Need to define ops.enqueue() with SCX_OPS_ENQ_LAST */ .flags = SCX_OPS_ENQ_LAST, + .exit = (void *) enq_last_no_enq_fails_exit, .timeout_ms = 1000U, }; diff --git a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c index 2a3eda5e2c0b..73e679953e27 100644 --- a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c +++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c @@ -31,8 +31,12 @@ static enum scx_test_status run(void *ctx) struct bpf_link *link; link = bpf_map__attach_struct_ops(skel->maps.enq_last_no_enq_fails_ops); - if (link) { - SCX_ERR("Incorrectly succeeded in to attaching scheduler"); + if (!link) { + SCX_ERR("Incorrectly failed at attaching scheduler"); + return SCX_TEST_FAIL; + } + if (!skel->bss->exit_kind) { + SCX_ERR("Incorrectly stayed loaded"); return SCX_TEST_FAIL; } @@ -50,7 +54,7 @@ static void cleanup(void *ctx) struct scx_test enq_last_no_enq_fails = { .name = "enq_last_no_enq_fails", - .description = "Verify we fail to load a scheduler if we specify " + .description = "Verify we eject a scheduler if we specify " "the SCX_OPS_ENQ_LAST flag without defining " "ops.enqueue()", .setup = setup, -- cgit From cf44e745048df2c935cb37de16e0ca476003a3b1 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Fri, 25 Oct 2024 16:05:50 -0600 Subject: wifi: mac80211: ieee80211_i: Fix memory corruption bug in struct ieee80211_chanctx Move the `struct ieee80211_chanctx_conf conf` to the end of `struct ieee80211_chanctx` and fix a memory corruption bug triggered e.g. in `hwsim_set_chanctx_magic()`: `radar_detected` is being overwritten when `cp->magic = HWSIM_CHANCTX_MAGIC;` See the function call sequence below: drv_add_chanctx(... struct ieee80211_chanctx *ctx) -> local->ops->add_chanctx(&local->hw, &ctx->conf) -> mac80211_hwsim_add_chanctx(... struct ieee80211_chanctx_conf *ctx) -> hwsim_set_chanctx_magic(ctx) This also happens in a number of other drivers. Also, add a code comment to try to prevent people from introducing new members after `struct ieee80211_chanctx_conf conf`. Notice that `struct ieee80211_chanctx_conf` is a flexible structure --a structure that contains a flexible-array member, so it should always be at the end of any other containing structures. 
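The hazard can be demonstrated outside the kernel in a few lines of C. When a structure that ends in a flexible-array member is embedded anywhere but at the end of its container (ISO C forbids this; GCC accepts it as the extension the kernel relies on), the flexible array has no storage of its own, so writes through it land on whatever member follows. The struct names below are invented for the illustration; only the principle matches the ieee80211_chanctx case.

  #include <stdio.h>
  #include <string.h>

  struct inner {
          int n;
          char payload[];         /* flexible-array member: no storage of its own */
  };

  struct bad_outer {
          struct inner conf;      /* WRONG: not last (ISO C forbids it, GCC allows it) */
          int radar_detected;     /* writes to conf.payload land here */
  };

  struct good_outer {
          int radar_detected;
          struct inner conf;      /* OK: the flexible array is last again */
  };

  int main(void)
  {
          struct bad_outer bad = { .radar_detected = 1 };

          printf("payload offset %zu, radar_detected offset %zu\n",
                 (size_t)((char *)bad.conf.payload - (char *)&bad),
                 (size_t)((char *)&bad.radar_detected - (char *)&bad));

          memset(bad.conf.payload, 0, sizeof(int));       /* clobbers radar_detected */
          printf("radar_detected after the write: %d\n", bad.radar_detected);
          return 0;
  }
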
This change also fixes 50 of the following warnings: net/mac80211/ieee80211_i.h:895:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Fixes: bca8bc0399ac ("wifi: mac80211: handle ieee80211_radar_detected() for MLO") Signed-off-by: Gustavo A. R. Silva Link: https://patch.msgid.link/ZxwWPrncTeSi1UTq@kspp [also refer to other drivers in commit message] Signed-off-by: Johannes Berg --- net/mac80211/ieee80211_i.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h index 04fec7e516cf..3d3c9139ff5e 100644 --- a/net/mac80211/ieee80211_i.h +++ b/net/mac80211/ieee80211_i.h @@ -892,9 +892,10 @@ struct ieee80211_chanctx { /* temporary data for search algorithm etc. */ struct ieee80211_chan_req req; - struct ieee80211_chanctx_conf conf; - bool radar_detected; + + /* MUST be last - ends in a flexible-array member. */ + struct ieee80211_chanctx_conf conf; }; struct mac80211_qos_map { -- cgit From 2860586c588ad2dd8747e85ab43c4cf58bb066f4 Mon Sep 17 00:00:00 2001 From: Dmitry Torokhov Date: Fri, 4 Oct 2024 07:07:08 -0700 Subject: Input: adp5588-keys - do not try to disable interrupt 0 Commit dc748812fca0 ("Input: adp5588-keys - add support for pure gpio") made having interrupt line optional for the device, however it neglected to update suspend and resume handlers that try to disable interrupts for the duration of suspend. Fix this by checking if interrupt number assigned to the i2c device is not 0 before trying to disable or reenable it. Fixes: dc748812fca0 ("Input: adp5588-keys - add support for pure gpio") Link: https://lore.kernel.org/r/Zv_2jEMYSWDw2gKs@google.com Signed-off-by: Dmitry Torokhov --- drivers/input/keyboard/adp5588-keys.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/input/keyboard/adp5588-keys.c b/drivers/input/keyboard/adp5588-keys.c index d25d63a807f2..dc734974ce06 100644 --- a/drivers/input/keyboard/adp5588-keys.c +++ b/drivers/input/keyboard/adp5588-keys.c @@ -822,7 +822,8 @@ static int adp5588_suspend(struct device *dev) { struct i2c_client *client = to_i2c_client(dev); - disable_irq(client->irq); + if (client->irq) + disable_irq(client->irq); return 0; } @@ -831,7 +832,8 @@ static int adp5588_resume(struct device *dev) { struct i2c_client *client = to_i2c_client(dev); - enable_irq(client->irq); + if (client->irq) + enable_irq(client->irq); return 0; } -- cgit From 9c70b2a33cd2aa6a5a59c5523ef053bd42265209 Mon Sep 17 00:00:00 2001 From: Shawn Wang Date: Fri, 25 Oct 2024 10:22:08 +0800 Subject: sched/numa: Fix the potential null pointer dereference in task_numa_work() When running stress-ng-vm-segv test, we found a null pointer dereference error in task_numa_work(). Here is the backtrace: [323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 ...... [323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se ...... 
[323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) [323676.067115] pc : vma_migratable+0x1c/0xd0 [323676.067122] lr : task_numa_work+0x1ec/0x4e0 [323676.067127] sp : ffff8000ada73d20 [323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010 [323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000 [323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000 [323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8 [323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035 [323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8 [323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4 [323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001 [323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000 [323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000 [323676.067152] Call trace: [323676.067153] vma_migratable+0x1c/0xd0 [323676.067155] task_numa_work+0x1ec/0x4e0 [323676.067157] task_work_run+0x78/0xd8 [323676.067161] do_notify_resume+0x1ec/0x290 [323676.067163] el0_svc+0x150/0x160 [323676.067167] el0t_64_sync_handler+0xf8/0x128 [323676.067170] el0t_64_sync+0x17c/0x180 [323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000) [323676.067177] SMP: stopping secondary CPUs [323676.070184] Starting crashdump kernel... stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error handling function of the system, which tries to cause a SIGSEGV error on return from unmapping the whole address space of the child process. Normally this program will not cause kernel crashes. But before the munmap system call returns to user mode, a potential task_numa_work() for numa balancing could be added and executed. In this scenario, since the child process has no vma after munmap, the vma_next() in task_numa_work() will return a null pointer even if the vma iterator restarts from 0. Recheck the vma pointer before dereferencing it in task_numa_work(). Fixes: 214dbc428137 ("sched: convert to vma iterator") Signed-off-by: Shawn Wang Signed-off-by: Peter Zijlstra (Intel) Cc: stable@vger.kernel.org # v6.2+ Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.com --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 879614686188..2d16c8545c71 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3369,7 +3369,7 @@ retry_pids: vma = vma_next(&vmi); } - do { + for (; vma; vma = vma_next(&vmi)) { if (!vma_migratable(vma) || !vma_policy_mof(vma) || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE); @@ -3491,7 +3491,7 @@ retry_pids: */ if (vma_pids_forced) break; - } for_each_vma(vmi, vma); + } /* * If no VMAs are remaining and VMAs were skipped due to the PID -- cgit From b5413156bad91dc2995a5c4eab1b05e56914638a Mon Sep 17 00:00:00 2001 From: Benjamin Segall Date: Fri, 25 Oct 2024 18:35:35 -0700 Subject: posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone When cloning a new thread, its posix_cputimers are not inherited, and are cleared by posix_cputimers_init(). 
However, this does not clear the tick dependency it creates in tsk->tick_dep_mask, and the handler does not reach the code to clear the dependency if there were no timers to begin with. Thus if a thread has a cputimer running before clone/fork, all descendants will prevent nohz_full unless they create a cputimer of their own. Fix this by entirely clearing the tick_dep_mask in copy_process(). (There is currently no inherited state that needs a tick dependency) Process-wide timers do not have this problem because fork does not copy signal_struct as a baseline, it creates one from scratch. Fixes: b78783000d5c ("posix-cpu-timers: Migrate to use new tick dependency mask model") Signed-off-by: Ben Segall Signed-off-by: Thomas Gleixner Reviewed-by: Frederic Weisbecker Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/xm26o737bq8o.fsf@google.com --- include/linux/tick.h | 8 ++++++++ kernel/fork.c | 2 ++ 2 files changed, 10 insertions(+) diff --git a/include/linux/tick.h b/include/linux/tick.h index 72744638c5b0..99c9c5a7252a 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -251,12 +251,19 @@ static inline void tick_dep_set_task(struct task_struct *tsk, if (tick_nohz_full_enabled()) tick_nohz_dep_set_task(tsk, bit); } + static inline void tick_dep_clear_task(struct task_struct *tsk, enum tick_dep_bits bit) { if (tick_nohz_full_enabled()) tick_nohz_dep_clear_task(tsk, bit); } + +static inline void tick_dep_init_task(struct task_struct *tsk) +{ + atomic_set(&tsk->tick_dep_mask, 0); +} + static inline void tick_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit) { @@ -290,6 +297,7 @@ static inline void tick_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit) { } static inline void tick_dep_clear_task(struct task_struct *tsk, enum tick_dep_bits bit) { } +static inline void tick_dep_init_task(struct task_struct *tsk) { } static inline void tick_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit) { } static inline void tick_dep_clear_signal(struct signal_struct *signal, diff --git a/kernel/fork.c b/kernel/fork.c index 89ceb4a68af2..6fa9fe62e01e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -105,6 +105,7 @@ #include #include #include +#include #include #include @@ -2292,6 +2293,7 @@ __latent_entropy struct task_struct *copy_process( acct_clear_integrals(p); posix_cputimers_init(&p->posix_cputimers); + tick_dep_init_task(p); p->io_context = NULL; audit_set_context(p, NULL); -- cgit From 5f994f534120f47432092fb36f5cb0c7a80ed2bf Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Sat, 26 Oct 2024 14:36:39 +0800 Subject: genirq/msi: Fix off-by-one error in msi_domain_alloc() The error path in msi_domain_alloc(), frees the already allocated MSI interrupts in a loop, but the loop condition terminates when the index reaches zero, which fails to free the first allocated MSI interrupt at index zero. Check for >= 0 so that msi[0] is freed as well. 
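The off-by-one is easiest to see in a stand-alone sketch of the same unwind pattern: after successfully setting up slots 0..i-1 and failing at slot i, the cleanup loop has to run down to and including index 0, so the terminating condition must be i >= 0 (or the loop rewritten as while (i--)). alloc_slot() and free_slot() below are placeholders, not the msi_domain ops.

  #include <stdio.h>
  #include <stdbool.h>

  #define NR_SLOTS 4

  static bool used[NR_SLOTS];

  static int alloc_slot(int i)
  {
          if (i == 2)             /* simulate a failure part-way through */
                  return -1;
          used[i] = true;
          return 0;
  }

  static void free_slot(int i)
  {
          used[i] = false;
  }

  int main(void)
  {
          int i;

          for (i = 0; i < NR_SLOTS; i++) {
                  if (alloc_slot(i) < 0) {
                          /* Buggy form: "for (i--; i > 0; i--)" stops before
                           * index 0 and leaks the first slot.  Correct form: */
                          for (i--; i >= 0; i--)
                                  free_slot(i);
                          break;
                  }
          }

          for (i = 0; i < NR_SLOTS; i++)
                  printf("slot %d %s\n", i, used[i] ? "LEAKED" : "released");
          return 0;
  }
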
Fixes: f3cf8bb0d6c3 ("genirq: Add generic msi irq domain support") Signed-off-by: Jinjie Ruan Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20241026063639.10711-1-ruanjinjie@huawei.com --- kernel/irq/msi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index 3a24d6b5f559..396a067a8a56 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -718,7 +718,7 @@ static int msi_domain_alloc(struct irq_domain *domain, unsigned int virq, ret = ops->msi_init(domain, info, virq + i, hwirq + i, arg); if (ret < 0) { if (ops->msi_free) { - for (i--; i > 0; i--) + for (i--; i >= 0; i--) ops->msi_free(domain, info, virq + i); } irq_domain_free_irqs_top(domain, virq, nr_irqs); -- cgit From e6c24e2d05bb05de96ffb9bdb0ee62d20ad526f8 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sun, 27 Oct 2024 10:22:20 +0000 Subject: irqchip/gic-v4: Correctly deal with set_affinity on lazily-mapped VPEs Zenghui points out that a recent change to the way set_affinity is handled for VPEs has the potential to return an error if the VPE hasn't been mapped yet (because the guest hasn't emited a MAPTI command yet), affecting GICv4.0 implementations that rely on the ITSList feature. Fix this by making the set_affinity succeed in this case, and return early, without trying to touch the HW. Fixes: 1442ee0011983 ("irqchip/gic-v4: Don't allow a VMOVP on a dying VPE") Reported-by: Zenghui Yu Signed-off-by: Marc Zyngier Signed-off-by: Thomas Gleixner Reviewed-by: Zenghui Yu Link: https://lore.kernel.org/all/20241027102220.1858558-1-maz@kernel.org Link: https://lore.kernel.org/r/aab45cd3-e5ca-58cf-e081-e32a17f5b4e7@huawei.com --- drivers/irqchip/irq-gic-v3-its.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index ab597e74ba08..52f625e07658 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -3810,8 +3810,18 @@ static int its_vpe_set_affinity(struct irq_data *d, * Check if we're racing against a VPE being destroyed, for * which we don't want to allow a VMOVP. */ - if (!atomic_read(&vpe->vmapp_count)) - return -EINVAL; + if (!atomic_read(&vpe->vmapp_count)) { + if (gic_requires_eager_mapping()) + return -EINVAL; + + /* + * If we lazily map the VPEs, this isn't an error and + * we can exit cleanly. 
+ */ + cpu = cpumask_first(mask_val); + irq_data_update_effective_affinity(d, cpumask_of(cpu)); + return IRQ_SET_MASK_OK_DONE; + } /* * Changing affinity is mega expensive, so let's be as lazy as -- cgit From c38a04ecb6ac25c0c8786b5c5bfa4724ee483d67 Mon Sep 17 00:00:00 2001 From: Miguel Ojeda Date: Sun, 27 Oct 2024 15:56:36 +0100 Subject: kbuild: rust: avoid errors with old `rustc`s without LLVM patch version Some old versions of `rustc` did not report the LLVM version without the patch version, e.g.: $ rustc --version --verbose rustc 1.48.0 (7eac88abb 2020-11-16) binary: rustc commit-hash: 7eac88abb2e57e752f3302f02be5f3ce3d7adfb4 commit-date: 2020-11-16 host: x86_64-unknown-linux-gnu release: 1.48.0 LLVM version: 11.0 Which would make the new `scripts/rustc-llvm-version.sh` fail and, in turn, the build: $ make LLVM=1 SYNC include/config/auto.conf.cmd ./scripts/rustc-llvm-version.sh: 13: arithmetic expression: expecting primary: "10000 * 10 + 100 * 0 + " init/Kconfig:83: syntax error init/Kconfig:83: invalid statement make[3]: *** [scripts/kconfig/Makefile:85: syncconfig] Error 1 make[2]: *** [Makefile:679: syncconfig] Error 2 make[1]: *** [/home/cam/linux/Makefile:780: include/config/auto.conf.cmd] Error 2 make: *** [Makefile:224: __sub-make] Error 2 Since we do not need to support such binaries, we can avoid adding logic for computing `rustc`'s LLVM version for those old binaries. Thus, instead, just make the match stricter. Other `rustc` binaries (even newer) did not report the LLVM version at all, but that was fine, since it would not match "LLVM", e.g.: $ rustc --version --verbose rustc 1.49.0 (e1884a8e3 2020-12-29) binary: rustc commit-hash: e1884a8e3c3e813aada8254edfa120e85bf5ffca commit-date: 2020-12-29 host: x86_64-unknown-linux-gnu release: 1.49.0 Cc: Thorsten Leemhuis Cc: Gary Guo Reported-by: Cameron MacPherson Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219423 Fixes: af0121c2d303 ("kbuild: rust: add `CONFIG_RUSTC_LLVM_VERSION`") Tested-by: Cameron MacPherson Reviewed-by: Nathan Chancellor Tested-by: Nathan Chancellor Link: https://lore.kernel.org/r/20241027145636.416030-1-ojeda@kernel.org Signed-off-by: Miguel Ojeda --- scripts/rustc-llvm-version.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/rustc-llvm-version.sh b/scripts/rustc-llvm-version.sh index b6063cbe5bdc..a500d1ae3101 100755 --- a/scripts/rustc-llvm-version.sh +++ b/scripts/rustc-llvm-version.sh @@ -13,7 +13,7 @@ get_canonical_version() echo $((10000 * $1 + 100 * $2 + $3)) } -if output=$("$@" --version --verbose 2>/dev/null | grep LLVM); then +if output=$("$@" --version --verbose 2>/dev/null | grep -E 'LLVM.*[0-9]+\.[0-9]+\.[0-9]+'); then set -- $output get_canonical_version $3 else -- cgit From f19910006effbd08398de79ca0233ea7e480616a Mon Sep 17 00:00:00 2001 From: Ian Kent Date: Mon, 28 Oct 2024 06:47:17 +0800 Subject: autofs: fix thinko in validate_dev_ioctl() I was so sure the per-dentry expire timeout patch worked ok but my testing was flawed. In validate_dev_ioctl() the check for ioctl AUTOFS_DEV_IOCTL_TIMEOUT_CMD should use the ioctl number not the passed in ioctl command. 
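For readers less familiar with the ioctl encoding: the 32-bit command word packs direction, size, type and number fields, and _IOC_NR() extracts just the number, so comparing the full command against a bare command number can never match. A user-space sketch of that distinction; the 'X'/0x21 type and number pair and the macro names are arbitrary example values, not the real autofs definitions.

  #include <stdio.h>
  #include <linux/ioctl.h>        /* _IOWR(), _IOC_NR() */

  #define EXAMPLE_TIMEOUT_NR      0x21                            /* bare number */
  #define EXAMPLE_TIMEOUT_CMD     _IOWR('X', EXAMPLE_TIMEOUT_NR, int)

  int main(void)
  {
          unsigned int cmd = EXAMPLE_TIMEOUT_CMD;

          printf("full cmd       = 0x%08x\n", cmd);
          printf("_IOC_NR(cmd)   = 0x%02x\n", _IOC_NR(cmd));
          printf("cmd == NR?       %s\n", cmd == EXAMPLE_TIMEOUT_NR ? "yes" : "no");
          printf("_IOC_NR == NR?   %s\n",
                 _IOC_NR(cmd) == EXAMPLE_TIMEOUT_NR ? "yes" : "no");
          return 0;
  }
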
Fixes: 433f9d76a010 ("autofs: add per dentry expire timeout") Cc: # mainline only Signed-off-by: Ian Kent Link: https://lore.kernel.org/r/20241027224732.5507-1-raven@themaw.net Signed-off-by: Christian Brauner --- fs/autofs/dev-ioctl.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c index f011e026358e..6d57efbb8110 100644 --- a/fs/autofs/dev-ioctl.c +++ b/fs/autofs/dev-ioctl.c @@ -110,6 +110,7 @@ static inline void free_dev_ioctl(struct autofs_dev_ioctl *param) */ static int validate_dev_ioctl(int cmd, struct autofs_dev_ioctl *param) { + unsigned int inr = _IOC_NR(cmd); int err; err = check_dev_ioctl_version(cmd, param); @@ -133,7 +134,7 @@ static int validate_dev_ioctl(int cmd, struct autofs_dev_ioctl *param) * check_name() return for AUTOFS_DEV_IOCTL_TIMEOUT_CMD. */ err = check_name(param->path); - if (cmd == AUTOFS_DEV_IOCTL_TIMEOUT_CMD) + if (inr == AUTOFS_DEV_IOCTL_TIMEOUT_CMD) err = err ? 0 : -EINVAL; if (err) { pr_warn("invalid path supplied for cmd(0x%08x)\n", @@ -141,8 +142,6 @@ static int validate_dev_ioctl(int cmd, struct autofs_dev_ioctl *param) goto out; } } else { - unsigned int inr = _IOC_NR(cmd); - if (inr == AUTOFS_DEV_IOCTL_OPENMOUNT_CMD || inr == AUTOFS_DEV_IOCTL_REQUESTER_CMD || inr == AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD) { -- cgit From d221b844ee79823ffc29b7badc4010bdb0960224 Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Sat, 26 Oct 2024 22:46:34 +0200 Subject: ASoC: cs42l51: Fix some error handling paths in cs42l51_probe() If devm_gpiod_get_optional() fails, we need to disable previously enabled regulators, as done in the other error handling path of the function. Also, gpiod_set_value_cansleep(, 1) needs to be called to undo a potential gpiod_set_value_cansleep(, 0). If the "reset" gpio is not defined, this additional call is just a no-op. This behavior is the same as the one already in the .remove() function. 
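The shape of the fix is the usual goto-based unwind: everything acquired before the failure point is released, in reverse order, on the error path, mirroring what the remove path already does. A compact user-space analogue, with invented function names rather than the codec driver's API:

  #include <stdio.h>

  static int enable_supplies(void)   { printf("supplies enabled\n"); return 0; }
  static void disable_supplies(void) { printf("supplies disabled\n"); }
  static int get_reset_gpio(void)    { return -1; /* simulate the failure */ }
  static void assert_reset(void)     { printf("reset asserted\n"); }

  static int probe(void)
  {
          int ret;

          ret = enable_supplies();
          if (ret)
                  return ret;             /* nothing to undo yet */

          ret = get_reset_gpio();
          if (ret < 0)
                  goto error;             /* must not simply "return ret" here */

          return 0;

  error:
          /* Undo in reverse order, mirroring the remove path. */
          assert_reset();                 /* harmless if reset was never released */
          disable_supplies();
          return ret;
  }

  int main(void)
  {
          printf("probe() = %d\n", probe());
          return 0;
  }
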
Fixes: 11b9cd748e31 ("ASoC: cs42l51: add reset management") Signed-off-by: Christophe JAILLET Reviewed-by: Charles Keepax Link: https://patch.msgid.link/a5e5f4b9fb03f46abd2c93ed94b5c395972ce0d1.1729975570.git.christophe.jaillet@wanadoo.fr Signed-off-by: Mark Brown --- sound/soc/codecs/cs42l51.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/sound/soc/codecs/cs42l51.c b/sound/soc/codecs/cs42l51.c index e4827b8c2bde..6e51954bdb1e 100644 --- a/sound/soc/codecs/cs42l51.c +++ b/sound/soc/codecs/cs42l51.c @@ -747,8 +747,10 @@ int cs42l51_probe(struct device *dev, struct regmap *regmap) cs42l51->reset_gpio = devm_gpiod_get_optional(dev, "reset", GPIOD_OUT_LOW); - if (IS_ERR(cs42l51->reset_gpio)) - return PTR_ERR(cs42l51->reset_gpio); + if (IS_ERR(cs42l51->reset_gpio)) { + ret = PTR_ERR(cs42l51->reset_gpio); + goto error; + } if (cs42l51->reset_gpio) { dev_dbg(dev, "Release reset gpio\n"); @@ -780,6 +782,7 @@ int cs42l51_probe(struct device *dev, struct regmap *regmap) return 0; error: + gpiod_set_value_cansleep(cs42l51->reset_gpio, 1); regulator_bulk_disable(ARRAY_SIZE(cs42l51->supplies), cs42l51->supplies); return ret; -- cgit From c749d9b7ebbc5716af7a95f7768634b30d9446ec Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Sun, 27 Oct 2024 15:23:23 -0700 Subject: iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP generic/077 on x86_32 CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y with highmem, on huge=always tmpfs, issues a warning and then hangs (interruptibly): WARNING: CPU: 5 PID: 3517 at mm/highmem.c:622 kunmap_local_indexed+0x62/0xc9 CPU: 5 UID: 0 PID: 3517 Comm: cp Not tainted 6.12.0-rc4 #2 ... copy_page_from_iter_atomic+0xa6/0x5ec generic_perform_write+0xf6/0x1b4 shmem_file_write_iter+0x54/0x67 Fix copy_page_from_iter_atomic() by limiting it in that case (include/linux/skbuff.h skb_frag_must_loop() does similar). But going forward, perhaps CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is too surprising, has outlived its usefulness, and should just be removed? 
Fixes: 908a1ad89466 ("iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()") Signed-off-by: Hugh Dickins Link: https://lore.kernel.org/r/dd5f0c89-186e-18e1-4f43-19a60f5a9774@google.com Reviewed-by: Christoph Hellwig Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner --- lib/iov_iter.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/lib/iov_iter.c b/lib/iov_iter.c index cc4b5541eef8..908e75a28d90 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -461,6 +461,8 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset, size_t bytes, struct iov_iter *i) { size_t n, copied = 0; + bool uses_kmap = IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP) || + PageHighMem(page); if (!page_copy_sane(page, offset, bytes)) return 0; @@ -471,7 +473,7 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset, char *p; n = bytes - copied; - if (PageHighMem(page)) { + if (uses_kmap) { page += offset / PAGE_SIZE; offset %= PAGE_SIZE; n = min_t(size_t, n, PAGE_SIZE - offset); @@ -482,7 +484,7 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset, kunmap_atomic(p); copied += n; offset += n; - } while (PageHighMem(page) && copied != bytes && n > 0); + } while (uses_kmap && copied != bytes && n > 0); return copied; } -- cgit From d658d59471ed80c4a8aaf082ccc3e83cdf5ae4c1 Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Mon, 28 Oct 2024 07:49:59 +0200 Subject: tpm: Return tpm2_sessions_init() when null key creation fails Do not continue tpm2_sessions_init() further if the null key pair creation fails. Cc: stable@vger.kernel.org # v6.10+ Fixes: d2add27cf2b8 ("tpm: Add NULL primary creation") Reviewed-by: Stefan Berger Signed-off-by: Jarkko Sakkinen --- drivers/char/tpm/tpm2-sessions.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c index 511c67061728..bb7b94b32c46 100644 --- a/drivers/char/tpm/tpm2-sessions.c +++ b/drivers/char/tpm/tpm2-sessions.c @@ -1347,14 +1347,21 @@ static int tpm2_create_null_primary(struct tpm_chip *chip) * * Derive and context save the null primary and allocate memory in the * struct tpm_chip for the authorizations. + * + * Return: + * * 0 - OK + * * -errno - A system error + * * TPM_RC - A TPM error */ int tpm2_sessions_init(struct tpm_chip *chip) { int rc; rc = tpm2_create_null_primary(chip); - if (rc) - dev_err(&chip->dev, "TPM: security failed (NULL seed derivation): %d\n", rc); + if (rc) { + dev_err(&chip->dev, "null key creation failed with %d\n", rc); + return rc; + } chip->auth = kmalloc(sizeof(*chip->auth), GFP_KERNEL); if (!chip->auth) -- cgit From cc7d8594342a25693d40fe96f97e5c6c29ee609c Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Mon, 28 Oct 2024 07:50:00 +0200 Subject: tpm: Rollback tpm2_load_null() Do not continue on tpm2_create_primary() failure in tpm2_load_null(). 
Cc: stable@vger.kernel.org # v6.10+ Fixes: eb24c9788cd9 ("tpm: disable the TPM if NULL name changes") Reviewed-by: Stefan Berger Signed-off-by: Jarkko Sakkinen --- drivers/char/tpm/tpm2-sessions.c | 44 ++++++++++++++++++++++------------------ 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c index bb7b94b32c46..4821be276670 100644 --- a/drivers/char/tpm/tpm2-sessions.c +++ b/drivers/char/tpm/tpm2-sessions.c @@ -915,33 +915,37 @@ static int tpm2_parse_start_auth_session(struct tpm2_auth *auth, static int tpm2_load_null(struct tpm_chip *chip, u32 *null_key) { - int rc; unsigned int offset = 0; /* dummy offset for null seed context */ u8 name[SHA256_DIGEST_SIZE + 2]; + u32 tmp_null_key; + int rc; rc = tpm2_load_context(chip, chip->null_key_context, &offset, - null_key); - if (rc != -EINVAL) - return rc; + &tmp_null_key); + if (rc != -EINVAL) { + if (!rc) + *null_key = tmp_null_key; + goto err; + } - /* an integrity failure may mean the TPM has been reset */ - dev_err(&chip->dev, "NULL key integrity failure!\n"); - /* check the null name against what we know */ - tpm2_create_primary(chip, TPM2_RH_NULL, NULL, name); - if (memcmp(name, chip->null_key_name, sizeof(name)) == 0) - /* name unchanged, assume transient integrity failure */ - return rc; - /* - * Fatal TPM failure: the NULL seed has actually changed, so - * the TPM must have been illegally reset. All in-kernel TPM - * operations will fail because the NULL primary can't be - * loaded to salt the sessions, but disable the TPM anyway so - * userspace programmes can't be compromised by it. - */ - dev_err(&chip->dev, "NULL name has changed, disabling TPM due to interference\n"); + /* Try to re-create null key, given the integrity failure: */ + rc = tpm2_create_primary(chip, TPM2_RH_NULL, &tmp_null_key, name); + if (rc) + goto err; + + /* Return null key if the name has not been changed: */ + if (!memcmp(name, chip->null_key_name, sizeof(name))) { + *null_key = tmp_null_key; + return 0; + } + + /* Deduce from the name change TPM interference: */ + dev_err(&chip->dev, "null key integrity check failed\n"); + tpm2_flush_context(chip, tmp_null_key); chip->flags |= TPM_CHIP_FLAG_DISABLE; - return rc; +err: + return rc ? -ENODEV : 0; } /** -- cgit From 21a3a3d015aeee2402d14b425197d70aa3bd0d91 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Mon, 28 Oct 2024 10:55:09 -0300 Subject: tools headers: Synchronize {uapi/}linux/bits.h with the kernel sources To pick up the changes in this cset: 947697c6f0f75f98 ("uapi: Define GENMASK_U128") This addresses these perf build warnings: Warning: Kernel ABI header differences: diff -u tools/include/uapi/linux/bits.h include/uapi/linux/bits.h diff -u tools/include/linux/bits.h include/linux/bits.h Please see tools/include/uapi/README for further details. 
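For reference, the arithmetic these headers implement builds a mask of the contiguous bits l..h inclusive, so for example GENMASK(7, 4) is 0xf0, and the 128-bit variant only repeats the same construction with unsigned __int128 (which is why it cannot be used from assembly). A quick user-space check of the classic 64-bit formula; the macro below is a simplified re-derivation for illustration, not a copy of the uapi header.

  #include <stdio.h>

  #define BITS_PER_LONG_LONG 64

  /* Simplified: a mask with bits l..h (inclusive) of an unsigned long long set. */
  #define EX_GENMASK_ULL(h, l) \
          ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> (BITS_PER_LONG_LONG - 1 - (h))))

  int main(void)
  {
          printf("GENMASK_ULL(7, 4)   = 0x%llx\n", EX_GENMASK_ULL(7, 4));    /* 0xf0 */
          printf("GENMASK_ULL(15, 0)  = 0x%llx\n", EX_GENMASK_ULL(15, 0));   /* 0xffff */
          printf("GENMASK_ULL(63, 32) = 0x%llx\n", EX_GENMASK_ULL(63, 32));  /* 0xffffffff00000000 */
          return 0;
  }
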
Acked-by: Yury Norov Cc: Adrian Hunter Cc: Anshuman Khandual Cc: Ian Rogers Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Link: https://lore.kernel.org/lkml/Zx-ZVH7bHqtFn8Dv@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/include/linux/bits.h | 15 +++++++++++++++ tools/include/uapi/linux/bits.h | 3 +++ 2 files changed, 18 insertions(+) diff --git a/tools/include/linux/bits.h b/tools/include/linux/bits.h index 0eb24d21aac2..60044b608817 100644 --- a/tools/include/linux/bits.h +++ b/tools/include/linux/bits.h @@ -36,4 +36,19 @@ #define GENMASK_ULL(h, l) \ (GENMASK_INPUT_CHECK(h, l) + __GENMASK_ULL(h, l)) +#if !defined(__ASSEMBLY__) +/* + * Missing asm support + * + * __GENMASK_U128() depends on _BIT128() which would not work + * in the asm code, as it shifts an 'unsigned __init128' data + * type instead of direct representation of 128 bit constants + * such as long and unsigned long. The fundamental problem is + * that a 128 bit constant will get silently truncated by the + * gcc compiler. + */ +#define GENMASK_U128(h, l) \ + (GENMASK_INPUT_CHECK(h, l) + __GENMASK_U128(h, l)) +#endif + #endif /* __LINUX_BITS_H */ diff --git a/tools/include/uapi/linux/bits.h b/tools/include/uapi/linux/bits.h index 3c2a101986a3..5ee30f882736 100644 --- a/tools/include/uapi/linux/bits.h +++ b/tools/include/uapi/linux/bits.h @@ -12,4 +12,7 @@ (((~_ULL(0)) - (_ULL(1) << (l)) + 1) & \ (~_ULL(0) >> (__BITS_PER_LONG_LONG - 1 - (h)))) +#define __GENMASK_U128(h, l) \ + ((_BIT128((h)) << 1) - (_BIT128(l))) + #endif /* _UAPI_LINUX_BITS_H */ -- cgit From 93e4b86b3e74e19c95b762cfeb42baa0a94f212f Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Mon, 28 Oct 2024 11:13:57 -0300 Subject: tools headers arm64: Sync arm64's cputype.h with the kernel sources To get the changes in: 924725707d80bc25 ("arm64: cputype: Add Neoverse-N3 definitions") That makes this perf source code to be rebuilt: CC /tmp/build/perf-tools/util/arm-spe.o The changes in the above patch add MIDR_NEOVERSE_N3, that probably need changes in arm-spe.c, so probably we need to add it to that array? Or maybe we need to leave this for later when this is all tested on those machines? static const struct midr_range neoverse_spe[] = { MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1), MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2), MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1), {}, }; Mark Rutland recommended about arm-spe.c in a previous update to this file: "I would not touch this for now -- someone would have to go audit the TRMs to check that those other cores have the same encoding, and I think it'd be better to do that as a follow-up." 
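As background for what these defines encode: MIDR_EL1 carries, among other fields, an implementer code in bits [31:24] and a part number in bits [15:4], and MIDR_CPU_MODEL() composes the two so tables such as neoverse_spe[] can match on a specific core. A stand-alone sketch of that packing; the helper names are invented, only the field positions follow the architected layout.

  #include <stdio.h>

  /* MIDR_EL1 field positions: Implementer [31:24], PartNum [15:4]. */
  #define EX_MIDR(imp, part)      (((unsigned int)(imp) << 24) | ((unsigned int)(part) << 4))
  #define EX_MIDR_IMPLEMENTER(m)  (((m) >> 24) & 0xff)
  #define EX_MIDR_PARTNUM(m)      (((m) >> 4) & 0xfff)

  #define EX_IMP_ARM              0x41    /* 'A' */
  #define EX_PART_NEOVERSE_N3     0xD8E   /* the part number added by this sync */

  int main(void)
  {
          unsigned int midr = EX_MIDR(EX_IMP_ARM, EX_PART_NEOVERSE_N3);

          printf("midr = 0x%08x, implementer = 0x%02x, part = 0x%03x\n",
                 midr, EX_MIDR_IMPLEMENTER(midr), EX_MIDR_PARTNUM(midr));
          return 0;
  }
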
That addresses this perf build warning: Warning: Kernel ABI header differences: diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h Cc: Adrian Hunter Cc: Catalin Marinas Cc: Ian Rogers Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Namhyung Kim Link: https://lore.kernel.org/lkml/Zx-dffKdGsgkhG96@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/arch/arm64/include/asm/cputype.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h index 5a7dfeb8e8eb..488f8e751349 100644 --- a/tools/arch/arm64/include/asm/cputype.h +++ b/tools/arch/arm64/include/asm/cputype.h @@ -94,6 +94,7 @@ #define ARM_CPU_PART_NEOVERSE_V3 0xD84 #define ARM_CPU_PART_CORTEX_X925 0xD85 #define ARM_CPU_PART_CORTEX_A725 0xD87 +#define ARM_CPU_PART_NEOVERSE_N3 0xD8E #define APM_CPU_PART_XGENE 0x000 #define APM_CPU_VAR_POTENZA 0x00 @@ -176,6 +177,7 @@ #define MIDR_NEOVERSE_V3 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V3) #define MIDR_CORTEX_X925 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X925) #define MIDR_CORTEX_A725 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A725) +#define MIDR_NEOVERSE_N3 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N3) #define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX) #define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX) #define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX) -- cgit From 55f1b540d893da740a81200450014c45a8103f54 Mon Sep 17 00:00:00 2001 From: Arnaldo Carvalho de Melo Date: Mon, 28 Oct 2024 12:24:37 -0300 Subject: tools headers: Update the linux/unaligned.h copy with the kernel sources To pick up the changes in: 7f053812dab3946c ("random: vDSO: minimize and simplify header includes") That required adding a copy of include/vdso/unaligned.h and its checking in tools/perf/check-headers.h. Addressing this perf tools build warning: Warning: Kernel ABI header differences: diff -u tools/include/linux/unaligned.h include/linux/unaligned.h Please see tools/include/uapi/README for further details. Cc: Adrian Hunter Cc: Christophe Leroy Cc: Ian Rogers Cc: Jason A. 
Donenfeld Cc: Jiri Olsa Cc: Kan Liang Cc: Namhyung Kim Link: https://lore.kernel.org/lkml/Zx-uHvAbPAESofEN@x1 Signed-off-by: Arnaldo Carvalho de Melo --- tools/include/linux/unaligned.h | 11 +---------- tools/include/vdso/unaligned.h | 15 +++++++++++++++ tools/perf/check-headers.sh | 1 + 3 files changed, 17 insertions(+), 10 deletions(-) create mode 100644 tools/include/vdso/unaligned.h diff --git a/tools/include/linux/unaligned.h b/tools/include/linux/unaligned.h index bc0633bc4650..395a4464fe73 100644 --- a/tools/include/linux/unaligned.h +++ b/tools/include/linux/unaligned.h @@ -9,16 +9,7 @@ #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wpacked" #pragma GCC diagnostic ignored "-Wattributes" - -#define __get_unaligned_t(type, ptr) ({ \ - const struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr); \ - __pptr->x; \ -}) - -#define __put_unaligned_t(type, val, ptr) do { \ - struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr); \ - __pptr->x = (val); \ -} while (0) +#include #define get_unaligned(ptr) __get_unaligned_t(typeof(*(ptr)), (ptr)) #define put_unaligned(val, ptr) __put_unaligned_t(typeof(*(ptr)), (val), (ptr)) diff --git a/tools/include/vdso/unaligned.h b/tools/include/vdso/unaligned.h new file mode 100644 index 000000000000..eee3d2a4dbe4 --- /dev/null +++ b/tools/include/vdso/unaligned.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __VDSO_UNALIGNED_H +#define __VDSO_UNALIGNED_H + +#define __get_unaligned_t(type, ptr) ({ \ + const struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr); \ + __pptr->x; \ +}) + +#define __put_unaligned_t(type, val, ptr) do { \ + struct { type x; } __packed *__pptr = (typeof(__pptr))(ptr); \ + __pptr->x = (val); \ +} while (0) + +#endif /* __VDSO_UNALIGNED_H */ diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh index 29adbb423327..a05c1c105c51 100755 --- a/tools/perf/check-headers.sh +++ b/tools/perf/check-headers.sh @@ -22,6 +22,7 @@ FILES=( "include/vdso/bits.h" "include/linux/const.h" "include/vdso/const.h" + "include/vdso/unaligned.h" "include/linux/hash.h" "include/linux/list-sort.h" "include/uapi/linux/hw_breakpoint.h" -- cgit From a5384c426744ebe41dafc6e5fa3acecc05e43462 Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Fri, 25 Oct 2024 22:54:48 -0700 Subject: perf cap: Add __NR_capget to arch/x86 unistd As there are duplicated kernel headers in tools/include libc can pick up the wrong definitions. This was causing the wrong system call for capget in perf. 
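The root problem is simply that a raw syscall is issued by number, so the number compiled in has to be the right one for the running kernel; with a mismatched header copy the call silently becomes a different syscall. The capability read itself is a plain two-argument syscall, roughly as in the user-space sketch below, which uses the libc syscall(2) wrapper and the uapi capability structs (error handling trimmed):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/capability.h>

  int main(void)
  {
          struct __user_cap_header_struct header = {
                  .version = _LINUX_CAPABILITY_VERSION_3,
                  .pid = 0,                       /* 0 means the calling thread */
          };
          struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

          memset(data, 0, sizeof(data));
          if (syscall(SYS_capget, &header, data) < 0) {
                  perror("capget");
                  return 1;
          }
          printf("effective[0] = 0x%08x\n", data[0].effective);
          return 0;
  }
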
Reported-by: Adrian Hunter Fixes: e25ebda78e230283 ("perf cap: Tidy up and improve capability testing") Closes: https://lore.kernel.org/lkml/cc7d6bdf-1aeb-4179-9029-4baf50b59342@intel.com/ Signed-off-by: Ian Rogers Tested-by: Adrian Hunter Cc: Alexander Shishkin Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kan Liang Cc: Mark Rutland Cc: Namhyung Kim Cc: Peter Zijlstra Link: https://lore.kernel.org/r/20241026055448.312247-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/arch/x86/include/uapi/asm/unistd_32.h | 3 +++ tools/arch/x86/include/uapi/asm/unistd_64.h | 3 +++ tools/perf/util/cap.c | 10 +++------- 3 files changed, 9 insertions(+), 7 deletions(-) diff --git a/tools/arch/x86/include/uapi/asm/unistd_32.h b/tools/arch/x86/include/uapi/asm/unistd_32.h index 9de35df1afc3..63182a023e9d 100644 --- a/tools/arch/x86/include/uapi/asm/unistd_32.h +++ b/tools/arch/x86/include/uapi/asm/unistd_32.h @@ -11,6 +11,9 @@ #ifndef __NR_getpgid #define __NR_getpgid 132 #endif +#ifndef __NR_capget +#define __NR_capget 184 +#endif #ifndef __NR_gettid #define __NR_gettid 224 #endif diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h index d0f2043d7132..77311e8d1b5d 100644 --- a/tools/arch/x86/include/uapi/asm/unistd_64.h +++ b/tools/arch/x86/include/uapi/asm/unistd_64.h @@ -11,6 +11,9 @@ #ifndef __NR_getpgid #define __NR_getpgid 121 #endif +#ifndef __NR_capget +#define __NR_capget 125 +#endif #ifndef __NR_gettid #define __NR_gettid 186 #endif diff --git a/tools/perf/util/cap.c b/tools/perf/util/cap.c index 7574a67651bc..69d9a2bcd40b 100644 --- a/tools/perf/util/cap.c +++ b/tools/perf/util/cap.c @@ -7,13 +7,9 @@ #include "debug.h" #include #include -#include #include #include - -#ifndef SYS_capget -#define SYS_capget 90 -#endif +#include #define MAX_LINUX_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3 @@ -21,9 +17,9 @@ bool perf_cap__capable(int cap, bool *used_root) { struct __user_cap_header_struct header = { .version = _LINUX_CAPABILITY_VERSION_3, - .pid = getpid(), + .pid = 0, }; - struct __user_cap_data_struct data[MAX_LINUX_CAPABILITY_U32S]; + struct __user_cap_data_struct data[MAX_LINUX_CAPABILITY_U32S] = {}; __u32 cap_val; *used_root = false; -- cgit From c1895ba181e560144601fafe46aeedbafdf4dbc4 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Sat, 26 Oct 2024 16:36:15 +0200 Subject: ASoC: Intel: sst: Fix used of uninitialized ctx to log an error Fix the new "LPE0F28" code path using the uninitialized ctx variable to log an error. 
Fixes: 6668610b4d8c ("ASoC: Intel: sst: Support LPE0F28 ACPI HID") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202410261106.EBx49ssy-lkp@intel.com/ Signed-off-by: Hans de Goede Link: https://patch.msgid.link/20241026143615.171821-1-hdegoede@redhat.com Signed-off-by: Mark Brown --- sound/soc/intel/atom/sst/sst_acpi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/soc/intel/atom/sst/sst_acpi.c b/sound/soc/intel/atom/sst/sst_acpi.c index f4c4774249ee..257180630475 100644 --- a/sound/soc/intel/atom/sst/sst_acpi.c +++ b/sound/soc/intel/atom/sst/sst_acpi.c @@ -308,7 +308,7 @@ static int sst_acpi_probe(struct platform_device *pdev) rsrc = platform_get_resource(pdev, IORESOURCE_MEM, pdata->res_info->acpi_lpe_res_index); if (!rsrc) { - dev_err(ctx->dev, "Invalid SHIM base\n"); + dev_err(dev, "Invalid SHIM base\n"); return -EIO; } rsrc->start -= pdata->res_info->shim_offset; -- cgit From 746ae46c11137ba21f0c0c68f082a9d8c1222c78 Mon Sep 17 00:00:00 2001 From: Matthew Brost Date: Wed, 23 Oct 2024 16:59:17 -0700 Subject: drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM drm_gpu_scheduler.submit_wq is used to submit jobs, jobs are in the path of dma-fences, and dma-fences are in the path of reclaim. Mark scheduler work queue with WQ_MEM_RECLAIM to ensure forward progress during reclaim; without WQ_MEM_RECLAIM, work queues cannot make forward progress during reclaim. v2: - Fixes tags (Philipp) - Reword commit message (Philipp) Cc: Luben Tuikov Cc: Danilo Krummrich Cc: Philipp Stanner Cc: stable@vger.kernel.org Fixes: 34f50cc6441b ("drm/sched: Use drm sched lockdep map for submit_wq") Fixes: a6149f039369 ("drm/sched: Convert drm scheduler to use a work queue rather than kthread") Signed-off-by: Matthew Brost Acked-by: Nirmoy Das Reviewed-by: Philipp Stanner Link: https://patchwork.freedesktop.org/patch/msgid/20241023235917.1836428-1-matthew.brost@intel.com Signed-off-by: Rodrigo Vivi --- drivers/gpu/drm/scheduler/sched_main.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index eaef20f41786..e97c6c60bc96 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -1276,10 +1276,11 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, sched->own_submit_wq = false; } else { #ifdef CONFIG_LOCKDEP - sched->submit_wq = alloc_ordered_workqueue_lockdep_map(name, 0, + sched->submit_wq = alloc_ordered_workqueue_lockdep_map(name, + WQ_MEM_RECLAIM, &drm_sched_lockdep_map); #else - sched->submit_wq = alloc_ordered_workqueue(name, 0); + sched->submit_wq = alloc_ordered_workqueue(name, WQ_MEM_RECLAIM); #endif if (!sched->submit_wq) return -ENOMEM; -- cgit From be0e822bb3f5259c7f9424ba97e8175211288813 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 28 Oct 2024 10:07:48 +0100 Subject: block: fix queue limits checks in blk_rq_map_user_bvec for real blk_rq_map_user_bvec currently only has ad-hoc checks for queue limits, and the last fix to it enabled valid NVMe I/O to pass, but also allowed invalid one for drivers that set a max_segment_size or seg_boundary limit. Fix it once for all by using the bio_split_rw_at helper from the I/O path that indicates if and where a bio would be have to be split to adhere to the queue limits, and it returns a positive value, turn that into -EREMOTEIO to retry using the copy path. 
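The simplification hinges on the return convention of the split check: a negative value is a hard error, zero means the layout already fits the queue limits, and a positive value is the point at which the bio would have to be split, which for a passthrough mapping translates into "fall back to the copy path". A minimal sketch of consuming such a tri-state result; check_layout() below is a stand-in for illustration, not the block-layer helper.

  #include <stdio.h>
  #include <errno.h>

  /* Stand-in for a "check against the limits" helper:
   *   < 0  -> invalid request
   *   == 0 -> fits as-is
   *   > 0  -> would need splitting at that byte offset */
  static int check_layout(int nr_bytes, int max_bytes)
  {
          if (nr_bytes <= 0)
                  return -EINVAL;
          if (nr_bytes > max_bytes)
                  return max_bytes;       /* the split point */
          return 0;
  }

  static int map_user_bvec(int nr_bytes, int max_bytes)
  {
          int ret = check_layout(nr_bytes, max_bytes);

          if (ret > 0)                    /* would split: retry via the copy path */
                  return -EREMOTEIO;
          return ret;                     /* 0 on success, -errno on a hard error */
  }

  int main(void)
  {
          printf("%d %d %d\n",
                 map_user_bvec(512, 4096),        /* 0 */
                 map_user_bvec(8192, 4096),       /* -EREMOTEIO */
                 map_user_bvec(0, 4096));         /* -EINVAL */
          return 0;
  }
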
Fixes: 2ff949441802 ("block: fix sanity checks in blk_rq_map_user_bvec") Signed-off-by: Christoph Hellwig Reviewed-by: John Garry Link: https://lore.kernel.org/r/20241028090840.446180-1-hch@lst.de Signed-off-by: Jens Axboe --- block/blk-map.c | 56 +++++++++++++++++--------------------------------------- 1 file changed, 17 insertions(+), 39 deletions(-) diff --git a/block/blk-map.c b/block/blk-map.c index 6ef2ec1f7d78..b5fd1d857461 100644 --- a/block/blk-map.c +++ b/block/blk-map.c @@ -561,55 +561,33 @@ EXPORT_SYMBOL(blk_rq_append_bio); /* Prepare bio for passthrough IO given ITER_BVEC iter */ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter) { - struct request_queue *q = rq->q; - size_t nr_iter = iov_iter_count(iter); - size_t nr_segs = iter->nr_segs; - struct bio_vec *bvecs, *bvprvp = NULL; - const struct queue_limits *lim = &q->limits; - unsigned int nsegs = 0, bytes = 0; + const struct queue_limits *lim = &rq->q->limits; + unsigned int max_bytes = lim->max_hw_sectors << SECTOR_SHIFT; + unsigned int nsegs; struct bio *bio; - size_t i; + int ret; - if (!nr_iter || (nr_iter >> SECTOR_SHIFT) > queue_max_hw_sectors(q)) - return -EINVAL; - if (nr_segs > queue_max_segments(q)) + if (!iov_iter_count(iter) || iov_iter_count(iter) > max_bytes) return -EINVAL; - /* no iovecs to alloc, as we already have a BVEC iterator */ + /* reuse the bvecs from the iterator instead of allocating new ones */ bio = blk_rq_map_bio_alloc(rq, 0, GFP_KERNEL); - if (bio == NULL) + if (!bio) return -ENOMEM; - bio_iov_bvec_set(bio, (struct iov_iter *)iter); - blk_rq_bio_prep(rq, bio, nr_segs); - - /* loop to perform a bunch of sanity checks */ - bvecs = (struct bio_vec *)iter->bvec; - for (i = 0; i < nr_segs; i++) { - struct bio_vec *bv = &bvecs[i]; - - /* - * If the queue doesn't support SG gaps and adding this - * offset would create a gap, fallback to copy. - */ - if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset)) { - blk_mq_map_bio_put(bio); - return -EREMOTEIO; - } - /* check full condition */ - if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len) - goto put_bio; - if (bytes + bv->bv_len > nr_iter) - break; - nsegs++; - bytes += bv->bv_len; - bvprvp = bv; + /* check that the data layout matches the hardware restrictions */ + ret = bio_split_rw_at(bio, lim, &nsegs, max_bytes); + if (ret) { + /* if we would have to split the bio, copy instead */ + if (ret > 0) + ret = -EREMOTEIO; + blk_mq_map_bio_put(bio); + return ret; } + + blk_rq_bio_prep(rq, bio, nsegs); return 0; -put_bio: - blk_mq_map_bio_put(bio); - return -EINVAL; } /** -- cgit From 1b6063a57754eae5705753c01e78dc268b989038 Mon Sep 17 00:00:00 2001 From: Ovidiu Bunea Date: Fri, 11 Oct 2024 11:12:19 -0400 Subject: Revert "drm/amd/display: update DML2 policy EnhancedPrefetchScheduleAccelerationFinal DCN35" This reverts commit 9dad21f910fc ("drm/amd/display: update DML2 policy EnhancedPrefetchScheduleAccelerationFinal DCN35") [why & how] The offending commit exposes a hang with lid close/open behavior. Both issues seem to be related to ODM 2:1 mode switching, so there is another issue generic to that sequence that needs to be investigated. 
Cc: Mario Limonciello Cc: Alex Deucher Reviewed-by: Nicholas Kazlauskas Signed-off-by: Ovidiu Bunea Signed-off-by: Tom Chung Tested-by: Daniel Wheeler Signed-off-by: Alex Deucher (cherry picked from commit 68bf95317ebf2cfa7105251e4279e951daceefb7) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/display/dc/dml2/dml2_policy.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml2_policy.c b/drivers/gpu/drm/amd/display/dc/dml2/dml2_policy.c index 11c904ae2958..c4c52173ef22 100644 --- a/drivers/gpu/drm/amd/display/dc/dml2/dml2_policy.c +++ b/drivers/gpu/drm/amd/display/dc/dml2/dml2_policy.c @@ -303,6 +303,7 @@ void build_unoptimized_policy_settings(enum dml_project_id project, struct dml_m if (project == dml_project_dcn35 || project == dml_project_dcn351) { policy->DCCProgrammingAssumesScanDirectionUnknownFinal = false; + policy->EnhancedPrefetchScheduleAccelerationFinal = 0; policy->AllowForPStateChangeOrStutterInVBlankFinal = dml_prefetch_support_uclk_fclk_and_stutter_if_possible; /*new*/ policy->UseOnlyMaxPrefetchModes = 1; } -- cgit From 4aa923a6e6406b43566ef6ac35a3d9a3197fa3e8 Mon Sep 17 00:00:00 2001 From: Tvrtko Ursulin Date: Fri, 25 Oct 2024 15:56:39 +0100 Subject: drm/amd/pm: Vangogh: Fix kernel memory out of bounds write KASAN reports that the GPU metrics table allocated in vangogh_tables_init() is not large enough for the memset done in smu_cmn_init_soft_gpu_metrics(). Condensed report follows: [ 33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu] [ 33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067 ... [ 33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G W 6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544 [ 33.861816] Tainted: [W]=WARN [ 33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023 [ 33.861822] Call Trace: [ 33.861826] [ 33.861829] dump_stack_lvl+0x66/0x90 [ 33.861838] print_report+0xce/0x620 [ 33.861853] kasan_report+0xda/0x110 [ 33.862794] kasan_check_range+0xfd/0x1a0 [ 33.862799] __asan_memset+0x23/0x40 [ 33.862803] smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.863306] vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.864257] vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.865682] amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.866160] amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.867135] dev_attr_show+0x43/0xc0 [ 33.867147] sysfs_kf_seq_show+0x1f1/0x3b0 [ 33.867155] seq_read_iter+0x3f8/0x1140 [ 33.867173] vfs_read+0x76c/0xc50 [ 33.867198] ksys_read+0xfb/0x1d0 [ 33.867214] do_syscall_64+0x90/0x160 ... [ 33.867353] Allocated by task 378 on cpu 7 at 22.794876s: [ 33.867358] kasan_save_stack+0x33/0x50 [ 33.867364] kasan_save_track+0x17/0x60 [ 33.867367] __kasan_kmalloc+0x87/0x90 [ 33.867371] vangogh_init_smc_tables+0x3f9/0x840 [amdgpu] [ 33.867835] smu_sw_init+0xa32/0x1850 [amdgpu] [ 33.868299] amdgpu_device_init+0x467b/0x8d90 [amdgpu] [ 33.868733] amdgpu_driver_load_kms+0x19/0xf0 [amdgpu] [ 33.869167] amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu] [ 33.869608] local_pci_probe+0xda/0x180 [ 33.869614] pci_device_probe+0x43f/0x6b0 Empirically we can confirm that the former allocates 152 bytes for the table, while the latter memsets the 168 large block. 
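For illustration, a minimal user-space sketch of the sizing rule the fix below applies: the metrics buffer must be allocated to the largest of all struct versions that may be written into it. The struct names and sizes here are hypothetical stand-ins, not the real gpu_metrics layouts.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical stand-ins for the versioned metrics layouts */
struct metrics_v2_2 { char payload[128]; };
struct metrics_v2_3 { char payload[152]; };
struct metrics_v2_4 { char payload[168]; };

int main(void)
{
	size_t size = sizeof(struct metrics_v2_2);

	/* grow the allocation to cover every version we might fill in */
	if (sizeof(struct metrics_v2_3) > size)
		size = sizeof(struct metrics_v2_3);
	if (sizeof(struct metrics_v2_4) > size)
		size = sizeof(struct metrics_v2_4);

	void *table = calloc(1, size);
	if (!table)
		return 1;

	/* a memset of the largest version now stays in bounds */
	memset(table, 0, sizeof(struct metrics_v2_4));
	printf("table size: %zu bytes\n", size);
	free(table);
	return 0;
}
```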
Root cause appears that when GPU metrics tables for v2_4 parts were added it was not considered to enlarge the table to fit. The fix in this patch is rather "brute force" and perhaps later should be done in a smarter way, by extracting and consolidating the part version to size logic to a common helper, instead of brute forcing the largest possible allocation. Nevertheless, for now this works and fixes the out of bounds write. v2: * Drop impossible v3_0 case. (Mario) Signed-off-by: Tvrtko Ursulin Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics") Cc: Mario Limonciello Cc: Evan Quan Cc: Wenyou Yang Cc: Alex Deucher Reviewed-by: Mario Limonciello Link: https://lore.kernel.org/r/20241025145639.19124-1-tursulin@igalia.com Signed-off-by: Mario Limonciello Signed-off-by: Alex Deucher (cherry picked from commit 0880f58f9609f0200483a49429af0f050d281703) Cc: stable@vger.kernel.org # v6.6+ --- drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c index 22737b11b1bf..1fe020f1f4db 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c @@ -242,7 +242,9 @@ static int vangogh_tables_init(struct smu_context *smu) goto err0_out; smu_table->metrics_time = 0; - smu_table->gpu_metrics_table_size = max(sizeof(struct gpu_metrics_v2_3), sizeof(struct gpu_metrics_v2_2)); + smu_table->gpu_metrics_table_size = sizeof(struct gpu_metrics_v2_2); + smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_3)); + smu_table->gpu_metrics_table_size = max(smu_table->gpu_metrics_table_size, sizeof(struct gpu_metrics_v2_4)); smu_table->gpu_metrics_table = kzalloc(smu_table->gpu_metrics_table_size, GFP_KERNEL); if (!smu_table->gpu_metrics_table) goto err1_out; -- cgit From 935abb86a95def8c20dbb184ce30051db168e541 Mon Sep 17 00:00:00 2001 From: Alex Deucher Date: Wed, 23 Oct 2024 09:13:21 -0400 Subject: drm/amdgpu/smu13: fix profile reporting The following 3 commits landed in parallel: commit d7d2688bf4ea ("drm/amd/pm: update workload mask after the setting") commit 7a1613e47e65 ("drm/amdgpu/smu13: always apply the powersave optimization") commit 7c210ca5a2d7 ("drm/amdgpu: handle default profile on on devices without fullscreen 3D") While everything is set correctly, this caused the profile to be reported incorrectly because both the powersave and fullscreen3d bits were set in the mask and when the driver prints the profile, it looks for the first bit set. 
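For illustration, a minimal user-space sketch of the reporting problem: once an optimization bit is OR-ed into the mask that gets applied, reporting the profile from that mask by taking the first set bit can return the wrong entry, so the user-selected mask has to be remembered separately. The bit positions below are made up for the example and are not the real SMU workload types.

```c
#include <stdio.h>
#include <strings.h>	/* ffs() */

/* hypothetical bit positions, not the real workload definitions */
#define WL_FULLSCREEN3D	(1u << 1)
#define WL_POWERSAVE	(1u << 5)

int main(void)
{
	unsigned int selected = WL_POWERSAVE;			/* what the user asked for */
	unsigned int applied  = selected | WL_FULLSCREEN3D;	/* what is sent to firmware */

	/* reporting from the applied mask picks the lowest set bit: wrong profile */
	printf("reported from applied mask:  bit %d\n", ffs(applied) - 1);

	/* keeping the selected mask separately reports the right one */
	printf("reported from selected mask: bit %d\n", ffs(selected) - 1);
	return 0;
}
```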
Fixes: d7d2688bf4ea ("drm/amd/pm: update workload mask after the setting") Reviewed-by: Kenneth Feng Signed-off-by: Alex Deucher (cherry picked from commit ecfe9b237687a55d596fff0650ccc8cc455edd3f) Cc: stable@vger.kernel.org --- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c index cb923e33fd6f..d53e162dcd8d 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c @@ -2485,7 +2485,7 @@ static int smu_v13_0_0_set_power_profile_mode(struct smu_context *smu, DpmActivityMonitorCoeffInt_t *activity_monitor = &(activity_monitor_external.DpmActivityMonitorCoeffInt); int workload_type, ret = 0; - u32 workload_mask; + u32 workload_mask, selected_workload_mask; smu->power_profile_mode = input[size]; @@ -2552,7 +2552,7 @@ static int smu_v13_0_0_set_power_profile_mode(struct smu_context *smu, if (workload_type < 0) return -EINVAL; - workload_mask = 1 << workload_type; + selected_workload_mask = workload_mask = 1 << workload_type; /* Add optimizations for SMU13.0.0/10. Reuse the power saving profile */ if ((amdgpu_ip_version(smu->adev, MP1_HWIP, 0) == IP_VERSION(13, 0, 0) && @@ -2572,7 +2572,7 @@ static int smu_v13_0_0_set_power_profile_mode(struct smu_context *smu, workload_mask, NULL); if (!ret) - smu->workload_mask = workload_mask; + smu->workload_mask = selected_workload_mask; return ret; } -- cgit From df745e25098dcb2f706399c0d06dd8d1bab6b6ec Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Mon, 28 Oct 2024 07:50:01 +0200 Subject: tpm: Lazily flush the auth session Move the allocation of chip->auth to tpm2_start_auth_session() so that this field can be used as flag to tell whether auth session is active or not. Instead of flushing and reloading the auth session for every transaction separately, keep the session open unless /dev/tpm0 is used. 
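For illustration, a minimal user-space sketch of the pattern described above, where the allocation happens lazily and the pointer itself serves as the "session active" flag. Names are hypothetical and this is not the actual TPM structure layout.

```c
#include <stdlib.h>
#include <string.h>

struct session { char secret[32]; };

struct chip {
	struct session *auth;	/* non-NULL means an auth session is active */
};

static int start_session(struct chip *chip)
{
	if (chip->auth)			/* already active, nothing to do */
		return 0;
	chip->auth = calloc(1, sizeof(*chip->auth));
	return chip->auth ? 0 : -1;
}

static void end_session(struct chip *chip)
{
	if (!chip->auth)		/* lazily flushed: may already be gone */
		return;
	memset(chip->auth, 0, sizeof(*chip->auth));	/* wipe key material first */
	free(chip->auth);
	chip->auth = NULL;
}

int main(void)
{
	struct chip chip = { 0 };

	if (start_session(&chip))	/* opened on first use ... */
		return 1;
	end_session(&chip);		/* ... torn down only when really needed */
	end_session(&chip);		/* safe to call again: pointer doubles as the flag */
	return 0;
}
```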
Reported-by: Pengyu Ma Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219229 Cc: stable@vger.kernel.org # v6.10+ Fixes: 7ca110f2679b ("tpm: Address !chip->auth in tpm_buf_append_hmac_session*()") Tested-by: Pengyu Ma Tested-by: Stefan Berger Reviewed-by: Stefan Berger Signed-off-by: Jarkko Sakkinen --- drivers/char/tpm/tpm-chip.c | 10 +++++++++ drivers/char/tpm/tpm-dev-common.c | 3 +++ drivers/char/tpm/tpm-interface.c | 6 ++++-- drivers/char/tpm/tpm2-sessions.c | 45 +++++++++++++++++++++++---------------- 4 files changed, 44 insertions(+), 20 deletions(-) diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c index 854546000c92..1ff99a7091bb 100644 --- a/drivers/char/tpm/tpm-chip.c +++ b/drivers/char/tpm/tpm-chip.c @@ -674,6 +674,16 @@ EXPORT_SYMBOL_GPL(tpm_chip_register); */ void tpm_chip_unregister(struct tpm_chip *chip) { +#ifdef CONFIG_TCG_TPM2_HMAC + int rc; + + rc = tpm_try_get_ops(chip); + if (!rc) { + tpm2_end_auth_session(chip); + tpm_put_ops(chip); + } +#endif + tpm_del_legacy_sysfs(chip); if (tpm_is_hwrng_enabled(chip)) hwrng_unregister(&chip->hwrng); diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c index c3fbbf4d3db7..48ff87444f85 100644 --- a/drivers/char/tpm/tpm-dev-common.c +++ b/drivers/char/tpm/tpm-dev-common.c @@ -27,6 +27,9 @@ static ssize_t tpm_dev_transmit(struct tpm_chip *chip, struct tpm_space *space, struct tpm_header *header = (void *)buf; ssize_t ret, len; + if (chip->flags & TPM_CHIP_FLAG_TPM2) + tpm2_end_auth_session(chip); + ret = tpm2_prepare_space(chip, space, buf, bufsiz); /* If the command is not implemented by the TPM, synthesize a * response with a TPM2_RC_COMMAND_CODE return for user-space. diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c index 5da134f12c9a..8134f002b121 100644 --- a/drivers/char/tpm/tpm-interface.c +++ b/drivers/char/tpm/tpm-interface.c @@ -379,10 +379,12 @@ int tpm_pm_suspend(struct device *dev) rc = tpm_try_get_ops(chip); if (!rc) { - if (chip->flags & TPM_CHIP_FLAG_TPM2) + if (chip->flags & TPM_CHIP_FLAG_TPM2) { + tpm2_end_auth_session(chip); tpm2_shutdown(chip, TPM2_SU_STATE); - else + } else { rc = tpm1_pm_suspend(chip, tpm_suspend_pcr); + } tpm_put_ops(chip); } diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c index 4821be276670..0739830904b2 100644 --- a/drivers/char/tpm/tpm2-sessions.c +++ b/drivers/char/tpm/tpm2-sessions.c @@ -333,6 +333,9 @@ void tpm_buf_append_hmac_session(struct tpm_chip *chip, struct tpm_buf *buf, } #ifdef CONFIG_TCG_TPM2_HMAC + /* The first write to /dev/tpm{rm0} will flush the session. 
*/ + attributes |= TPM2_SA_CONTINUE_SESSION; + /* * The Architecture Guide requires us to strip trailing zeros * before computing the HMAC @@ -484,7 +487,8 @@ static void tpm2_KDFe(u8 z[EC_PT_SZ], const char *str, u8 *pt_u, u8 *pt_v, sha256_final(&sctx, out); } -static void tpm_buf_append_salt(struct tpm_buf *buf, struct tpm_chip *chip) +static void tpm_buf_append_salt(struct tpm_buf *buf, struct tpm_chip *chip, + struct tpm2_auth *auth) { struct crypto_kpp *kpp; struct kpp_request *req; @@ -543,7 +547,7 @@ static void tpm_buf_append_salt(struct tpm_buf *buf, struct tpm_chip *chip) sg_set_buf(&s[0], chip->null_ec_key_x, EC_PT_SZ); sg_set_buf(&s[1], chip->null_ec_key_y, EC_PT_SZ); kpp_request_set_input(req, s, EC_PT_SZ*2); - sg_init_one(d, chip->auth->salt, EC_PT_SZ); + sg_init_one(d, auth->salt, EC_PT_SZ); kpp_request_set_output(req, d, EC_PT_SZ); crypto_kpp_compute_shared_secret(req); kpp_request_free(req); @@ -554,8 +558,7 @@ static void tpm_buf_append_salt(struct tpm_buf *buf, struct tpm_chip *chip) * This works because KDFe fully consumes the secret before it * writes the salt */ - tpm2_KDFe(chip->auth->salt, "SECRET", x, chip->null_ec_key_x, - chip->auth->salt); + tpm2_KDFe(auth->salt, "SECRET", x, chip->null_ec_key_x, auth->salt); out: crypto_free_kpp(kpp); @@ -853,7 +856,9 @@ int tpm_buf_check_hmac_response(struct tpm_chip *chip, struct tpm_buf *buf, if (rc) /* manually close the session if it wasn't consumed */ tpm2_flush_context(chip, auth->handle); - memzero_explicit(auth, sizeof(*auth)); + + kfree_sensitive(auth); + chip->auth = NULL; } else { /* reset for next use */ auth->session = TPM_HEADER_SIZE; @@ -881,7 +886,8 @@ void tpm2_end_auth_session(struct tpm_chip *chip) return; tpm2_flush_context(chip, auth->handle); - memzero_explicit(auth, sizeof(*auth)); + kfree_sensitive(auth); + chip->auth = NULL; } EXPORT_SYMBOL(tpm2_end_auth_session); @@ -962,16 +968,20 @@ err: */ int tpm2_start_auth_session(struct tpm_chip *chip) { + struct tpm2_auth *auth; struct tpm_buf buf; - struct tpm2_auth *auth = chip->auth; - int rc; u32 null_key; + int rc; - if (!auth) { - dev_warn_once(&chip->dev, "auth session is not active\n"); + if (chip->auth) { + dev_warn_once(&chip->dev, "auth session is active\n"); return 0; } + auth = kzalloc(sizeof(*auth), GFP_KERNEL); + if (!auth) + return -ENOMEM; + rc = tpm2_load_null(chip, &null_key); if (rc) goto out; @@ -992,7 +1002,7 @@ int tpm2_start_auth_session(struct tpm_chip *chip) tpm_buf_append(&buf, auth->our_nonce, sizeof(auth->our_nonce)); /* append encrypted salt and squirrel away unencrypted in auth */ - tpm_buf_append_salt(&buf, chip); + tpm_buf_append_salt(&buf, chip, auth); /* session type (HMAC, audit or policy) */ tpm_buf_append_u8(&buf, TPM2_SE_HMAC); @@ -1014,10 +1024,13 @@ int tpm2_start_auth_session(struct tpm_chip *chip) tpm_buf_destroy(&buf); - if (rc) - goto out; + if (rc == TPM2_RC_SUCCESS) { + chip->auth = auth; + return 0; + } - out: +out: + kfree_sensitive(auth); return rc; } EXPORT_SYMBOL(tpm2_start_auth_session); @@ -1367,10 +1380,6 @@ int tpm2_sessions_init(struct tpm_chip *chip) return rc; } - chip->auth = kmalloc(sizeof(*chip->auth), GFP_KERNEL); - if (!chip->auth) - return -ENOMEM; - return rc; } EXPORT_SYMBOL(tpm2_sessions_init); -- cgit From b935252cc2983d3bcb306fef5bf838e255bab631 Mon Sep 17 00:00:00 2001 From: Levi Zim Date: Mon, 21 Oct 2024 21:55:49 +0800 Subject: docs: networking: packet_mmap: replace dead links with archive.org links The original link returns 404 now. 
This commit replaces the dead google site link with archive.org link. Signed-off-by: Levi Zim Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/20241021-packet_mmap_fix_link-v1-1-dffae4a174c0@outlook.com Signed-off-by: Jakub Kicinski --- Documentation/networking/packet_mmap.rst | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index dca15d15feaf..02370786e77b 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -16,7 +16,7 @@ ii) transmit network traffic, or any other that needs raw Howto can be found at: - https://sites.google.com/site/packetmmap/ + https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/ Please send your comments to - Ulisses Alonso Camaró @@ -166,7 +166,8 @@ As capture, each frame contains two parts:: /* bind socket to eth0 */ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); - A complete tutorial is available at: https://sites.google.com/site/packetmmap/ + A complete tutorial is available at: + https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/ By default, the user should put data at:: -- cgit From 3deb12c788c385e17142ce6ec50f769852fcec65 Mon Sep 17 00:00:00 2001 From: "Matthieu Baerts (NGI0)" Date: Mon, 21 Oct 2024 12:25:26 +0200 Subject: mptcp: init: protect sched with rcu_read_lock Enabling CONFIG_PROVE_RCU_LIST with its dependence CONFIG_RCU_EXPERT creates this splat when an MPTCP socket is created: ============================= WARNING: suspicious RCU usage 6.12.0-rc2+ #11 Not tainted ----------------------------- net/mptcp/sched.c:44 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by mptcp_connect/176. stack backtrace: CPU: 0 UID: 0 PID: 176 Comm: mptcp_connect Not tainted 6.12.0-rc2+ #11 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack_lvl (lib/dump_stack.c:123) lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822) mptcp_sched_find (net/mptcp/sched.c:44 (discriminator 7)) mptcp_init_sock (net/mptcp/protocol.c:2867 (discriminator 1)) ? sock_init_data_uid (arch/x86/include/asm/atomic.h:28) inet_create.part.0.constprop.0 (net/ipv4/af_inet.c:386) ? __sock_create (include/linux/rcupdate.h:347 (discriminator 1)) __sock_create (net/socket.c:1576) __sys_socket (net/socket.c:1671) ? __pfx___sys_socket (net/socket.c:1712) ? do_user_addr_fault (arch/x86/mm/fault.c:1419 (discriminator 1)) __x64_sys_socket (net/socket.c:1728) do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) That's because when the socket is initialised, rcu_read_lock() is not used despite the explicit comment written above the declaration of mptcp_sched_find() in sched.c. Adding the missing lock/unlock avoids the warning. 
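For illustration, a kernel-style sketch (hypothetical list, not the actual MPTCP scheduler code) of the rule the lockdep splat enforces: a list traversed with the RCU list primitives must be walked inside an rcu_read_lock()/rcu_read_unlock() section, even on init paths that do not look like classic readers.

```c
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/string.h>

struct sched_entry {
	struct list_head list;
	const char *name;
};

/* must be called with rcu_read_lock() held */
static struct sched_entry *sched_find(struct list_head *head, const char *name)
{
	struct sched_entry *e;

	list_for_each_entry_rcu(e, head, list) {
		if (!strcmp(e->name, name))
			return e;
	}
	return NULL;
}

static struct sched_entry *sched_lookup(struct list_head *head, const char *name)
{
	struct sched_entry *e;

	rcu_read_lock();	/* protects the RCU list walk above */
	e = sched_find(head, name);
	rcu_read_unlock();

	/* entries are assumed never freed (module lifetime), so the
	 * returned pointer stays valid after the read-side section ends
	 */
	return e;
}
```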
Fixes: 1730b2b2c5a5 ("mptcp: add sched in mptcp_sock") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/523 Reviewed-by: Geliang Tang Signed-off-by: Matthieu Baerts (NGI0) Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241021-net-mptcp-sched-lock-v1-1-637759cf061c@kernel.org Signed-off-by: Jakub Kicinski --- net/mptcp/protocol.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 6d0e201c3eb2..d263091659e0 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -2864,8 +2864,10 @@ static int mptcp_init_sock(struct sock *sk) if (unlikely(!net->mib.mptcp_statistics) && !mptcp_mib_alloc(net)) return -ENOMEM; + rcu_read_lock(); ret = mptcp_init_sched(mptcp_sk(sk), mptcp_sched_find(mptcp_get_scheduler(net))); + rcu_read_unlock(); if (ret) return ret; -- cgit From 5513dc1d8fec929006548dde4acdabdc54379beb Mon Sep 17 00:00:00 2001 From: "Matthieu Baerts (NGI0)" Date: Mon, 21 Oct 2024 12:25:28 +0200 Subject: selftests: mptcp: list sysctl data Listing all the values linked to the MPTCP sysctl knobs was not exercised in MPTCP test suite. Let's do that to avoid any regressions, but also to have a kernel with a debug kconfig verifying more assumptions. For the moment, we are not interested by the output, only to avoid crashes and warnings. Signed-off-by: Matthieu Baerts (NGI0) Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241021-net-mptcp-sched-lock-v1-3-637759cf061c@kernel.org Signed-off-by: Jakub Kicinski --- tools/testing/selftests/net/mptcp/mptcp_connect.sh | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh index 57325d57e4c6..b48b4e56826a 100755 --- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh +++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh @@ -259,6 +259,15 @@ check_mptcp_disabled() mptcp_lib_ns_init disabled_ns print_larger_title "New MPTCP socket can be blocked via sysctl" + + # mainly to cover more code + if ! ip netns exec ${disabled_ns} sysctl net.mptcp >/dev/null; then + mptcp_lib_pr_fail "not able to list net.mptcp sysctl knobs" + mptcp_lib_result_fail "not able to list net.mptcp sysctl knobs" + ret=${KSFT_FAIL} + return 1 + fi + # net.mptcp.enabled should be enabled by default if [ "$(ip netns exec ${disabled_ns} sysctl net.mptcp.enabled | awk '{ print $3 }')" -ne 1 ]; then mptcp_lib_pr_fail "net.mptcp.enabled sysctl is not 1 by default" -- cgit From f1e54d11b210b53d418ff1476c6b58a2f434dfc0 Mon Sep 17 00:00:00 2001 From: Jianbo Liu Date: Mon, 21 Oct 2024 13:03:09 +0300 Subject: macsec: Fix use-after-free while sending the offloading packet KASAN reports the following UAF. The metadata_dst, which is used to store the SCI value for macsec offload, is already freed by metadata_dst_free() in macsec_free_netdev(), while driver still use it for sending the packet. To fix this issue, dst_release() is used instead to release metadata_dst. So it is not freed instantly in macsec_free_netdev() if still referenced by skb. BUG: KASAN: slab-use-after-free in mlx5e_xmit+0x1e8f/0x4190 [mlx5_core] Read of size 2 at addr ffff88813e42e038 by task kworker/7:2/714 [...] 
Workqueue: mld mld_ifc_work Call Trace: dump_stack_lvl+0x51/0x60 print_report+0xc1/0x600 kasan_report+0xab/0xe0 mlx5e_xmit+0x1e8f/0x4190 [mlx5_core] dev_hard_start_xmit+0x120/0x530 sch_direct_xmit+0x149/0x11e0 __qdisc_run+0x3ad/0x1730 __dev_queue_xmit+0x1196/0x2ed0 vlan_dev_hard_start_xmit+0x32e/0x510 [8021q] dev_hard_start_xmit+0x120/0x530 __dev_queue_xmit+0x14a7/0x2ed0 macsec_start_xmit+0x13e9/0x2340 dev_hard_start_xmit+0x120/0x530 __dev_queue_xmit+0x14a7/0x2ed0 ip6_finish_output2+0x923/0x1a70 ip6_finish_output+0x2d7/0x970 ip6_output+0x1ce/0x3a0 NF_HOOK.constprop.0+0x15f/0x190 mld_sendpack+0x59a/0xbd0 mld_ifc_work+0x48a/0xa80 process_one_work+0x5aa/0xe50 worker_thread+0x79c/0x1290 kthread+0x28f/0x350 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x11/0x20 Allocated by task 3922: kasan_save_stack+0x20/0x40 kasan_save_track+0x10/0x30 __kasan_kmalloc+0x77/0x90 __kmalloc_noprof+0x188/0x400 metadata_dst_alloc+0x1f/0x4e0 macsec_newlink+0x914/0x1410 __rtnl_newlink+0xe08/0x15b0 rtnl_newlink+0x5f/0x90 rtnetlink_rcv_msg+0x667/0xa80 netlink_rcv_skb+0x12c/0x360 netlink_unicast+0x551/0x770 netlink_sendmsg+0x72d/0xbd0 __sock_sendmsg+0xc5/0x190 ____sys_sendmsg+0x52e/0x6a0 ___sys_sendmsg+0xeb/0x170 __sys_sendmsg+0xb5/0x140 do_syscall_64+0x4c/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Freed by task 4011: kasan_save_stack+0x20/0x40 kasan_save_track+0x10/0x30 kasan_save_free_info+0x37/0x50 poison_slab_object+0x10c/0x190 __kasan_slab_free+0x11/0x30 kfree+0xe0/0x290 macsec_free_netdev+0x3f/0x140 netdev_run_todo+0x450/0xc70 rtnetlink_rcv_msg+0x66f/0xa80 netlink_rcv_skb+0x12c/0x360 netlink_unicast+0x551/0x770 netlink_sendmsg+0x72d/0xbd0 __sock_sendmsg+0xc5/0x190 ____sys_sendmsg+0x52e/0x6a0 ___sys_sendmsg+0xeb/0x170 __sys_sendmsg+0xb5/0x140 do_syscall_64+0x4c/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 0a28bfd4971f ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support") Signed-off-by: Jianbo Liu Reviewed-by: Patrisious Haddad Reviewed-by: Chris Mi Signed-off-by: Tariq Toukan Reviewed-by: Simon Horman Reviewed-by: Sabrina Dubroca Link: https://patch.msgid.link/20241021100309.234125-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/macsec.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index 26034f80d4a4..ee2159282573 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -3798,8 +3798,7 @@ static void macsec_free_netdev(struct net_device *dev) { struct macsec_dev *macsec = macsec_priv(dev); - if (macsec->secy.tx_sc.md_dst) - metadata_dst_free(macsec->secy.tx_sc.md_dst); + dst_release(&macsec->secy.tx_sc.md_dst->dst); free_percpu(macsec->stats); free_percpu(macsec->secy.tx_sc.stats); -- cgit From 94c11e852955b2eef5c4f0b36cfeae7dcf11a759 Mon Sep 17 00:00:00 2001 From: Benjamin Große Date: Sun, 20 Oct 2024 18:41:28 +0100 Subject: usb: add support for new USB device ID 0x17EF:0x3098 for the r8152 driver MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This patch adds support for another Lenovo Mini dock 0x17EF:0x3098 to the r8152 driver. The device has been tested on NixOS, hotplugging and sleep included. 
Signed-off-by: Benjamin Große Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241020174128.160898-1-ste3ls@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/usb/r8152.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c index a5612c799f5e..468c73974046 100644 --- a/drivers/net/usb/r8152.c +++ b/drivers/net/usb/r8152.c @@ -10069,6 +10069,7 @@ static const struct usb_device_id rtl8152_table[] = { { USB_DEVICE(VENDOR_ID_LENOVO, 0x3062) }, { USB_DEVICE(VENDOR_ID_LENOVO, 0x3069) }, { USB_DEVICE(VENDOR_ID_LENOVO, 0x3082) }, + { USB_DEVICE(VENDOR_ID_LENOVO, 0x3098) }, { USB_DEVICE(VENDOR_ID_LENOVO, 0x7205) }, { USB_DEVICE(VENDOR_ID_LENOVO, 0x720c) }, { USB_DEVICE(VENDOR_ID_LENOVO, 0x7214) }, -- cgit From 2ef9439f7a19fd3d43b288d38b1c6e55b668a4fe Mon Sep 17 00:00:00 2001 From: Aleksei Vetrov Date: Mon, 28 Oct 2024 22:50:30 +0000 Subject: ASoC: dapm: fix bounds checker error in dapm_widget_list_create The widgets array in the snd_soc_dapm_widget_list has a __counted_by attribute attached to it, which points to the num_widgets variable. This attribute is used in bounds checking, and if it is not set before the array is filled, then the bounds sanitizer will issue a warning or a kernel panic if CONFIG_UBSAN_TRAP is set. This patch sets the size of the widgets list calculated with list_for_each as the initial value for num_widgets as it is used for allocating memory for the array. It is updated with the actual number of added elements after the array is filled. Signed-off-by: Aleksei Vetrov Fixes: 80e698e2df5b ("ASoC: soc-dapm: Annotate struct snd_soc_dapm_widget_list with __counted_by") Link: https://patch.msgid.link/20241028-soc-dapm-bounds-checker-fix-v1-1-262b0394e89e@google.com Signed-off-by: Mark Brown --- sound/soc/soc-dapm.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/sound/soc/soc-dapm.c b/sound/soc/soc-dapm.c index c34934c31ffe..99521c784a9b 100644 --- a/sound/soc/soc-dapm.c +++ b/sound/soc/soc-dapm.c @@ -1147,6 +1147,8 @@ static int dapm_widget_list_create(struct snd_soc_dapm_widget_list **list, if (*list == NULL) return -ENOMEM; + (*list)->num_widgets = size; + list_for_each_entry(w, widgets, work_list) (*list)->widgets[i++] = w; -- cgit From 9a71892cbcdb9d1459c84f5a4c722b14354158a5 Mon Sep 17 00:00:00 2001 From: Greg Kroah-Hartman Date: Tue, 29 Oct 2024 01:23:04 +0100 Subject: Revert "driver core: Fix uevent_show() vs driver detach race" This reverts commit 15fffc6a5624b13b428bb1c6e9088e32a55eb82c. This commit causes a regression, so revert it for now until it can come back in a way that works for everyone. Link: https://lore.kernel.org/all/172790598832.1168608.4519484276671503678.stgit@dwillia2-xfh.jf.intel.com/ Fixes: 15fffc6a5624 ("driver core: Fix uevent_show() vs driver detach race") Cc: stable Cc: Ashish Sangwan Cc: Namjae Jeon Cc: Dirk Behme Cc: Greg Kroah-Hartman Cc: Rafael J. 
Wysocki Cc: Dan Williams Signed-off-by: Greg Kroah-Hartman --- drivers/base/core.c | 13 +++++-------- drivers/base/module.c | 4 ---- 2 files changed, 5 insertions(+), 12 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index a4c853411a6b..adc0d74fa96c 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -26,7 +26,6 @@ #include #include #include -#include #include #include #include @@ -2634,7 +2633,6 @@ static const char *dev_uevent_name(const struct kobject *kobj) static int dev_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) { const struct device *dev = kobj_to_dev(kobj); - struct device_driver *driver; int retval = 0; /* add device node properties if present */ @@ -2663,12 +2661,8 @@ static int dev_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) if (dev->type && dev->type->name) add_uevent_var(env, "DEVTYPE=%s", dev->type->name); - /* Synchronize with module_remove_driver() */ - rcu_read_lock(); - driver = READ_ONCE(dev->driver); - if (driver) - add_uevent_var(env, "DRIVER=%s", driver->name); - rcu_read_unlock(); + if (dev->driver) + add_uevent_var(env, "DRIVER=%s", dev->driver->name); /* Add common DT information about the device */ of_device_uevent(dev, env); @@ -2738,8 +2732,11 @@ static ssize_t uevent_show(struct device *dev, struct device_attribute *attr, if (!env) return -ENOMEM; + /* Synchronize with really_probe() */ + device_lock(dev); /* let the kset specific function add its keys */ retval = kset->uevent_ops->uevent(&dev->kobj, env); + device_unlock(dev); if (retval) goto out; diff --git a/drivers/base/module.c b/drivers/base/module.c index c4eaa1158d54..5bc71bea883a 100644 --- a/drivers/base/module.c +++ b/drivers/base/module.c @@ -7,7 +7,6 @@ #include #include #include -#include #include "base.h" static char *make_driver_name(const struct device_driver *drv) @@ -102,9 +101,6 @@ void module_remove_driver(const struct device_driver *drv) if (!drv) return; - /* Synchronize with dev_uevent() */ - synchronize_rcu(); - sysfs_remove_link(&drv->p->kobj, "module"); if (drv->owner) -- cgit From 740be3b9a6d73336f8c7d540842d0831dc7a808b Mon Sep 17 00:00:00 2001 From: Cong Wang Date: Sat, 26 Oct 2024 11:55:22 -0700 Subject: sock_map: fix a NULL pointer dereference in sock_map_link_update_prog() The following race condition could trigger a NULL pointer dereference: sock_map_link_detach(): sock_map_link_update_prog(): mutex_lock(&sockmap_mutex); ... sockmap_link->map = NULL; mutex_unlock(&sockmap_mutex); mutex_lock(&sockmap_mutex); ... sock_map_prog_link_lookup(sockmap_link->map); mutex_unlock(&sockmap_mutex); Fix it by adding a NULL pointer check. In this specific case, it makes no sense to update a link which is being released. 
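For illustration, a minimal user-space sketch of the race shape: one path clears a pointer under the mutex, so every other path that re-takes the mutex must re-check the pointer for NULL before dereferencing it. Names are hypothetical; this is a pthread-based sketch, not the sockmap code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct link { struct { int id; } *map; };

static void link_detach(struct link *l)
{
	pthread_mutex_lock(&lock);
	free(l->map);
	l->map = NULL;		/* published under the mutex */
	pthread_mutex_unlock(&lock);
}

static int link_update(struct link *l)
{
	int ret = 0;

	pthread_mutex_lock(&lock);
	if (!l->map) {		/* detach may have won the race */
		ret = -ENOLINK;
		goto out;
	}
	l->map->id++;		/* safe: checked under the same mutex */
out:
	pthread_mutex_unlock(&lock);
	return ret;
}

int main(void)
{
	struct link l = { .map = calloc(1, sizeof(*l.map)) };

	link_detach(&l);
	printf("update after detach: %d\n", link_update(&l));
	return 0;
}
```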
Reported-by: Ruan Bonan Fixes: 699c23f02c65 ("bpf: Add bpf_link support for sk_msg and sk_skb progs") Cc: Yonghong Song Cc: John Fastabend Cc: Jakub Sitnicki Signed-off-by: Cong Wang Link: https://lore.kernel.org/r/20241026185522.338562-1-xiyou.wangcong@gmail.com Signed-off-by: Martin KaFai Lau --- net/core/sock_map.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 07d6aa4e39ef..78347d7d25ef 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -1760,6 +1760,10 @@ static int sock_map_link_update_prog(struct bpf_link *link, ret = -EINVAL; goto out; } + if (!sockmap_link->map) { + ret = -ENOLINK; + goto out; + } ret = sock_map_prog_link_lookup(sockmap_link->map, &pprog, &plink, sockmap_link->attach_type); -- cgit From cb617e148bb3d50dfbbd44db81227edcee2cd4bc Mon Sep 17 00:00:00 2001 From: Abylay Ospan Date: Wed, 23 Oct 2024 16:34:25 +0000 Subject: MAINTAINERS: add netup_unidvb maintainer Adding/restoring maintainership for the following drivers: F: drivers/media/pci/netup_unidvb/* F: drivers/media/dvb-frontends/helene* F: drivers/media/dvb-frontends/horus3a* F: drivers/media/dvb-frontends/lnbh25* F: drivers/media/dvb-frontends/ascot2e* F: drivers/media/dvb-frontends/cxd2841er* Signed-off-by: Abylay Ospan Link: https://lore.kernel.org/r/20241023163425.30492-1-aospan@amazon.com Signed-off-by: Greg Kroah-Hartman --- MAINTAINERS | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index e9659a5a7fb3..332ab5fd4b59 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14140,6 +14140,15 @@ S: Maintained T: git git://linuxtv.org/media_tree.git F: drivers/media/platform/nxp/imx-pxp.[ch] +MEDIA DRIVERS FOR ASCOT2E +M: Abylay Ospan +L: linux-media@vger.kernel.org +S: Supported +W: https://linuxtv.org +W: http://netup.tv/ +T: git git://linuxtv.org/media_tree.git +F: drivers/media/dvb-frontends/ascot2e* + MEDIA DRIVERS FOR CXD2099AR CI CONTROLLERS M: Jasmin Jessich L: linux-media@vger.kernel.org @@ -14148,6 +14157,15 @@ W: https://linuxtv.org T: git git://linuxtv.org/media_tree.git F: drivers/media/dvb-frontends/cxd2099* +MEDIA DRIVERS FOR CXD2841ER +M: Abylay Ospan +L: linux-media@vger.kernel.org +S: Supported +W: https://linuxtv.org +W: http://netup.tv/ +T: git git://linuxtv.org/media_tree.git +F: drivers/media/dvb-frontends/cxd2841er* + MEDIA DRIVERS FOR CXD2880 M: Yasunari Takiguchi L: linux-media@vger.kernel.org @@ -14192,6 +14210,33 @@ F: drivers/media/platform/nxp/imx-mipi-csis.c F: drivers/media/platform/nxp/imx7-media-csi.c F: drivers/media/platform/nxp/imx8mq-mipi-csi2.c +MEDIA DRIVERS FOR HELENE +M: Abylay Ospan +L: linux-media@vger.kernel.org +S: Supported +W: https://linuxtv.org +W: http://netup.tv/ +T: git git://linuxtv.org/media_tree.git +F: drivers/media/dvb-frontends/helene* + +MEDIA DRIVERS FOR HORUS3A +M: Abylay Ospan +L: linux-media@vger.kernel.org +S: Supported +W: https://linuxtv.org +W: http://netup.tv/ +T: git git://linuxtv.org/media_tree.git +F: drivers/media/dvb-frontends/horus3a* + +MEDIA DRIVERS FOR LNBH25 +M: Abylay Ospan +L: linux-media@vger.kernel.org +S: Supported +W: https://linuxtv.org +W: http://netup.tv/ +T: git git://linuxtv.org/media_tree.git +F: drivers/media/dvb-frontends/lnbh25* + MEDIA DRIVERS FOR MXL5XX TUNER DEMODULATORS L: linux-media@vger.kernel.org S: Orphan @@ -14199,6 +14244,15 @@ W: https://linuxtv.org T: git git://linuxtv.org/media_tree.git F: drivers/media/dvb-frontends/mxl5xx* +MEDIA DRIVERS FOR NETUP PCI UNIVERSAL DVB 
devices +M: Abylay Ospan +L: linux-media@vger.kernel.org +S: Supported +W: https://linuxtv.org +W: http://netup.tv/ +T: git git://linuxtv.org/media_tree.git +F: drivers/media/pci/netup_unidvb/* + MEDIA DRIVERS FOR NVIDIA TEGRA - VDE M: Dmitry Osipenko L: linux-media@vger.kernel.org -- cgit From 4adf613e01bf99e1739f6ff3e162ad5b7d578d1a Mon Sep 17 00:00:00 2001 From: Alexander Usyskin Date: Tue, 15 Oct 2024 15:31:57 +0300 Subject: mei: use kvmalloc for read buffer Read buffer is allocated according to max message size, reported by the firmware and may reach 64K in systems with pxp client. Contiguous 64k allocation may fail under memory pressure. Read buffer is used as in-driver message storage and not required to be contiguous. Use kvmalloc to allow kernel to allocate non-contiguous memory. Fixes: 3030dc056459 ("mei: add wrapper for queuing control commands.") Cc: stable Reported-by: Rohit Agarwal Closes: https://lore.kernel.org/all/20240813084542.2921300-1-rohiagar@chromium.org/ Tested-by: Brian Geffon Signed-off-by: Alexander Usyskin Acked-by: Tomas Winkler Link: https://lore.kernel.org/r/20241015123157.2337026-1-alexander.usyskin@intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/misc/mei/client.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/misc/mei/client.c b/drivers/misc/mei/client.c index 9d090fa07516..be011cef12e5 100644 --- a/drivers/misc/mei/client.c +++ b/drivers/misc/mei/client.c @@ -321,7 +321,7 @@ void mei_io_cb_free(struct mei_cl_cb *cb) return; list_del(&cb->list); - kfree(cb->buf.data); + kvfree(cb->buf.data); kfree(cb->ext_hdr); kfree(cb); } @@ -497,7 +497,7 @@ struct mei_cl_cb *mei_cl_alloc_cb(struct mei_cl *cl, size_t length, if (length == 0) return cb; - cb->buf.data = kmalloc(roundup(length, MEI_SLOT_SIZE), GFP_KERNEL); + cb->buf.data = kvmalloc(roundup(length, MEI_SLOT_SIZE), GFP_KERNEL); if (!cb->buf.data) { mei_io_cb_free(cb); return NULL; -- cgit From fa0122eaca4f14272fbf76a70d51db78c69091f6 Mon Sep 17 00:00:00 2001 From: zhouyuhang Date: Mon, 28 Oct 2024 16:41:32 +0800 Subject: selftests/mount_setattr: fix idmap_mount_tree_invalid failed to run Test case idmap_mount_tree_invalid failed to run on the newer kernel with the following output: # RUN mount_setattr_idmapped.idmap_mount_tree_invalid ... # mount_setattr_test.c:1428:idmap_mount_tree_invalid:Expected sys_mount_setattr(open_tree_fd, "", AT_EMPTY_PATH, &attr, sizeof(attr)) (0) ! = 0 (0) # idmap_mount_tree_invalid: Test terminated by assertion This is because tmpfs is mounted at "/mnt/A", and tmpfs already contains the flag FS_ALLOW_IDMAP after the commit 7a80e5b8c6fa ("shmem: support idmapped mounts for tmpfs"). So calling sys_mount_setattr here returns 0 instead of -EINVAL as expected. Ramfs does not support idmap mounts, so we can use it here to test invalid mounts, which allows the test case to pass with the following output: # Starting 1 tests from 1 test cases. # RUN mount_setattr_idmapped.idmap_mount_tree_invalid ... # OK mount_setattr_idmapped.idmap_mount_tree_invalid ok 1 mount_setattr_idmapped.idmap_mount_tree_invalid # PASSED: 1 / 1 tests passed. 
Link: https://lore.kernel.org/all/20241028084132.3212598-1-zhouyuhang1010@163.com/ Signed-off-by: zhouyuhang Reviewed-by: Christian Brauner Signed-off-by: Shuah Khan --- tools/testing/selftests/mount_setattr/mount_setattr_test.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c index c6a8c732b802..68801e1a9ec2 100644 --- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c +++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c @@ -1414,6 +1414,13 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid) ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/b", 0, 0, 0), 0); ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0); + ASSERT_EQ(mount("testing", "/mnt/A", "ramfs", MS_NOATIME | MS_NODEV, + "size=100000,mode=700"), 0); + + ASSERT_EQ(mkdir("/mnt/A/AA", 0777), 0); + + ASSERT_EQ(mount("/tmp", "/mnt/A/AA", NULL, MS_BIND | MS_REC, NULL), 0); + open_tree_fd = sys_open_tree(-EBADF, "/mnt/A", AT_RECURSIVE | AT_EMPTY_PATH | @@ -1433,6 +1440,8 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid) ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0); ASSERT_EQ(expected_uid_gid(open_tree_fd, "B/b", 0, 0, 0), 0); ASSERT_EQ(expected_uid_gid(open_tree_fd, "B/BB/b", 0, 0, 0), 0); + + (void)umount2("/mnt/A", MNT_DETACH); } TEST_F(mount_setattr, mount_attr_nosymfollow) -- cgit From 6553bfcb8499bf5e7e6d07d93f29459198dba798 Mon Sep 17 00:00:00 2001 From: Alessandro Zanni Date: Mon, 28 Oct 2024 20:08:43 +0100 Subject: selftests/intel_pstate: fix operand expected error Running "make kselftest TARGETS=intel_pstate" results in the following errors: - ./run.sh: line 90: / 1000: syntax error: operand expected (error token is "/ 1000") - ./run.sh: line 92: / 1000: syntax error: operand expected (error token is "/ 1000") This fix allows to have cross-platform compatibility when using arithmetic expression with command substitutions. Link: https://lore.kernel.org/r/f37df23888cd5ea6b3976f19d3e25796129dd090.1730141362.git.alessandro.zanni87@gmail.com Signed-off-by: Alessandro Zanni Signed-off-by: Shuah Khan --- tools/testing/selftests/intel_pstate/run.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/intel_pstate/run.sh b/tools/testing/selftests/intel_pstate/run.sh index e7008f614ad7..0c1b6c1308a4 100755 --- a/tools/testing/selftests/intel_pstate/run.sh +++ b/tools/testing/selftests/intel_pstate/run.sh @@ -87,9 +87,9 @@ mkt_freq=${_mkt_freq}0 # Get the ranges from cpupower _min_freq=$(cpupower frequency-info -l | tail -1 | awk ' { print $1 } ') -min_freq=$(($_min_freq / 1000)) +min_freq=$((_min_freq / 1000)) _max_freq=$(cpupower frequency-info -l | tail -1 | awk ' { print $2 } ') -max_freq=$(($_max_freq / 1000)) +max_freq=$((_max_freq / 1000)) [ $EVALUATE_ONLY -eq 0 ] && for freq in `seq $max_freq -100 $min_freq` -- cgit From 722d89c34cc496aadc737e2df40234580fa05877 Mon Sep 17 00:00:00 2001 From: Alessandro Zanni Date: Mon, 28 Oct 2024 20:08:44 +0100 Subject: selftests/intel_pstate: check if cpupower is installed Running "make kselftest TARGETS=intel_pstate" results in the following errors: - ./run.sh: line 89: cpupower: command not found - ./run.sh: line 91: cpupower: command not found if the cpupower is not installed. Since the test depends on cpupower, this patch stops the test if the cpupower is not installed. 
Link: https://lore.kernel.org/all/cc01753c8dab0f33669a5a0fc162544078055bd1.1730141362.git.alessandro.zanni87@gmail.com/ Signed-off-by: Alessandro Zanni Signed-off-by: Shuah Khan --- tools/testing/selftests/intel_pstate/run.sh | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/intel_pstate/run.sh b/tools/testing/selftests/intel_pstate/run.sh index 0c1b6c1308a4..6a3b8503264e 100755 --- a/tools/testing/selftests/intel_pstate/run.sh +++ b/tools/testing/selftests/intel_pstate/run.sh @@ -44,6 +44,11 @@ if [ $UID != 0 ] && [ $EVALUATE_ONLY == 0 ]; then exit $ksft_skip fi +if ! command -v cpupower &> /dev/null; then + echo $msg cpupower could not be found, please install it >&2 + exit $ksft_skip +fi + max_cpus=$(($(nproc)-1)) function run_test () { -- cgit From e7cd4b811c9e019f5acbce85699c622b30194c24 Mon Sep 17 00:00:00 2001 From: Zongmin Zhou Date: Thu, 24 Oct 2024 10:27:00 +0800 Subject: usbip: tools: Fix detach_port() invalid port error path The detach_port() doesn't return error when detach is attempted on an invalid port. Fixes: 40ecdeb1a187 ("usbip: usbip_detach: fix to check for invalid ports") Cc: stable@vger.kernel.org Reviewed-by: Hongren Zheng Reviewed-by: Shuah Khan Signed-off-by: Zongmin Zhou Link: https://lore.kernel.org/r/20241024022700.1236660-1-min_halo@163.com Signed-off-by: Greg Kroah-Hartman --- tools/usb/usbip/src/usbip_detach.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/usb/usbip/src/usbip_detach.c b/tools/usb/usbip/src/usbip_detach.c index b29101986b5a..6b78d4a81e95 100644 --- a/tools/usb/usbip/src/usbip_detach.c +++ b/tools/usb/usbip/src/usbip_detach.c @@ -68,6 +68,7 @@ static int detach_port(char *port) } if (!found) { + ret = -1; err("Invalid port %s > maxports %d", port, vhci_driver->nports); goto call_driver_close; -- cgit From 31004740e42846a6f0bb255e6348281df3eb8032 Mon Sep 17 00:00:00 2001 From: Basavaraj Natikar Date: Thu, 24 Oct 2024 19:07:18 +0530 Subject: xhci: Use pm_runtime_get to prevent RPM on unsupported systems Use pm_runtime_put in the remove function and pm_runtime_get to disable RPM on platforms that don't support runtime D3, as re-enabling it through sysfs auto power control may cause the controller to malfunction. This can lead to issues such as hotplug devices not being detected due to failed interrupt generation. 
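For illustration, a kernel-style sketch (a hypothetical driver, not the xhci code) of the usage-count idea: taking a runtime-PM reference in probe keeps the usage count above zero so the device cannot be runtime suspended even if sysfs auto power control is later enabled, and the matching put in remove keeps the count balanced.

```c
#include <linux/device.h>
#include <linux/pm_runtime.h>

/* hypothetical probe/remove pair for a device that must not runtime-suspend */
static int example_probe(struct device *dev)
{
	/* hold a usage-count reference: runtime suspend stays blocked */
	pm_runtime_get(dev);
	return 0;
}

static void example_remove(struct device *dev)
{
	/* drop the reference taken in probe so the count stays balanced */
	pm_runtime_put(dev);
}
```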
Fixes: a5d6264b638e ("xhci: Enable RPM on controllers that support low-power states") Cc: stable Signed-off-by: Basavaraj Natikar Reviewed-by: Mario Limonciello Link: https://lore.kernel.org/r/20241024133718.723846-1-Basavaraj.Natikar@amd.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/host/xhci-pci.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c index 7e538194a0a4..cb07cee9ed0c 100644 --- a/drivers/usb/host/xhci-pci.c +++ b/drivers/usb/host/xhci-pci.c @@ -640,7 +640,7 @@ int xhci_pci_common_probe(struct pci_dev *dev, const struct pci_device_id *id) pm_runtime_put_noidle(&dev->dev); if (pci_choose_state(dev, PMSG_SUSPEND) == PCI_D0) - pm_runtime_forbid(&dev->dev); + pm_runtime_get(&dev->dev); else if (xhci->quirks & XHCI_DEFAULT_PM_RUNTIME_ALLOW) pm_runtime_allow(&dev->dev); @@ -683,7 +683,9 @@ void xhci_pci_remove(struct pci_dev *dev) xhci->xhc_state |= XHCI_STATE_REMOVING; - if (xhci->quirks & XHCI_DEFAULT_PM_RUNTIME_ALLOW) + if (pci_choose_state(dev, PMSG_SUSPEND) == PCI_D0) + pm_runtime_put(&dev->dev); + else if (xhci->quirks & XHCI_DEFAULT_PM_RUNTIME_ALLOW) pm_runtime_forbid(&dev->dev); if (xhci->shared_hcd) { -- cgit From 075919f6df5dd82ad0b1894898b315fbb3c29b84 Mon Sep 17 00:00:00 2001 From: Faisal Hassan Date: Tue, 22 Oct 2024 21:26:31 +0530 Subject: xhci: Fix Link TRB DMA in command ring stopped completion event During the aborting of a command, the software receives a command completion event for the command ring stopped, with the TRB pointing to the next TRB after the aborted command. If the command we abort is located just before the Link TRB in the command ring, then during the 'command ring stopped' completion event, the xHC gives the Link TRB in the event's cmd DMA, which causes a mismatch in handling command completion event. To address this situation, move the 'command ring stopped' completion event check slightly earlier, since the specific command it stopped on isn't of significant concern. 
Fixes: 7f84eef0dafb ("USB: xhci: No-op command queueing and irq handler.") Cc: stable@vger.kernel.org Signed-off-by: Faisal Hassan Acked-by: Mathias Nyman Link: https://lore.kernel.org/r/20241022155631.1185-1-quic_faisalh@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/host/xhci-ring.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index b6eb928e260f..928b93ad1ee8 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -1718,6 +1718,14 @@ static void handle_cmd_completion(struct xhci_hcd *xhci, trace_xhci_handle_command(xhci->cmd_ring, &cmd_trb->generic); + cmd_comp_code = GET_COMP_CODE(le32_to_cpu(event->status)); + + /* If CMD ring stopped we own the trbs between enqueue and dequeue */ + if (cmd_comp_code == COMP_COMMAND_RING_STOPPED) { + complete_all(&xhci->cmd_ring_stop_completion); + return; + } + cmd_dequeue_dma = xhci_trb_virt_to_dma(xhci->cmd_ring->deq_seg, cmd_trb); /* @@ -1734,14 +1742,6 @@ static void handle_cmd_completion(struct xhci_hcd *xhci, cancel_delayed_work(&xhci->cmd_timer); - cmd_comp_code = GET_COMP_CODE(le32_to_cpu(event->status)); - - /* If CMD ring stopped we own the trbs between enqueue and dequeue */ - if (cmd_comp_code == COMP_COMMAND_RING_STOPPED) { - complete_all(&xhci->cmd_ring_stop_completion); - return; - } - if (cmd->command_trb != xhci->cmd_ring->dequeue) { xhci_err(xhci, "Command completion event does not match command\n"); -- cgit From f3b311325fa20023fd1e322538388dca2ddb8dc0 Mon Sep 17 00:00:00 2001 From: Stefan Wahren Date: Fri, 25 Oct 2024 12:36:13 +0200 Subject: Revert "usb: dwc2: Skip clock gating on Broadcom SoCs" The commit d483f034f032 ("usb: dwc2: Skip clock gating on Broadcom SoCs") introduced a regression on Raspberry Pi 3 B Plus, which prevents enumeration of the onboard Microchip LAN7800 in case no external USB device is connected during boot. Fixes: d483f034f032 ("usb: dwc2: Skip clock gating on Broadcom SoCs") Signed-off-by: Stefan Wahren Link: https://lore.kernel.org/r/20241025103621.4780-2-wahrenst@gmx.net Signed-off-by: Greg Kroah-Hartman --- drivers/usb/dwc2/params.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/usb/dwc2/params.c b/drivers/usb/dwc2/params.c index 68226defdc60..4d73fae80b12 100644 --- a/drivers/usb/dwc2/params.c +++ b/drivers/usb/dwc2/params.c @@ -23,7 +23,6 @@ static void dwc2_set_bcm_params(struct dwc2_hsotg *hsotg) p->max_transfer_size = 65535; p->max_packet_count = 511; p->ahbcfg = 0x10; - p->no_clock_gating = true; } static void dwc2_set_his_params(struct dwc2_hsotg *hsotg) -- cgit From 623dae3e7084a9504e6dc4cf0cb83f305f413b4d Mon Sep 17 00:00:00 2001 From: Mathias Nyman Date: Thu, 24 Oct 2024 16:13:55 +0300 Subject: usb: acpi: fix boot hang due to early incorrect 'tunneled' USB3 device links Fix a boot hang issue triggered when a USB3 device is incorrectly assumed to be tunneled over USB4, thus attempting to create a device link between the USB3 "consumer" device and the USB4 "supplier" Host Interface before the USB4 side is properly bound to a driver. This could happen if xhci isn't capable of detecting tunneled devices, but ACPI tables contain all info needed to assume device is tunneled. i.e. udev->tunnel_mode == USB_LINK_UNKNOWN. It turns out that even for actual tunneled USB3 devices it can't be assumed that the thunderbolt driver providing the tunnel is loaded before the tunneled USB3 device is created. 
The tunnel can be created by BIOS and remain in use by thunderbolt/USB4 host driver once it loads. Solve this by making the device link "stateless", which doesn't create a driver presence order dependency between the supplier and consumer drivers. It still guarantees correct suspend/resume and shutdown ordering. cc: Mario Limonciello Fixes: f1bfb4a6fed6 ("usb: acpi: add device link between tunneled USB3 device and USB4 Host Interface") Tested-by: Harry Wentland Signed-off-by: Mathias Nyman Reviewed-by: Mika Westerberg Tested-by: Mario Limonciello Link: https://lore.kernel.org/r/20241024131355.3836538-1-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/core/usb-acpi.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/usb/core/usb-acpi.c b/drivers/usb/core/usb-acpi.c index 21585ed89ef8..03c22114214b 100644 --- a/drivers/usb/core/usb-acpi.c +++ b/drivers/usb/core/usb-acpi.c @@ -170,11 +170,11 @@ static int usb_acpi_add_usb4_devlink(struct usb_device *udev) struct fwnode_handle *nhi_fwnode __free(fwnode_handle) = fwnode_find_reference(dev_fwnode(&port_dev->dev), "usb4-host-interface", 0); - if (IS_ERR(nhi_fwnode)) + if (IS_ERR(nhi_fwnode) || !nhi_fwnode->dev) return 0; link = device_link_add(&port_dev->child->dev, nhi_fwnode->dev, - DL_FLAG_AUTOREMOVE_CONSUMER | + DL_FLAG_STATELESS | DL_FLAG_RPM_ACTIVE | DL_FLAG_PM_RUNTIME); if (!link) { -- cgit From 7f02b8a5b602098f2901166e7e4d583acaed872a Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Sun, 20 Oct 2024 14:56:34 +0200 Subject: usb: typec: qcom-pmic-typec: use fwnode_handle_put() to release fwnodes The right function to release a fwnode acquired via device_get_named_child_node() is fwnode_handle_put(), and not fwnode_remove_software_node(), as no software node is being handled. Replace the calls to fwnode_remove_software_node() with fwnode_handle_put() in qcom_pmic_typec_probe() and qcom_pmic_typec_remove(). 
Cc: stable@vger.kernel.org Fixes: a4422ff22142 ("usb: typec: qcom: Add Qualcomm PMIC Type-C driver") Suggested-by: Dmitry Baryshkov Signed-off-by: Javier Carrasco Acked-by: Bryan O'Donoghue Reviewed-by: Heikki Krogerus Reviewed-by: Dmitry Baryshkov Link: https://lore.kernel.org/r/20241020-qcom_pmic_typec-fwnode_remove-v2-1-7054f3d2e215@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c b/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c index 501eddb294e4..7d9d37c16fad 100644 --- a/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c +++ b/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c @@ -123,7 +123,7 @@ port_stop: port_unregister: tcpm_unregister_port(tcpm->tcpm_port); fwnode_remove: - fwnode_remove_software_node(tcpm->tcpc.fwnode); + fwnode_handle_put(tcpm->tcpc.fwnode); return ret; } @@ -135,7 +135,7 @@ static void qcom_pmic_typec_remove(struct platform_device *pdev) tcpm->pdphy_stop(tcpm); tcpm->port_stop(tcpm); tcpm_unregister_port(tcpm->tcpm_port); - fwnode_remove_software_node(tcpm->tcpc.fwnode); + fwnode_handle_put(tcpm->tcpc.fwnode); } static const struct pmic_typec_resources pm8150b_typec_res = { -- cgit From b8423a2f5814dbf055ed7c41f25bfe91c2066cbe Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Sun, 20 Oct 2024 14:56:35 +0200 Subject: usb: typec: qcom-pmic-typec: fix missing fwnode removal in error path If drm_dp_hpd_bridge_register() fails, the probe function returns without removing the fwnode via fwnode_handle_put(), leaking the resource. Jump to fwnode_remove if drm_dp_hpd_bridge_register() fails to remove the fwnode acquired with device_get_named_child_node(). Cc: stable@vger.kernel.org Fixes: 7d9f1b72b296 ("usb: typec: qcom-pmic-typec: switch to DRM_AUX_HPD_BRIDGE") Signed-off-by: Javier Carrasco Reviewed-by: Dmitry Baryshkov Acked-by: Bryan O'Donoghue Reviewed-by: Heikki Krogerus Link: https://lore.kernel.org/r/20241020-qcom_pmic_typec-fwnode_remove-v2-2-7054f3d2e215@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c b/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c index 7d9d37c16fad..b80eb2d78d88 100644 --- a/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c +++ b/drivers/usb/typec/tcpm/qcom/qcom_pmic_typec.c @@ -93,8 +93,10 @@ static int qcom_pmic_typec_probe(struct platform_device *pdev) return -EINVAL; bridge_dev = devm_drm_dp_hpd_bridge_alloc(tcpm->dev, to_of_node(tcpm->tcpc.fwnode)); - if (IS_ERR(bridge_dev)) - return PTR_ERR(bridge_dev); + if (IS_ERR(bridge_dev)) { + ret = PTR_ERR(bridge_dev); + goto fwnode_remove; + } tcpm->tcpm_port = tcpm_register_port(tcpm->dev, &tcpm->tcpc); if (IS_ERR(tcpm->tcpm_port)) { -- cgit From 9581acb91eaf5bbe70086bbb6fca808220d358ba Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Mon, 21 Oct 2024 22:45:29 +0200 Subject: usb: typec: fix unreleased fwnode_handle in typec_port_register_altmodes() The 'altmodes_node' fwnode_handle is never released after it is no longer required, which leaks the resource. Add the required call to fwnode_handle_put() when 'altmodes_node' is no longer required. 
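For illustration, a kernel-style sketch (hypothetical names) of the reference rule behind these fixes: a node obtained with device_get_named_child_node() is released with fwnode_handle_put(), and every exit path, including error paths, must drop it exactly once.

```c
#include <linux/device.h>
#include <linux/property.h>

/* hypothetical consumer of an "altmodes" child node */
static int example_parse_altmodes(struct device *dev)
{
	struct fwnode_handle *node;
	int ret = 0;

	node = device_get_named_child_node(dev, "altmodes");
	if (!node)
		return 0;	/* nothing to parse */

	if (!fwnode_property_present(node, "svid"))
		ret = -EINVAL;	/* error path still falls through to the put */

	fwnode_handle_put(node);	/* single release point for all paths */
	return ret;
}
```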
Cc: stable@vger.kernel.org Fixes: 7b458a4c5d73 ("usb: typec: Add typec_port_register_altmodes()") Reviewed-by: Heikki Krogerus Signed-off-by: Javier Carrasco Link: https://lore.kernel.org/r/20241021-typec-class-fwnode_handle_put-v2-1-3281225d3d27@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/class.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/usb/typec/class.c b/drivers/usb/typec/class.c index d61b4c74648d..1eb240604cf6 100644 --- a/drivers/usb/typec/class.c +++ b/drivers/usb/typec/class.c @@ -2341,6 +2341,7 @@ void typec_port_register_altmodes(struct typec_port *port, altmodes[index] = alt; index++; } + fwnode_handle_put(altmodes_node); } EXPORT_SYMBOL_GPL(typec_port_register_altmodes); -- cgit From 1ab0b9ae587373f9f800b6fda01b8faf02b3530b Mon Sep 17 00:00:00 2001 From: Javier Carrasco Date: Mon, 21 Oct 2024 22:45:30 +0200 Subject: usb: typec: use cleanup facility for 'altmodes_node' Use the __free() macro for 'altmodes_node' to automatically release the node when it goes out of scope, removing the need for explicit calls to fwnode_handle_put(). Suggested-by: Heikki Krogerus Signed-off-by: Javier Carrasco Reviewed-by: Heikki Krogerus Link: https://lore.kernel.org/r/20241021-typec-class-fwnode_handle_put-v2-2-3281225d3d27@gmail.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/class.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/usb/typec/class.c b/drivers/usb/typec/class.c index 1eb240604cf6..58f40156de56 100644 --- a/drivers/usb/typec/class.c +++ b/drivers/usb/typec/class.c @@ -2293,7 +2293,7 @@ void typec_port_register_altmodes(struct typec_port *port, const struct typec_altmode_ops *ops, void *drvdata, struct typec_altmode **altmodes, size_t n) { - struct fwnode_handle *altmodes_node, *child; + struct fwnode_handle *child; struct typec_altmode_desc desc; struct typec_altmode *alt; size_t index = 0; @@ -2301,7 +2301,9 @@ void typec_port_register_altmodes(struct typec_port *port, u32 vdo; int ret; - altmodes_node = device_get_named_child_node(&port->dev, "altmodes"); + struct fwnode_handle *altmodes_node __free(fwnode_handle) = + device_get_named_child_node(&port->dev, "altmodes"); + if (!altmodes_node) return; /* No altmodes specified */ @@ -2341,7 +2343,6 @@ void typec_port_register_altmodes(struct typec_port *port, altmodes[index] = alt; index++; } - fwnode_handle_put(altmodes_node); } EXPORT_SYMBOL_GPL(typec_port_register_altmodes); -- cgit From dc1308bee1ed03b4d698d77c8bd670d399dcd04d Mon Sep 17 00:00:00 2001 From: Li Zhijian Date: Tue, 29 Oct 2024 11:13:24 +0800 Subject: selftests/watchdog-test: Fix system accidentally reset after watchdog-test When running watchdog-test with 'make run_tests', the watchdog-test will be terminated by a timeout signal(SIGTERM) due to the test timemout. And then, a system reboot would happen due to watchdog not stop. see the dmesg as below: ``` [ 1367.185172] watchdog: watchdog0: watchdog did not stop! ``` Fix it by registering more signals(including SIGTERM) in watchdog-test, where its signal handler will stop the watchdog. After that # timeout 1 ./watchdog-test Watchdog Ticking Away! . Stopping watchdog ticks... 
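For illustration, a minimal user-space sketch of the pattern the test adopts: register a handler for the signals that may end the program so the cleanup (here a placeholder message standing in for stopping the watchdog) always runs before exit. SIGKILL cannot be caught, so it is not registered in this sketch; the names and output are illustrative only.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void term(int sig)
{
	static const char msg[] = "\nStopping watchdog ticks...\n";

	(void)sig;
	/* only async-signal-safe calls belong in a signal handler */
	(void)write(STDOUT_FILENO, msg, sizeof(msg) - 1);
	_exit(0);
}

int main(void)
{
	/* catch the usual ways a test harness terminates the program */
	signal(SIGINT, term);
	signal(SIGTERM, term);
	signal(SIGQUIT, term);

	for (;;) {
		printf(".");
		fflush(stdout);
		sleep(1);	/* keep-alive loop; a test timeout delivers SIGTERM */
	}
	return 0;
}
```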
Link: https://lore.kernel.org/all/20241029031324.482800-1-lizhijian@fujitsu.com/ Signed-off-by: Li Zhijian Reviewed-by: Shuah Khan Signed-off-by: Shuah Khan --- tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c index bc71cbca0dde..a1f506ba5578 100644 --- a/tools/testing/selftests/watchdog/watchdog-test.c +++ b/tools/testing/selftests/watchdog/watchdog-test.c @@ -334,7 +334,13 @@ int main(int argc, char *argv[]) printf("Watchdog Ticking Away!\n"); + /* + * Register the signals + */ signal(SIGINT, term); + signal(SIGTERM, term); + signal(SIGKILL, term); + signal(SIGQUIT, term); while (1) { keep_alive(); -- cgit From fdce49b5da6e0fb6d077986dec3e90ef2b094b50 Mon Sep 17 00:00:00 2001 From: Zijun Hu Date: Sun, 20 Oct 2024 17:33:42 +0800 Subject: usb: phy: Fix API devm_usb_put_phy() can not release the phy For devm_usb_put_phy(), its comment says it needs to invoke usb_put_phy() to release the phy, but it does not do that actually, so it can not fully undo what the API devm_usb_get_phy() does, that is wrong, fixed by using devres_release() instead of devres_destroy() within the API. Fixes: cedf8602373a ("usb: phy: move bulk of otg/otg.c to phy/phy.c") Cc: stable@vger.kernel.org Signed-off-by: Zijun Hu Link: https://lore.kernel.org/r/20241020-usb_phy_fix-v1-1-7f79243b8e1e@quicinc.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/phy/phy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/usb/phy/phy.c b/drivers/usb/phy/phy.c index 06e0fb23566c..06f789097989 100644 --- a/drivers/usb/phy/phy.c +++ b/drivers/usb/phy/phy.c @@ -628,7 +628,7 @@ void devm_usb_put_phy(struct device *dev, struct usb_phy *phy) { int r; - r = devres_destroy(dev, devm_usb_phy_release, devm_usb_phy_match, phy); + r = devres_release(dev, devm_usb_phy_release, devm_usb_phy_match, phy); dev_WARN_ONCE(dev, r, "couldn't find PHY resource\n"); } EXPORT_SYMBOL_GPL(devm_usb_put_phy); -- cgit From afb92ad8733ef0a2843cc229e4d96aead80bc429 Mon Sep 17 00:00:00 2001 From: Amit Sunil Dhamne Date: Wed, 23 Oct 2024 19:22:30 -0700 Subject: usb: typec: tcpm: restrict SNK_WAIT_CAPABILITIES_TIMEOUT transitions to non self-powered devices PD3.1 spec ("8.3.3.3.3 PE_SNK_Wait_for_Capabilities State") mandates that the policy engine perform a hard reset when SinkWaitCapTimer expires. Instead the code explicitly does a GET_SOURCE_CAP when the timer expires as part of SNK_WAIT_CAPABILITIES_TIMEOUT. Due to this the following compliance test failures are reported by the compliance tester (added excerpts from the PD Test Spec): * COMMON.PROC.PD.2#1: The Tester receives a Get_Source_Cap Message from the UUT. This message is valid except the following conditions: [COMMON.PROC.PD.2#1] a. The check fails if the UUT sends this message before the Tester has established an Explicit Contract ... * TEST.PD.PROT.SNK.4: ... 4. The check fails if the UUT does not send a Hard Reset between tTypeCSinkWaitCap min and max. [TEST.PD.PROT.SNK.4#1] The delay is between the VBUS present vSafe5V min and the time of the first bit of Preamble of the Hard Reset sent by the UUT. For the purpose of interoperability, restrict the quirk introduced in https://lore.kernel.org/all/20240523171806.223727-1-sebastian.reichel@collabora.com/ to only non self-powered devices as battery powered devices will not have the issue mentioned in that commit. 
Cc: stable@vger.kernel.org Fixes: 122968f8dda8 ("usb: typec: tcpm: avoid resets for missing source capability messages") Reported-by: Badhri Jagan Sridharan Closes: https://lore.kernel.org/all/CAPTae5LAwsVugb0dxuKLHFqncjeZeJ785nkY4Jfd+M-tCjHSnQ@mail.gmail.com/ Signed-off-by: Amit Sunil Dhamne Reviewed-by: Badhri Jagan Sridharan Reviewed-by: Heikki Krogerus Tested-by: Xu Yang Reviewed-by: Sebastian Reichel Link: https://lore.kernel.org/r/20241024022233.3276995-1-amitsd@google.com Signed-off-by: Greg Kroah-Hartman --- drivers/usb/typec/tcpm/tcpm.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c index fc619478200f..7ae341a40342 100644 --- a/drivers/usb/typec/tcpm/tcpm.c +++ b/drivers/usb/typec/tcpm/tcpm.c @@ -4515,7 +4515,8 @@ static inline enum tcpm_state hard_reset_state(struct tcpm_port *port) return ERROR_RECOVERY; if (port->pwr_role == TYPEC_SOURCE) return SRC_UNATTACHED; - if (port->state == SNK_WAIT_CAPABILITIES_TIMEOUT) + if (port->state == SNK_WAIT_CAPABILITIES || + port->state == SNK_WAIT_CAPABILITIES_TIMEOUT) return SNK_READY; return SNK_UNATTACHED; } @@ -5043,8 +5044,11 @@ static void run_state_machine(struct tcpm_port *port) tcpm_set_state(port, SNK_SOFT_RESET, PD_T_SINK_WAIT_CAP); } else { - tcpm_set_state(port, SNK_WAIT_CAPABILITIES_TIMEOUT, - PD_T_SINK_WAIT_CAP); + if (!port->self_powered) + upcoming_state = SNK_WAIT_CAPABILITIES_TIMEOUT; + else + upcoming_state = hard_reset_state(port); + tcpm_set_state(port, upcoming_state, PD_T_SINK_WAIT_CAP); } break; case SNK_WAIT_CAPABILITIES_TIMEOUT: -- cgit From 7c18d4811000945677a8531e89de3e17582e8a36 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 15 Oct 2024 13:12:36 +0200 Subject: mm/pagewalk: fix usage of pmd_leaf()/pud_leaf() without present check pmd_leaf()/pud_leaf() only implies a pmd_present()/pud_present() check on some architectures. We really should check for pmd_present()/pud_present() first. This should explain the report we got on ppc64 (which has CONFIG_PGTABLE_HAS_HUGE_LEAVES set in the config) that triggered: VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp))); Likely we had a PMD migration entry for which pmd_leaf() did not trigger. We raced with restoring the PMD migration entry, and suddenly saw a pmd_leaf(). In this case, pte_offset_map_lock() saved us from more trouble, because it rechecks the PMD value, but we would not have processed the migration entry -- which is not too bad because the only user of FW_MIGRATION is KSM for unsharing, and KSM only applies to small folios. Further, we shouldn't re-read the PMD/PUD value for our warning, the primary purpose of the VM_WARN_ON_ONCE() is to find spurious use of pmd_leaf()/pud_leaf() without CONFIG_PGTABLE_HAS_HUGE_LEAVES. As a side note, we are currently not implementing FW_MIGRATION support for PUD migration entries, which likely should exist due to hugetlb. Add a TODO so this won't fall through the cracks if more FW_MIGRATION users get added. Was able to write a quick reproducer and verify that the issue no longer triggers with this fix. https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/reproducers/move-pages-pmd-leaf.c Without this fix after a couple of seconds in a VM with 2 NUMA nodes: [ 54.333753] ------------[ cut here ]------------ [ 54.334901] WARNING: CPU: 20 PID: 1704 at mm/pagewalk.c:815 folio_walk_start+0x48f/0x6e0 [ 54.336455] Modules linked in: ... 
[ 54.345009] CPU: 20 UID: 0 PID: 1704 Comm: move-pages-pmd- Not tainted 6.12.0-rc2+ #81 [ 54.346529] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 [ 54.348191] RIP: 0010:folio_walk_start+0x48f/0x6e0 [ 54.349134] Code: b5 ad 48 8d 35 00 00 00 00 e8 6d 59 d7 ff e8 08 74 da ff e9 9c fe ff ff 4c 8b 7c 24 08 4c 89 ff e8 26 2b be 00 e9 8a fe ff ff <0f> 0b e9 ec fe ff ff f7 c2 ff 0f 00 00 0f 85 81 fe ff ff 48 8b 02 [ 54.352660] RSP: 0018:ffffb7e4c430bc78 EFLAGS: 00010282 [ 54.353679] RAX: 80000002a3e008e7 RBX: ffff9946039aa580 RCX: ffff994380000000 [ 54.355056] RDX: ffff994606aec000 RSI: 00007f004b000000 RDI: 0000000000000000 [ 54.356440] RBP: 00007f004b000000 R08: 0000000000000591 R09: 0000000000000001 [ 54.357820] R10: 0000000000000200 R11: 0000000000000001 R12: ffffb7e4c430bd10 [ 54.359198] R13: ffff994606aec2c0 R14: 0000000000000002 R15: ffff994604a89b00 [ 54.360564] FS: 00007f004ae006c0(0000) GS:ffff9947f7400000(0000) knlGS:0000000000000000 [ 54.362111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 54.363242] CR2: 00007f004adffe58 CR3: 0000000281e12005 CR4: 0000000000770ef0 [ 54.364615] PKRU: 55555554 [ 54.365153] Call Trace: [ 54.365646] [ 54.366073] ? __warn.cold+0xb7/0x14d [ 54.366796] ? folio_walk_start+0x48f/0x6e0 [ 54.367628] ? report_bug+0xff/0x140 [ 54.368324] ? handle_bug+0x58/0x90 [ 54.369019] ? exc_invalid_op+0x17/0x70 [ 54.369771] ? asm_exc_invalid_op+0x1a/0x20 [ 54.370606] ? folio_walk_start+0x48f/0x6e0 [ 54.371415] ? folio_walk_start+0x9e/0x6e0 [ 54.372227] do_pages_move+0x1c5/0x680 [ 54.372972] kernel_move_pages+0x1a1/0x2b0 [ 54.373804] __x64_sys_move_pages+0x25/0x30 Link: https://lkml.kernel.org/r/20241015111236.1290921-1-david@redhat.com Fixes: aa39ca6940f1 ("mm/pagewalk: introduce folio_walk_start() + folio_walk_end()") Signed-off-by: David Hildenbrand Reported-by: syzbot+7d917f67c05066cec295@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/670d3248.050a0220.3e960.0064.GAE@google.com Acked-by: Kirill A. Shutemov Acked-by: Qi Zheng Cc: Jann Horn Signed-off-by: Andrew Morton --- mm/pagewalk.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 461ea3bbd8d9..5f9f01532e67 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -744,7 +744,8 @@ struct folio *folio_walk_start(struct folio_walk *fw, pud = pudp_get(pudp); if (pud_none(pud)) goto not_found; - if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pud_leaf(pud)) { + if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && + (!pud_present(pud) || pud_leaf(pud))) { ptl = pud_lock(vma->vm_mm, pudp); pud = pudp_get(pudp); @@ -753,6 +754,10 @@ struct folio *folio_walk_start(struct folio_walk *fw, fw->pudp = pudp; fw->pud = pud; + /* + * TODO: FW_MIGRATION support for PUD migration entries + * once there are relevant users. 
+ */ if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) { spin_unlock(ptl); goto not_found; @@ -769,12 +774,13 @@ struct folio *folio_walk_start(struct folio_walk *fw, } pmd_table: - VM_WARN_ON_ONCE(pud_leaf(*pudp)); + VM_WARN_ON_ONCE(!pud_present(pud) || pud_leaf(pud)); pmdp = pmd_offset(pudp, addr); pmd = pmdp_get_lockless(pmdp); if (pmd_none(pmd)) goto not_found; - if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pmd_leaf(pmd)) { + if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && + (!pmd_present(pmd) || pmd_leaf(pmd))) { ptl = pmd_lock(vma->vm_mm, pmdp); pmd = pmdp_get(pmdp); @@ -786,7 +792,7 @@ pmd_table: if (pmd_none(pmd)) { spin_unlock(ptl); goto not_found; - } else if (!pmd_leaf(pmd)) { + } else if (pmd_present(pmd) && !pmd_leaf(pmd)) { spin_unlock(ptl); goto pte_table; } else if (pmd_present(pmd)) { @@ -812,7 +818,7 @@ pmd_table: } pte_table: - VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp))); + VM_WARN_ON_ONCE(!pmd_present(pmd) || pmd_leaf(pmd)); ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl); if (!ptep) goto not_found; -- cgit From f64e67e5d3a45a4a04286c47afade4b518acd47b Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 15 Oct 2024 18:56:05 +0100 Subject: fork: do not invoke uffd on fork if error occurs Patch series "fork: do not expose incomplete mm on fork". During fork we may place the virtual memory address space into an inconsistent state before the fork operation is complete. In addition, we may encounter an error during the fork operation that indicates that the virtual memory address space is invalidated. As a result, we should not be exposing it in any way to external machinery that might interact with the mm or VMAs, machinery that is not designed to deal with incomplete state. We specifically update the fork logic to defer khugepaged and ksm to the end of the operation and only to be invoked if no error arose, and disallow uffd from observing fork events should an error have occurred. This patch (of 2): Currently on fork we expose the virtual address space of a process to userland unconditionally if uffd is registered in VMAs, regardless of whether an error arose in the fork. This is performed in dup_userfaultfd_complete() which is invoked unconditionally, and performs two duties - invoking registered handlers for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down userfaultfd_fork_ctx objects established in dup_userfaultfd(). This is problematic, because the virtual address space may not yet be correctly initialised if an error arose. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. We address this by, on fork error, ensuring that we roll back state that we would otherwise expect to clean up through the event being handled by userland and perform the memory freeing duty otherwise performed by dup_userfaultfd_complete(). We do this by implementing a new function, dup_userfaultfd_fail(), which performs the same loop, only decrementing reference counts. Note that we perform mmgrab() on the parent and child mm's, however userfaultfd_ctx_put() will mmdrop() this once the reference count drops to zero, so we will avoid memory leaks correctly here. 
Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes Reported-by: Jann Horn Reviewed-by: Jann Horn Reviewed-by: Liam R. Howlett Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Cc: Linus Torvalds Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- fs/userfaultfd.c | 28 ++++++++++++++++++++++++++++ include/linux/userfaultfd_k.h | 5 +++++ kernel/fork.c | 5 ++++- 3 files changed, 37 insertions(+), 1 deletion(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 68cdd89c97a3..7c0bd0b55f88 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -692,6 +692,34 @@ void dup_userfaultfd_complete(struct list_head *fcs) } } +void dup_userfaultfd_fail(struct list_head *fcs) +{ + struct userfaultfd_fork_ctx *fctx, *n; + + /* + * An error has occurred on fork, we will tear memory down, but have + * allocated memory for fctx's and raised reference counts for both the + * original and child contexts (and on the mm for each as a result). + * + * These would ordinarily be taken care of by a user handling the event, + * but we are no longer doing so, so manually clean up here. + * + * mm tear down will take care of cleaning up VMA contexts. + */ + list_for_each_entry_safe(fctx, n, fcs, list) { + struct userfaultfd_ctx *octx = fctx->orig; + struct userfaultfd_ctx *ctx = fctx->new; + + atomic_dec(&octx->mmap_changing); + VM_BUG_ON(atomic_read(&octx->mmap_changing) < 0); + userfaultfd_ctx_put(octx); + userfaultfd_ctx_put(ctx); + + list_del(&fctx->list); + kfree(fctx); + } +} + void mremap_userfaultfd_prep(struct vm_area_struct *vma, struct vm_userfaultfd_ctx *vm_ctx) { diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 9fc6ce15c499..cb40f1a1d081 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -249,6 +249,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma, extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *); extern void dup_userfaultfd_complete(struct list_head *); +void dup_userfaultfd_fail(struct list_head *); extern void mremap_userfaultfd_prep(struct vm_area_struct *, struct vm_userfaultfd_ctx *); @@ -351,6 +352,10 @@ static inline void dup_userfaultfd_complete(struct list_head *l) { } +static inline void dup_userfaultfd_fail(struct list_head *l) +{ +} + static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma, struct vm_userfaultfd_ctx *ctx) { diff --git a/kernel/fork.c b/kernel/fork.c index 89ceb4a68af2..597b477dd491 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -775,7 +775,10 @@ out: mmap_write_unlock(mm); flush_tlb_mm(oldmm); mmap_write_unlock(oldmm); - dup_userfaultfd_complete(&uf); + if (!retval) + dup_userfaultfd_complete(&uf); + else + dup_userfaultfd_fail(&uf); fail_uprobe_end: uprobe_end_dup_mmap(); return retval; -- cgit From 985da552a98e27096444508ce5d853244019111f Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Tue, 15 Oct 2024 18:56:06 +0100 Subject: fork: only invoke khugepaged, ksm hooks if no error There is no reason to invoke these hooks early against an mm that is in an incomplete state. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. 
Their placement early in dup_mmap() only appears to have been meaningful for early error checking, and since functionally it'd require a very small allocation to fail (in practice 'too small to fail') that'd only occur in the most dire circumstances, meaning the fork would fail or be OOM'd in any case. Since both khugepaged and KSM tracking are there to provide optimisations to memory performance rather than critical functionality, it doesn't really matter all that much if, under such dire memory pressure, we fail to register an mm with these. As a result, we follow the example of commit d2081b2bf819 ("mm: khugepaged: make khugepaged_enter() void function") and make ksm_fork() a void function also. We only expose the mm to these functions once we are done with them and only if no error occurred in the fork operation. Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes Reported-by: Jann Horn Reviewed-by: Liam R. Howlett Reviewed-by: Vlastimil Babka Reviewed-by: Jann Horn Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Cc: Linus Torvalds Cc: Signed-off-by: Andrew Morton --- include/linux/ksm.h | 10 ++++------ kernel/fork.c | 7 ++----- 2 files changed, 6 insertions(+), 11 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 11690dacd986..ec9c05044d4f 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -54,12 +54,11 @@ static inline long mm_ksm_zero_pages(struct mm_struct *mm) return atomic_long_read(&mm->ksm_zero_pages); } -static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) +static inline void ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) { + /* Adding mm to ksm is best effort on fork. */ if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags)) - return __ksm_enter(mm); - - return 0; + __ksm_enter(mm); } static inline int ksm_execve(struct mm_struct *mm) @@ -107,9 +106,8 @@ static inline int ksm_disable(struct mm_struct *mm) return 0; } -static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) +static inline void ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) { - return 0; } static inline int ksm_execve(struct mm_struct *mm) diff --git a/kernel/fork.c b/kernel/fork.c index 597b477dd491..3bf38d260cb3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -653,11 +653,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, mm->exec_vm = oldmm->exec_vm; mm->stack_vm = oldmm->stack_vm; - retval = ksm_fork(mm, oldmm); - if (retval) - goto out; - khugepaged_fork(mm, oldmm); - /* Use __mt_dup() to efficiently build an identical maple tree. */ retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); if (unlikely(retval)) @@ -760,6 +755,8 @@ loop_out: vma_iter_free(&vmi); if (!retval) { mt_set_in_rcu(vmi.mas.tree); + ksm_fork(mm, oldmm); + khugepaged_fork(mm, oldmm); } else if (mpnt) { /* * The entire maple tree has already been duplicated. If the -- cgit From 281dd25c1a018261a04d1b8bf41a0674000bfe38 Mon Sep 17 00:00:00 2001 From: Matt Fleming Date: Fri, 11 Oct 2024 13:07:37 +0100 Subject: mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to fail even though free pages are available in the highatomic reserves. GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock() since it's only run from reclaim. 
Given that such allocations will pass the watermarks in __zone_watermark_unusable_free(), it makes sense to fallback to highatomic reserves the same way that ALLOC_OOM can. This fixes order-0 page allocation failures observed on Cloudflare's fleet when handling network packets: kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-7 CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1 Hardware name: MACHINE Call Trace: dump_stack_lvl+0x3c/0x50 warn_alloc+0x13a/0x1c0 __alloc_pages_slowpath.constprop.0+0xc9d/0xd10 __alloc_pages+0x327/0x340 __napi_alloc_skb+0x16d/0x1f0 bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en] bnxt_rx_pkt+0x201/0x15e0 [bnxt_en] __bnxt_poll_work+0x156/0x2b0 [bnxt_en] bnxt_poll+0xd9/0x1c0 [bnxt_en] __napi_poll+0x2b/0x1b0 bpf_trampoline_6442524138+0x7d/0x1000 __napi_poll+0x5/0x1b0 net_rx_action+0x342/0x740 handle_softirqs+0xcf/0x2b0 irq_exit_rcu+0x6c/0x90 sysvec_apic_timer_interrupt+0x72/0x90 [mfleming@cloudflare.com: update comment] Link: https://lkml.kernel.org/r/20241015125158.3597702-1-matt@readmodwrite.com Link: https://lkml.kernel.org/r/20241011120737.3300370-1-matt@readmodwrite.com Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/ Fixes: 1d91df85f399 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs") Signed-off-by: Matt Fleming Suggested-by: Vlastimil Babka Reviewed-by: Vlastimil Babka Cc: Mel Gorman Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton --- mm/page_alloc.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8afab64814dc..94a2ffe28008 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2893,12 +2893,12 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, page = __rmqueue(zone, order, migratetype, alloc_flags); /* - * If the allocation fails, allow OOM handling access - * to HIGHATOMIC reserves as failing now is worse than - * failing a high-order atomic allocation in the - * future. + * If the allocation fails, allow OOM handling and + * order-0 (atomic) allocs access to HIGHATOMIC + * reserves as failing now is worse than failing a + * high-order atomic allocation in the future. */ - if (!page && (alloc_flags & ALLOC_OOM)) + if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK))) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { -- cgit From 79f3d123caedbac30a6fd75f9597b2a60a89d513 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 15 Oct 2024 21:34:55 -0400 Subject: mm/mmap: fix race in mmap_region() with ftruncate() Avoiding the zeroing of the vma tree in mmap_region() introduced a race with truncate in the page table walk. To avoid any races, create a hole in the rmap during the operation by clearing the pagetable entries earlier under the mmap write lock and (critically) before the new vma is installed into the vma tree. The result is that the old vma(s) are left in the vma tree, but free_pgtables() removes them from the rmap and clears the ptes while holding the necessary locks. This change extends the fix required for hugetblfs and the call_mmap() function by moving the cleanup higher in the function and running it unconditionally. Link: https://lkml.kernel.org/r/20241016013455.2241533-1-Liam.Howlett@oracle.com Fixes: f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") Signed-off-by: Liam R. 
Howlett Reported-by: Jann Horn Closes: https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/ Reviewed-by: Jann Horn Reviewed-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Cc: Matthew Wilcox Cc: David Hildenbrand Signed-off-by: Andrew Morton --- mm/mmap.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 9c0fb43064b5..3f1419460be3 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1418,6 +1418,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vmg.flags = vm_flags; } + /* + * clear PTEs while the vma is still in the tree so that rmap + * cannot race with the freeing later in the truncate scenario. + * This is also needed for call_mmap(), which is why vm_ops + * close function is called. + */ + vms_clean_up_area(&vms, &mas_detach); vma = vma_merge_new_range(&vmg); if (vma) goto expanded; @@ -1439,11 +1446,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (file) { vma->vm_file = get_file(file); - /* - * call_mmap() may map PTE, so ensure there are no existing PTEs - * and call the vm_ops close function if one exists. - */ - vms_clean_up_area(&vms, &mas_detach); error = call_mmap(file, vma); if (error) goto unmap_and_free_vma; -- cgit
From b7c5f9a1fb9b40491d8b564b7eb9df26128cda3f Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Tue, 15 Oct 2024 13:15:54 +0800 Subject: resource: remove dependency on SPARSEMEM from GET_FREE_REGION We want to use the functions (get_free_mem_region()) configured via GET_FREE_REGION in resource kunit tests. However, GET_FREE_REGION depends on SPARSEMEM now. This prevents the resource kunit tests from being built on architectures lacking SPARSEMEM, or causes a config warning as follows, WARNING: unmet direct dependencies detected for GET_FREE_REGION Depends on [n]: SPARSEMEM [=n] Selected by [y]: - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y] When get_free_mem_region() was introduced the only consumers were those looking to pass the address range to memremap_pages(). That address range needed to be mindful of the maximum addressable platform physical address, which at the time only SPARSEMEM defined via MAX_PHYSMEM_BITS. Given that memremap_pages() also depended on SPARSEMEM via ZONE_DEVICE, it was easier to just depend on that definition than invent a general MAX_PHYSMEM_BITS concept outside of SPARSEMEM. Turns out that decision was buggy and did not account for KASAN consumption of physical address space. That problem was resolved recently with commit ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space"), and GET_FREE_REGION dropped its MAX_PHYSMEM_BITS dependency. Then commit 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") went ahead and fixed up the only remaining dependency on SPARSEMEM, which was the use of the PA_SECTION_SHIFT macro for setting the default alignment. A PAGE_SIZE fallback is fine in the SPARSEMEM=n case. With those build dependencies gone, GET_FREE_REGION no longer depends on SPARSEMEM. So, this patch removes the dependency on SPARSEMEM from GET_FREE_REGION to fix the build issues.
Link: https://lkml.kernel.org/r/20241016014730.339369-1-ying.huang@intel.com Link: https://lore.kernel.org/lkml/20240922225041.603186-1-linux@roeck-us.net/ Link: https://lkml.kernel.org/r/20241015051554.294734-1-ying.huang@intel.com Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" Co-developed-by: Dan Williams Signed-off-by: Dan Williams Tested-by: Guenter Roeck Acked-by: David Hildenbrand Tested-by: Nathan Chancellor # build Cc: Arnd Bergmann Cc: Jonathan Cameron Signed-off-by: Andrew Morton --- mm/Kconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index 4c9f5ea13271..33fa51d608dc 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1085,7 +1085,6 @@ config HMM_MIRROR depends on MMU config GET_FREE_REGION - depends on SPARSEMEM bool config DEVICE_PRIVATE -- cgit
From 1db272864ff250b5e607283eaec819e1186c8e26 Mon Sep 17 00:00:00 2001 From: Sabyrzhan Tasbolatov Date: Wed, 16 Oct 2024 20:24:07 +0500 Subject: x86/traps: move kmsan check after instrumentation_begin During an x86_64 kernel build with CONFIG_KMSAN, objtool warns as follows: AR built-in.a AR vmlinux.a LD vmlinux.o vmlinux.o: warning: objtool: handle_bug+0x4: call to kmsan_unpoison_entry_regs() leaves .noinstr.text section OBJCOPY modules.builtin.modinfo GEN modules.builtin MODPOST Module.symvers CC .vmlinux.export.o Moving kmsan_unpoison_entry_regs() _after_ instrumentation_begin() fixes the warning. The decode_bug(regs->ip, &imm) call is left before the KMSAN unpoisoning because it has an early-return condition; if it were moved after instrumentation_begin(), objtool would instead warn about "return with instrumentation enabled". Hence, I'm concerned that regs will not be KMSAN unpoisoned if `ud_type == BUG_NONE` is true. Link: https://lkml.kernel.org/r/20241016152407.3149001-1-snovitoll@gmail.com Fixes: ba54d194f8da ("x86/traps: avoid KMSAN bugs originating from handle_bug()") Signed-off-by: Sabyrzhan Tasbolatov Reviewed-by: Alexander Potapenko Cc: Borislav Petkov (AMD) Cc: Dave Hansen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Signed-off-by: Andrew Morton --- arch/x86/kernel/traps.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index d05392db5d0f..2dbadf347b5f 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -261,12 +261,6 @@ static noinstr bool handle_bug(struct pt_regs *regs) int ud_type; u32 imm; - /* - * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug() - * is a rare case that uses @regs without passing them to - * irqentry_enter(). - */ - kmsan_unpoison_entry_regs(regs); ud_type = decode_bug(regs->ip, &imm); if (ud_type == BUG_NONE) return handled; @@ -275,6 +269,12 @@ static noinstr bool handle_bug(struct pt_regs *regs) * All lies, just get the WARN/BUG out. */ instrumentation_begin(); + /* + * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug() + * is a rare case that uses @regs without passing them to + * irqentry_enter(). + */ + kmsan_unpoison_entry_regs(regs); /* * Since we're emulating a CALL with exceptions, restore the interrupt * state to what it was at the exception site.
-- cgit From 14611508cb5bf031f85bae58704c9218681d8e07 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Wed, 16 Oct 2024 17:07:53 +0200 Subject: mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs have already been torn down halfway (in a way we can't undo) but are still present in the maple tree. At this point, we *must* remove the VMAs from the VMA tree, otherwise we get UAF. Because removing VMA tree nodes can require memory allocation, the existing code has an error path which tries to handle this by reattaching the VMAs; but that can't be done safely. A nicer way to fix it would probably be to preallocate enough maple tree nodes for the removal before the point of no return, or something like that; but for now, fix it the easy and kinda ugly way, by marking this allocation __GFP_NOFAIL. Link: https://lkml.kernel.org/r/20241016-fix-munmap-abort-v1-1-601c94b2240d@google.com Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure") Signed-off-by: Jann Horn Reviewed-by: Liam R. Howlett Acked-by: Vlastimil Babka Reviewed-by: Lorenzo Stoakes Signed-off-by: Andrew Morton --- mm/vma.h | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/mm/vma.h b/mm/vma.h index 819f994cf727..ebd78f1577f3 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -241,15 +241,9 @@ static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms, * failure method of leaving a gap where the MAP_FIXED mapping failed. */ mas_set_range(mas, vms->start, vms->end - 1); - if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) { - pr_warn_once("%s: (%d) Unable to abort munmap() operation\n", - current->comm, current->pid); - /* Leaving vmas detached and in-tree may hamper recovery */ - reattach_vmas(mas_detach); - } else { - /* Clean up the insertion of the unfortunate gap */ - vms_complete_munmap_vmas(vms, mas_detach); - } + mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL); + /* Clean up the insertion of the unfortunate gap */ + vms_complete_munmap_vmas(vms, mas_detach); } int -- cgit From d949d1d14fa281ace388b1de978e8f2cd52875cf Mon Sep 17 00:00:00 2001 From: Jeongjun Park Date: Mon, 9 Sep 2024 21:35:58 +0900 Subject: mm: shmem: fix data-race in shmem_getattr() I got the following KCSAN report during syzbot testing: ================================================================== BUG: KCSAN: data-race in generic_fillattr / inode_set_ctime_current write to 0xffff888102eb3260 of 4 bytes by task 6565 on cpu 1: inode_set_ctime_to_ts include/linux/fs.h:1638 [inline] inode_set_ctime_current+0x169/0x1d0 fs/inode.c:2626 shmem_mknod+0x117/0x180 mm/shmem.c:3443 shmem_create+0x34/0x40 mm/shmem.c:3497 lookup_open fs/namei.c:3578 [inline] open_last_lookups fs/namei.c:3647 [inline] path_openat+0xdbc/0x1f00 fs/namei.c:3883 do_filp_open+0xf7/0x200 fs/namei.c:3913 do_sys_openat2+0xab/0x120 fs/open.c:1416 do_sys_open fs/open.c:1431 [inline] __do_sys_openat fs/open.c:1447 [inline] __se_sys_openat fs/open.c:1442 [inline] __x64_sys_openat+0xf3/0x120 fs/open.c:1442 x64_sys_call+0x1025/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:258 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff888102eb3260 of 4 bytes by task 3498 on cpu 0: inode_get_ctime_nsec include/linux/fs.h:1623 [inline] inode_get_ctime include/linux/fs.h:1629 [inline] generic_fillattr+0x1dd/0x2f0 fs/stat.c:62 shmem_getattr+0x17b/0x200 
mm/shmem.c:1157 vfs_getattr_nosec fs/stat.c:166 [inline] vfs_getattr+0x19b/0x1e0 fs/stat.c:207 vfs_statx_path fs/stat.c:251 [inline] vfs_statx+0x134/0x2f0 fs/stat.c:315 vfs_fstatat+0xec/0x110 fs/stat.c:341 __do_sys_newfstatat fs/stat.c:505 [inline] __se_sys_newfstatat+0x58/0x260 fs/stat.c:499 __x64_sys_newfstatat+0x55/0x70 fs/stat.c:499 x64_sys_call+0x141f/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:263 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e value changed: 0x2755ae53 -> 0x27ee44d3 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 3498 Comm: udevd Not tainted 6.11.0-rc6-syzkaller-00326-gd1f2d51b711a-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 ================================================================== When calling generic_fillattr(), if you don't hold read lock, data-race will occur in inode member variables, which can cause unexpected behavior. Since there is no special protection when shmem_getattr() calls generic_fillattr(), data-race occurs by functions such as shmem_unlink() or shmem_mknod(). This can cause unexpected results, so commenting it out is not enough. Therefore, when calling generic_fillattr() from shmem_getattr(), it is appropriate to protect the inode using inode_lock_shared() and inode_unlock_shared() to prevent data-race. Link: https://lkml.kernel.org/r/20240909123558.70229-1-aha310510@gmail.com Fixes: 44a30220bc0a ("shmem: recalculate file inode when fstat") Signed-off-by: Jeongjun Park Reported-by: syzbot Cc: Hugh Dickins Cc: Yu Zhao Cc: Signed-off-by: Andrew Morton --- mm/shmem.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/shmem.c b/mm/shmem.c index c5adb987b23c..4ba1d00fabda 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1166,7 +1166,9 @@ static int shmem_getattr(struct mnt_idmap *idmap, stat->attributes_mask |= (STATX_ATTR_APPEND | STATX_ATTR_IMMUTABLE | STATX_ATTR_NODUMP); + inode_lock_shared(inode); generic_fillattr(idmap, request_mask, inode, stat); + inode_unlock_shared(inode); if (shmem_huge_global_enabled(inode, 0, 0, false, NULL, 0)) stat->blksize = HPAGE_PMD_SIZE; -- cgit From bc0a2f3a73fcdac651fca64df39306d1e5ebe3b0 Mon Sep 17 00:00:00 2001 From: Edward Adam Davis Date: Wed, 16 Oct 2024 19:43:47 +0800 Subject: ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow Syzbot reported a kernel BUG in ocfs2_truncate_inline. There are two reasons for this: first, the parameter value passed is greater than ocfs2_max_inline_data_with_xattr, second, the start and end parameters of ocfs2_truncate_inline are "unsigned int". So, we need to add a sanity check for byte_start and byte_len right before ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater than ocfs2_max_inline_data_with_xattr return -EINVAL. 
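To make the overflow concrete, here is a small user-space illustration (hypothetical values, not taken from the syzbot report) of what happens when a 64-bit byte range is narrowed to the "unsigned int" parameters mentioned above:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* A u64 offset far beyond any inline-data size... */
	uint64_t byte_start = 0x100000000ULL;	/* 4 GiB */

	/* ...silently wraps when passed as "unsigned int". */
	unsigned int narrowed = (unsigned int)byte_start;

	printf("u64 %llu becomes unsigned int %u\n",
	       (unsigned long long)byte_start, narrowed);	/* prints 0 */
	return 0;
}
```

This is why the range must be validated against the inline-data limit while it is still a u64, before the narrower function is called.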
Link: https://lkml.kernel.org/r/tencent_D48DB5122ADDAEDDD11918CFB68D93258C07@qq.com Fixes: 1afc32b95233 ("ocfs2: Write support for inline data") Signed-off-by: Edward Adam Davis Reported-by: syzbot+81092778aac03460d6b7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=81092778aac03460d6b7 Reviewed-by: Joseph Qi Cc: Joel Becker Cc: Joseph Qi Cc: Mark Fasheh Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Cc: Signed-off-by: Andrew Morton --- fs/ocfs2/file.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 58887456e3c5..06af21982c16 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -1787,6 +1787,14 @@ int ocfs2_remove_inode_range(struct inode *inode, return 0; if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + int id_count = ocfs2_max_inline_data_with_xattr(inode->i_sb, di); + + if (byte_start > id_count || byte_start + byte_len > id_count) { + ret = -EINVAL; + mlog_errno(ret); + goto out; + } + ret = ocfs2_truncate_inline(inode, di_bh, byte_start, byte_start + byte_len, 0); if (ret) { -- cgit From d95fb348f0160f562ac07fa201dbbaf14524381f Mon Sep 17 00:00:00 2001 From: Nobuhiro Iwamatsu Date: Wed, 16 Oct 2024 18:21:01 +0900 Subject: mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id The acquired memory blocks for reserved may include blocks outside of memory management. In this case, the nid variable is set to NUMA_NO_NODE (-1), so an error occurs in node_set(). This adds a check using numa_valid_node() to numa_clear_kernel_node_hotplug() that skips node_set() when nid is set to NUMA_NO_NODE. Link: https://lkml.kernel.org/r/1729070461-13576-1-git-send-email-nobuhiro1.iwamatsu@toshiba.co.jp Fixes: 87482708210f ("mm: introduce numa_memblks") Signed-off-by: Nobuhiro Iwamatsu Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Anshuman Khandual Suggested-by: Yuji Ishikawa Signed-off-by: Andrew Morton --- mm/numa_memblks.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index be52b93a9c58..a3877e9bc878 100644 --- a/mm/numa_memblks.c +++ b/mm/numa_memblks.c @@ -349,7 +349,7 @@ static void __init numa_clear_kernel_node_hotplug(void) for_each_reserved_mem_region(mb_region) { int nid = memblock_get_region_node(mb_region); - if (nid != MAX_NUMNODES) + if (numa_valid_node(nid)) node_set(nid, reserved_nodemask); } -- cgit From 41e192ad2779cae0102879612dfe46726e4396aa Mon Sep 17 00:00:00 2001 From: Ryusuke Konishi Date: Fri, 18 Oct 2024 04:33:10 +0900 Subject: nilfs2: fix kernel bug due to missing clearing of checked flag Syzbot reported that in directory operations after nilfs2 detects filesystem corruption and degrades to read-only, __block_write_begin_int(), which is called to prepare block writes, may fail the BUG_ON check for accesses exceeding the folio/page size, triggering a kernel bug. This was found to be because the "checked" flag of a page/folio was not cleared when it was discarded by nilfs2's own routine, which causes the sanity check of directory entries to be skipped when the directory page/folio is reloaded. So, fix that. This was necessary when the use of nilfs2's own page discard routine was applied to more than just metadata files. 
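As a rough, self-contained model of the caching pattern involved (hypothetical user-space types, not nilfs2's actual directory code): the "checked" flag records that a folio's directory entries were already validated, so a stale flag surviving a discard means validation is silently skipped on reload:

```c
#include <stdbool.h>
#include <stdio.h>

struct dir_folio {
	bool checked;		/* models the folio's PG_checked flag */
	bool entries_valid;	/* models whether the loaded contents are sane */
};

/* Validation runs only once per folio; the result is cached in 'checked'. */
bool dir_folio_ok(struct dir_folio *f)
{
	if (!f->checked) {
		if (!f->entries_valid)
			return false;	/* corruption would be caught here */
		f->checked = true;
	}
	return true;
}

int main(void)
{
	/* A stale 'checked' left over from a discarded folio hides corruption. */
	struct dir_folio folio = { .checked = true, .entries_valid = false };

	printf("corrupted folio accepted: %s\n",
	       dir_folio_ok(&folio) ? "yes" : "no");
	return 0;
}
```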
Link: https://lkml.kernel.org/r/20241017193359.5051-1-konishi.ryusuke@gmail.com Fixes: 8c26c4e2694a ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption") Signed-off-by: Ryusuke Konishi Reported-by: syzbot+d6ca2daf692c7a82f959@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d6ca2daf692c7a82f959 Cc: Signed-off-by: Andrew Morton --- fs/nilfs2/page.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c index 5436eb0424bd..10def4b55995 100644 --- a/fs/nilfs2/page.c +++ b/fs/nilfs2/page.c @@ -401,6 +401,7 @@ void nilfs_clear_folio_dirty(struct folio *folio) folio_clear_uptodate(folio); folio_clear_mappedtodisk(folio); + folio_clear_checked(folio); head = folio_buffers(folio); if (head) { -- cgit From b125a0def25a082ae944c9615208bf359abdb61c Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Thu, 17 Oct 2024 15:03:47 -0400 Subject: resource,kexec: walk_system_ram_res_rev must retain resource flags MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit walk_system_ram_res_rev() erroneously discards resource flags when passing the information to the callback. This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have these resources selected during kexec to store kexec buffers if that memory happens to be at placed above normal system ram. This leads to undefined behavior after reboot. If the kexec buffer is never touched, nothing happens. If the kexec buffer is touched, it could lead to a crash (like below) or undefined behavior. Tested on a system with CXL memory expanders with driver managed memory, TPM enabled, and CONFIG_IMA_KEXEC=y. Adding printk's showed the flags were being discarded and as a result the check for IORESOURCE_SYSRAM_DRIVER_MANAGED passes. find_next_iomem_res: name(System RAM (kmem)) start(10000000000) end(1034fffffff) flags(83000200) locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0) [.] BUG: unable to handle page fault for address: ffff89834ffff000 [.] #PF: supervisor read access in kernel mode [.] #PF: error_code(0x0000) - not-present page [.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0 [.] Oops: 0000 [#1] SMP [.] RIP: 0010:ima_restore_measurement_list+0x95/0x4b0 [.] RSP: 0018:ffffc900000d3a80 EFLAGS: 00010286 [.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000 [.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018 [.] RBP: ffffc900000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900 [.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000 [.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000 [.] FS: 0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000 [.] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [.] ata5: SATA link down (SStatus 0 SControl 300) [.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0 [.] PKRU: 55555554 [.] Call Trace: [.] [.] ? __die+0x78/0xc0 [.] ? page_fault_oops+0x2a8/0x3a0 [.] ? exc_page_fault+0x84/0x130 [.] ? asm_exc_page_fault+0x22/0x30 [.] ? ima_restore_measurement_list+0x95/0x4b0 [.] ? template_desc_init_fields+0x317/0x410 [.] ? crypto_alloc_tfm_node+0x9c/0xc0 [.] ? init_ima_lsm+0x30/0x30 [.] ima_load_kexec_buffer+0x72/0xa0 [.] ima_init+0x44/0xa0 [.] __initstub__kmod_ima__373_1201_init_ima7+0x1e/0xb0 [.] ? init_ima_lsm+0x30/0x30 [.] do_one_initcall+0xad/0x200 [.] ? idr_alloc_cyclic+0xaa/0x110 [.] ? new_slab+0x12c/0x420 [.] ? 
new_slab+0x12c/0x420 [.] ? number+0x12a/0x430 [.] ? sysvec_apic_timer_interrupt+0xa/0x80 [.] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [.] ? parse_args+0xd4/0x380 [.] ? parse_args+0x14b/0x380 [.] kernel_init_freeable+0x1c1/0x2b0 [.] ? rest_init+0xb0/0xb0 [.] kernel_init+0x16/0x1a0 [.] ret_from_fork+0x2f/0x40 [.] ? rest_init+0xb0/0xb0 [.] ret_from_fork_asm+0x11/0x20 [.] Link: https://lore.kernel.org/all/20231114091658.228030-1-bhe@redhat.com/ Link: https://lkml.kernel.org/r/20241017190347.5578-1-gourry@gourry.net Fixes: 7acf164b259d ("resource: add walk_system_ram_res_rev()") Signed-off-by: Gregory Price Reviewed-by: Dan Williams Acked-by: Baoquan He Cc: AKASHI Takahiro Cc: Andy Shevchenko Cc: Bjorn Helgaas Cc: "Huang, Ying" Cc: Ilpo Järvinen Cc: Mika Westerberg Cc: Thomas Gleixner Cc: Signed-off-by: Andrew Morton --- kernel/resource.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/resource.c b/kernel/resource.c index b730bd28b422..4101016e8b20 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -459,9 +459,7 @@ int walk_system_ram_res_rev(u64 start, u64 end, void *arg, rams_size += 16; } - rams[i].start = res.start; - rams[i++].end = res.end; - + rams[i++] = res; start = res.end + 1; } -- cgit From c4d91e225ff3c9821c85ac6efd8e02c0025c0190 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Oct 2024 15:31:45 +0100 Subject: mm/vma: add expand-only VMA merge mode and optimise do_brk_flags() Patch series "introduce VMA merge mode to improve brk() performance". A ~5% performance regression was discovered on the aim9.brk_test.ops_per_sec by the linux kernel test bot [0]. In the past to satisfy brk() performance we duplicated VMA expansion code and special-cased do_brk_flags(). This is however horrid and undoes work to abstract this logic, so in resolving the issue I have endeavoured to avoid this. Investigating further I was able to observe that the use of a vma_iter_next_range() and vma_prev() pair, causing an unnecessary maple tree walk. In addition there is work that we do that is simply unnecessary for brk(). Therefore, add a special VMA merge mode VMG_FLAG_JUST_EXPAND to avoid doing any of this - it assumes the VMA iterator is pointing at the previous VMA and which skips logic that brk() does not require. This mostly eliminates the performance regression reducing it to ~2% which is in the realm of noise. In addition, the will-it-scale test brk2, written to be more representative of real-world brk() usage, shows a modest performance improvement - which gives me confidence that we are not meaningfully regressing real workloads here. This series includes a test asserting that the 'just expand' mode works as expected. With many thanks to Oliver Sang for helping with performance testing of candidate patch sets! [0]:https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com This patch (of 2): We know in advance that do_brk_flags() wants only to perform a VMA expansion (if the prior VMA is compatible), and that we assume no mergeable VMA follows it. These are the semantics of this function prior to the recent rewrite of the VMA merging logic, however we are now doing more work than necessary - positioning the VMA iterator at the prior VMA and performing tasks that are not required. Add a new field to the vmg struct to permit merge flags and add a new merge flag VMG_FLAG_JUST_EXPAND which implies this behaviour, and have do_brk_flags() use this. This fixes a reported performance regression in a brk() benchmarking suite. 
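As a toy user-space model of what the "just expand" mode amounts to for brk() (hypothetical types, not the mm code itself): the caller guarantees the iterator already sits on the previous VMA and that nothing mergeable follows, so only prev's end needs to grow:

```c
#include <stdbool.h>
#include <stdio.h>

struct toy_vma {
	unsigned long start, end;
};

/* Expand prev over [addr, addr + len) without probing a following VMA. */
bool just_expand(struct toy_vma *prev, unsigned long addr, unsigned long len)
{
	if (prev->end != addr)
		return false;	/* not adjacent: caller falls back to a new VMA */
	prev->end = addr + len;
	return true;
}

int main(void)
{
	struct toy_vma heap = { 0x3000, 0x5000 };

	if (just_expand(&heap, 0x5000, 0x4000))
		printf("heap VMA now [%#lx, %#lx)\n", heap.start, heap.end);
	return 0;
}
```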
Link: https://lkml.kernel.org/r/cover.1729174352.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4e65d4395e5841c5acf8470dbcb714016364fd39.1729174352.git.lorenzo.stoakes@oracle.com Fixes: cacded5e42b9 ("mm: avoid using vma_merge() for new VMAs") Reported-by: kernel test robot Closes: https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com Signed-off-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Cc: Jann Horn Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 3 ++- mm/vma.c | 23 +++++++++++++++-------- mm/vma.h | 14 ++++++++++++++ 3 files changed, 31 insertions(+), 9 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 3f1419460be3..582036922d05 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1756,7 +1756,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr)); vmg.prev = vma; - vma_iter_next_range(vmi); + /* vmi is positioned at prev, which this mode expects. */ + vmg.merge_flags = VMG_FLAG_JUST_EXPAND; if (vma_merge_new_range(&vmg)) goto out; diff --git a/mm/vma.c b/mm/vma.c index 4737afcb064c..b21ffec33f8e 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -917,6 +917,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) pgoff_t pgoff = vmg->pgoff; pgoff_t pglen = PHYS_PFN(end - start); bool can_merge_left, can_merge_right; + bool just_expand = vmg->merge_flags & VMG_FLAG_JUST_EXPAND; mmap_assert_write_locked(vmg->mm); VM_WARN_ON(vmg->vma); @@ -930,7 +931,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) return NULL; can_merge_left = can_vma_merge_left(vmg); - can_merge_right = can_vma_merge_right(vmg, can_merge_left); + can_merge_right = !just_expand && can_vma_merge_right(vmg, can_merge_left); /* If we can merge with the next VMA, adjust vmg accordingly. */ if (can_merge_right) { @@ -953,7 +954,11 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) if (can_merge_right && !can_merge_remove_vma(next)) vmg->end = end; - vma_prev(vmg->vmi); /* Equivalent to going to the previous range */ + /* In expand-only case we are already positioned at prev. */ + if (!just_expand) { + /* Equivalent to going to the previous range. */ + vma_prev(vmg->vmi); + } } /* @@ -967,12 +972,14 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg) } /* If expansion failed, reset state. Allows us to retry merge later. */ - vmg->vma = NULL; - vmg->start = start; - vmg->end = end; - vmg->pgoff = pgoff; - if (vmg->vma == prev) - vma_iter_set(vmg->vmi, start); + if (!just_expand) { + vmg->vma = NULL; + vmg->start = start; + vmg->end = end; + vmg->pgoff = pgoff; + if (vmg->vma == prev) + vma_iter_set(vmg->vmi, start); + } return NULL; } diff --git a/mm/vma.h b/mm/vma.h index ebd78f1577f3..55457cb68200 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -59,6 +59,17 @@ enum vma_merge_state { VMA_MERGE_SUCCESS, }; +enum vma_merge_flags { + VMG_FLAG_DEFAULT = 0, + /* + * If we can expand, simply do so. We know there is nothing to merge to + * the right. Does not reset state upon failure to merge. The VMA + * iterator is assumed to be positioned at the previous VMA, rather than + * at the gap. + */ + VMG_FLAG_JUST_EXPAND = 1 << 0, +}; + /* Represents a VMA merge operation. 
*/ struct vma_merge_struct { struct mm_struct *mm; @@ -75,6 +86,7 @@ struct vma_merge_struct { struct mempolicy *policy; struct vm_userfaultfd_ctx uffd_ctx; struct anon_vma_name *anon_name; + enum vma_merge_flags merge_flags; enum vma_merge_state state; }; @@ -99,6 +111,7 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, .flags = flags_, \ .pgoff = pgoff_, \ .state = VMA_MERGE_START, \ + .merge_flags = VMG_FLAG_DEFAULT, \ } #define VMG_VMA_STATE(name, vmi_, prev_, vma_, start_, end_) \ @@ -118,6 +131,7 @@ static inline pgoff_t vma_pgoff_offset(struct vm_area_struct *vma, .uffd_ctx = vma_->vm_userfaultfd_ctx, \ .anon_name = anon_vma_name(vma_), \ .state = VMA_MERGE_START, \ + .merge_flags = VMG_FLAG_DEFAULT, \ } #ifdef CONFIG_DEBUG_VM_MAPLE_TREE -- cgit From e8133a77999f650495dca9669c49f143d70bb4f6 Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 17 Oct 2024 15:31:46 +0100 Subject: tools: testing: add expand-only mode VMA test Add a test to assert that VMG_FLAG_JUST_EXPAND functions as expected - that is, when the VMA iterator is positioned at the previous VMA and no VMAs proceed it, we observe an expansion with all state as expected. Explicitly place a prior VMA that would otherwise fail this test if the mode were not enabled (as it would traverse to the previous-previous VMA). Link: https://lkml.kernel.org/r/d2f88330254a6448092412bf7dfe077a579ab0dc.1729174352.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Cc: Jann Horn Cc: kernel test robot Cc: Liam R. Howlett Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- tools/testing/vma/vma.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c index c53f220eb6cc..b33b47342d41 100644 --- a/tools/testing/vma/vma.c +++ b/tools/testing/vma/vma.c @@ -1522,6 +1522,45 @@ static bool test_copy_vma(void) return true; } +static bool test_expand_only_mode(void) +{ + unsigned long flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE; + struct mm_struct mm = {}; + VMA_ITERATOR(vmi, &mm, 0); + struct vm_area_struct *vma_prev, *vma; + VMG_STATE(vmg, &mm, &vmi, 0x5000, 0x9000, flags, 5); + + /* + * Place a VMA prior to the one we're expanding so we assert that we do + * not erroneously try to traverse to the previous VMA even though we + * have, through the use of VMG_FLAG_JUST_EXPAND, indicated we do not + * need to do so. + */ + alloc_and_link_vma(&mm, 0, 0x2000, 0, flags); + + /* + * We will be positioned at the prev VMA, but looking to expand to + * 0x9000. 
+ */ + vma_iter_set(&vmi, 0x3000); + vma_prev = alloc_and_link_vma(&mm, 0x3000, 0x5000, 3, flags); + vmg.prev = vma_prev; + vmg.merge_flags = VMG_FLAG_JUST_EXPAND; + + vma = vma_merge_new_range(&vmg); + ASSERT_NE(vma, NULL); + ASSERT_EQ(vma, vma_prev); + ASSERT_EQ(vmg.state, VMA_MERGE_SUCCESS); + ASSERT_EQ(vma->vm_start, 0x3000); + ASSERT_EQ(vma->vm_end, 0x9000); + ASSERT_EQ(vma->vm_pgoff, 3); + ASSERT_TRUE(vma_write_started(vma)); + ASSERT_EQ(vma_iter_addr(&vmi), 0x3000); + + cleanup_mm(&mm, &vmi); + return true; +} + int main(void) { int num_tests = 0, num_fail = 0; @@ -1553,6 +1592,7 @@ int main(void) TEST(vmi_prealloc_fail); TEST(merge_extend); TEST(copy_vma); + TEST(expand_only_mode); #undef TEST -- cgit From 5bb1f4c9340e01003b00b94d539eadb0da88f48e Mon Sep 17 00:00:00 2001 From: Edward Liaw Date: Fri, 18 Oct 2024 17:17:22 +0000 Subject: Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM" Patch series "selftests/mm: revert pthread_barrier change" On Android arm, pthread_create followed by a fork caused a deadlock in the case where the fork required work to be completed by the created thread. The previous patches incorrectly assumed that the parent would always initialize the pthread_barrier for the child thread. This reverts the change and replaces the fix for wp-fork-with-event with the original use of atomic_bool. This patch (of 3): This reverts commit e142cc87ac4ec618f2ccf5f68aedcd6e28a59d9d. fork_event_consumer may be called by other tests that do not initialize the pthread_barrier, so this approach is not correct. The subsequent patch will revert to using atomic_bool instead. Link: https://lkml.kernel.org/r/20241018171734.2315053-1-edliaw@google.com Link: https://lkml.kernel.org/r/20241018171734.2315053-2-edliaw@google.com Fixes: e142cc87ac4e ("fix deadlock for fork after pthread_create on ARM") Signed-off-by: Edward Liaw Cc: Ryan Roberts Cc: Peter Xu Cc: Shuah Khan Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-unit-tests.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index c8a3b1c7edff..3db2296ac631 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -241,9 +241,6 @@ static void *fork_event_consumer(void *data) fork_event_args *args = data; struct uffd_msg msg = { 0 }; - /* Ready for parent thread to fork */ - pthread_barrier_wait(&ready_for_fork); - /* Read until a full msg received */ while (uffd_read_msg(args->parent_uffd, &msg)); @@ -311,12 +308,8 @@ static int pagemap_test_fork(int uffd, bool with_event, bool test_pin) /* Prepare a thread to resolve EVENT_FORK */ if (with_event) { - pthread_barrier_init(&ready_for_fork, NULL, 2); if (pthread_create(&thread, NULL, fork_event_consumer, &args)) err("pthread_create()"); - /* Wait for child thread to start before forking */ - pthread_barrier_wait(&ready_for_fork); - pthread_barrier_destroy(&ready_for_fork); } child = fork(); -- cgit From 3673167a3a07f25b3f06754d69f406edea65543a Mon Sep 17 00:00:00 2001 From: Edward Liaw Date: Fri, 18 Oct 2024 17:17:23 +0000 Subject: Revert "selftests/mm: replace atomic_bool with pthread_barrier_t" This reverts commit e61ef21e27e8deed8c474e9f47f4aa7bc37e138c. uffd_poll_thread may be called by other tests that do not initialize the pthread_barrier, so this approach is not correct. This will revert to using atomic_bool instead. 
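For reference, a minimal stand-alone C11 sketch of the handshake the selftests go back to (hypothetical program, independent of uffd-common.c): the child thread publishes an atomic flag and the parent busy-waits on it before fork(), with atomic_bool making the unsynchronised cross-thread access well defined:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool ready_for_fork;

static void *poll_thread(void *arg)
{
	atomic_store(&ready_for_fork, true);	/* "I am running, safe to fork" */
	pause();				/* stands in for the real poll loop */
	return NULL;
}

int main(void)
{
	pthread_t thread;

	if (pthread_create(&thread, NULL, poll_thread, NULL))
		return 1;
	while (!atomic_load(&ready_for_fork))
		;	/* busy-wait, mirroring the selftest */
	printf("poll thread is up, parent may fork() now\n");
	return 0;
}
```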
Link: https://lkml.kernel.org/r/20241018171734.2315053-3-edliaw@google.com Fixes: e61ef21e27e8 ("selftests/mm: replace atomic_bool with pthread_barrier_t") Signed-off-by: Edward Liaw Cc: Ryan Roberts Cc: Peter Xu Cc: Shuah Khan Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-common.c | 5 ++--- tools/testing/selftests/mm/uffd-common.h | 3 ++- tools/testing/selftests/mm/uffd-unit-tests.c | 14 ++++++-------- 3 files changed, 10 insertions(+), 12 deletions(-) diff --git a/tools/testing/selftests/mm/uffd-common.c b/tools/testing/selftests/mm/uffd-common.c index 852e7281026e..717539eddf98 100644 --- a/tools/testing/selftests/mm/uffd-common.c +++ b/tools/testing/selftests/mm/uffd-common.c @@ -18,7 +18,7 @@ bool test_uffdio_wp = true; unsigned long long *count_verify; uffd_test_ops_t *uffd_test_ops; uffd_test_case_ops_t *uffd_test_case_ops; -pthread_barrier_t ready_for_fork; +atomic_bool ready_for_fork; static int uffd_mem_fd_create(off_t mem_size, bool hugetlb) { @@ -519,8 +519,7 @@ void *uffd_poll_thread(void *arg) pollfd[1].fd = pipefd[cpu*2]; pollfd[1].events = POLLIN; - /* Ready for parent thread to fork */ - pthread_barrier_wait(&ready_for_fork); + ready_for_fork = true; for (;;) { ret = poll(pollfd, 2, -1); diff --git a/tools/testing/selftests/mm/uffd-common.h b/tools/testing/selftests/mm/uffd-common.h index 3e6228d8e0dc..a70ae10b5f62 100644 --- a/tools/testing/selftests/mm/uffd-common.h +++ b/tools/testing/selftests/mm/uffd-common.h @@ -33,6 +33,7 @@ #include #include #include +#include #include "../kselftest.h" #include "vm_util.h" @@ -104,7 +105,7 @@ extern bool map_shared; extern bool test_uffdio_wp; extern unsigned long long *count_verify; extern volatile bool test_uffdio_copy_eexist; -extern pthread_barrier_t ready_for_fork; +extern atomic_bool ready_for_fork; extern uffd_test_ops_t anon_uffd_test_ops; extern uffd_test_ops_t shmem_uffd_test_ops; diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index 3db2296ac631..b3d21eed203d 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -774,7 +774,7 @@ static void uffd_sigbus_test_common(bool wp) char c; struct uffd_args args = { 0 }; - pthread_barrier_init(&ready_for_fork, NULL, 2); + ready_for_fork = false; fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); @@ -791,9 +791,8 @@ static void uffd_sigbus_test_common(bool wp) if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &args)) err("uffd_poll_thread create"); - /* Wait for child thread to start before forking */ - pthread_barrier_wait(&ready_for_fork); - pthread_barrier_destroy(&ready_for_fork); + while (!ready_for_fork) + ; /* Wait for the poll_thread to start executing before forking */ pid = fork(); if (pid < 0) @@ -834,7 +833,7 @@ static void uffd_events_test_common(bool wp) char c; struct uffd_args args = { 0 }; - pthread_barrier_init(&ready_for_fork, NULL, 2); + ready_for_fork = false; fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); if (uffd_register(uffd, area_dst, nr_pages * page_size, @@ -845,9 +844,8 @@ static void uffd_events_test_common(bool wp) if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &args)) err("uffd_poll_thread create"); - /* Wait for child thread to start before forking */ - pthread_barrier_wait(&ready_for_fork); - pthread_barrier_destroy(&ready_for_fork); + while (!ready_for_fork) + ; /* Wait for the poll_thread to start executing before forking */ pid = fork(); if (pid < 0) -- cgit From f2330b650e97a68c1afce66305f10651a9544037 
Mon Sep 17 00:00:00 2001 From: Edward Liaw Date: Fri, 18 Oct 2024 17:17:24 +0000 Subject: selftests/mm: fix deadlock for fork after pthread_create with atomic_bool Some additional synchronization is needed on Android ARM64; we see a deadlock with pthread_create when the parent thread races forward before the child has a chance to start doing work. Link: https://lkml.kernel.org/r/20241018171734.2315053-4-edliaw@google.com Fixes: cff294582798 ("selftests/mm: extend and rename uffd pagemap test") Signed-off-by: Edward Liaw Cc: Ryan Roberts Cc: Peter Xu Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-unit-tests.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index b3d21eed203d..a2e71b1636e7 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -241,6 +241,8 @@ static void *fork_event_consumer(void *data) fork_event_args *args = data; struct uffd_msg msg = { 0 }; + ready_for_fork = true; + /* Read until a full msg received */ while (uffd_read_msg(args->parent_uffd, &msg)); @@ -308,8 +310,11 @@ static int pagemap_test_fork(int uffd, bool with_event, bool test_pin) /* Prepare a thread to resolve EVENT_FORK */ if (with_event) { + ready_for_fork = false; if (pthread_create(&thread, NULL, fork_event_consumer, &args)) err("pthread_create()"); + while (!ready_for_fork) + ; /* Wait for the poll_thread to start executing before forking */ } child = fork(); -- cgit From 58a039e679fe72bd0efa8b2abe669a7914bb4429 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 18 Oct 2024 18:14:15 +0200 Subject: mm: split critical region in remap_file_pages() and invoke LSMs in between Commit ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") fixed a security issue, it added an LSM check when trying to remap file pages, so that LSMs have the opportunity to evaluate such action like for other memory operations such as mmap() and mprotect(). However, that commit called security_mmap_file() inside the mmap_lock lock, while the other calls do it before taking the lock, after commit 8b3ec6814c83 ("take security_mmap_file() outside of ->mmap_sem"). This caused lock inversion issue with IMA which was taking the mmap_lock and i_mutex lock in the opposite way when the remap_file_pages() system call was called. Solve the issue by splitting the critical region in remap_file_pages() in two regions: the first takes a read lock of mmap_lock, retrieves the VMA and the file descriptor associated, and calculates the 'prot' and 'flags' variables; the second takes a write lock on mmap_lock, checks that the VMA flags and the VMA file descriptor are the same as the ones obtained in the first critical region (otherwise the system call fails), and calls do_mmap(). In between, after releasing the read lock and before taking the write lock, call security_mmap_file(), and solve the lock inversion issue. Link: https://lkml.kernel.org/r/20241018161415.3845146-1-roberto.sassu@huaweicloud.com Fixes: ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") Signed-off-by: Kirill A. 
Shutemov Signed-off-by: Roberto Sassu Reported-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-security-module/66f7b10e.050a0220.46d20.0036.GAE@google.com/ Tested-by: Roberto Sassu Reviewed-by: Roberto Sassu Reviewed-by: Jann Horn Reviewed-by: Lorenzo Stoakes Reviewed-by: Liam R. Howlett Reviewed-by: Paul Moore Tested-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Cc: Jarkko Sakkinen Cc: Dmitry Kasatkin Cc: Eric Snowberg Cc: James Morris Cc: Mimi Zohar Cc: "Serge E. Hallyn" Cc: Shu Han Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- mm/mmap.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 52 insertions(+), 17 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 582036922d05..1e0e34cb993f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1642,6 +1642,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, unsigned long populate = 0; unsigned long ret = -EINVAL; struct file *file; + vm_flags_t vm_flags; pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/mm/remap_file_pages.rst.\n", current->comm, current->pid); @@ -1658,12 +1659,60 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, if (pgoff + (size >> PAGE_SHIFT) < pgoff) return ret; - if (mmap_write_lock_killable(mm)) + if (mmap_read_lock_killable(mm)) return -EINTR; + /* + * Look up VMA under read lock first so we can perform the security + * without holding locks (which can be problematic). We reacquire a + * write lock later and check nothing changed underneath us. + */ vma = vma_lookup(mm, start); - if (!vma || !(vma->vm_flags & VM_SHARED)) + if (!vma || !(vma->vm_flags & VM_SHARED)) { + mmap_read_unlock(mm); + return -EINVAL; + } + + prot |= vma->vm_flags & VM_READ ? PROT_READ : 0; + prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0; + prot |= vma->vm_flags & VM_EXEC ? PROT_EXEC : 0; + + flags &= MAP_NONBLOCK; + flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE; + if (vma->vm_flags & VM_LOCKED) + flags |= MAP_LOCKED; + + /* Save vm_flags used to calculate prot and flags, and recheck later. */ + vm_flags = vma->vm_flags; + file = get_file(vma->vm_file); + + mmap_read_unlock(mm); + + /* Call outside mmap_lock to be consistent with other callers. */ + ret = security_mmap_file(file, prot, flags); + if (ret) { + fput(file); + return ret; + } + + ret = -EINVAL; + + /* OK security check passed, take write lock + let it rip. */ + if (mmap_write_lock_killable(mm)) { + fput(file); + return -EINTR; + } + + vma = vma_lookup(mm, start); + + if (!vma) + goto out; + + /* Make sure things didn't change under us. */ + if (vma->vm_flags != vm_flags) + goto out; + if (vma->vm_file != file) goto out; if (start + size > vma->vm_end) { @@ -1691,25 +1740,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, goto out; } - prot |= vma->vm_flags & VM_READ ? PROT_READ : 0; - prot |= vma->vm_flags & VM_WRITE ? PROT_WRITE : 0; - prot |= vma->vm_flags & VM_EXEC ? 
PROT_EXEC : 0; - - flags &= MAP_NONBLOCK; - flags |= MAP_SHARED | MAP_FIXED | MAP_POPULATE; - if (vma->vm_flags & VM_LOCKED) - flags |= MAP_LOCKED; - - file = get_file(vma->vm_file); - ret = security_mmap_file(vma->vm_file, prot, flags); - if (ret) - goto out_fput; ret = do_mmap(vma->vm_file, start, size, prot, flags, 0, pgoff, &populate, NULL); -out_fput: - fput(file); out: mmap_write_unlock(mm); + fput(file); if (populate) mm_populate(ret, populate); if (!IS_ERR_VALUE(ret)) -- cgit From 183430079869fcb4b2967800d7659bbeb6052d07 Mon Sep 17 00:00:00 2001 From: Jeff Xu Date: Tue, 8 Oct 2024 04:09:41 +0000 Subject: mseal: update mseal.rst MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces the can_modify_mm() function with an in-loop check, necessitates an update to the mseal.rst documentation to reflect this change. Furthermore, the document has received offline comments regarding the code sample and suggestions for sentence clarification to enhance reader comprehension. [1] https://lore.kernel.org/linux-mm/20240817-mseal-depessimize-v3-0-d8d2e037df30@gmail.com/ Update doc after in-loop change: mprotect/madvise can have partially updated and munmap is atomic. Fix indentation and clarify some sections to improve readability. Link: https://lkml.kernel.org/r/20241008040942.1478931-2-jeffxu@chromium.org Fixes: df2a7df9a9aa ("mm/munmap: replace can_modify_mm with can_modify_vma") Fixes: 4a2dd02b0916 ("mm/mprotect: replace can_modify_mm with can_modify_vma") Fixes: 38075679b5f1 ("mm/mremap: replace can_modify_mm with can_modify_vma") Fixes: 23c57d1fa2b9 ("mseal: replace can_modify_mm_madv with a vma variant") Signed-off-by: Jeff Xu Reviewed-by: Randy Dunlap Cc: Elliott Hughes Cc: Greg Kroah-Hartman Cc: Guenter Roeck Cc: Jann Horn Cc: Jonathan Corbet Cc: Jorge Lucangeli Obes Cc: Kees Cook Cc: "Liam R. Howlett" Cc: Linus Torvalds Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: Muhammad Usama Anjum Cc: Pedro Falcato Cc: Stephen Röttger Cc: Suren Baghdasaryan Cc: "Theo de Raadt" Signed-off-by: Andrew Morton --- Documentation/userspace-api/mseal.rst | 307 ++++++++++++++++------------------ 1 file changed, 148 insertions(+), 159 deletions(-) diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst index 4132eec995a3..41102f74c5e2 100644 --- a/Documentation/userspace-api/mseal.rst +++ b/Documentation/userspace-api/mseal.rst @@ -23,177 +23,166 @@ applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2]. -User API -======== -mseal() ------------ -The mseal() syscall has the following signature: - -``int mseal(void addr, size_t len, unsigned long flags)`` - -**addr/len**: virtual memory address range. - -The address range set by ``addr``/``len`` must meet: - - The start address must be in an allocated VMA. - - The start address must be page aligned. - - The end address (``addr`` + ``len``) must be in an allocated VMA. - - no gap (unallocated memory) between start and end address. - -The ``len`` will be paged aligned implicitly by the kernel. - -**flags**: reserved for future use. - -**return values**: - -- ``0``: Success. - -- ``-EINVAL``: - - Invalid input ``flags``. - - The start address (``addr``) is not page aligned. - - Address range (``addr`` + ``len``) overflow. 
- -- ``-ENOMEM``: - - The start address (``addr``) is not allocated. - - The end address (``addr`` + ``len``) is not allocated. - - A gap (unallocated memory) between start and end address. - -- ``-EPERM``: - - sealing is supported only on 64-bit CPUs, 32-bit is not supported. - -- For above error cases, users can expect the given memory range is - unmodified, i.e. no partial update. - -- There might be other internal errors/cases not listed here, e.g. - error during merging/splitting VMAs, or the process reaching the max - number of supported VMAs. In those cases, partial updates to the given - memory range could happen. However, those cases should be rare. - -**Blocked operations after sealing**: - Unmapping, moving to another location, and shrinking the size, - via munmap() and mremap(), can leave an empty space, therefore - can be replaced with a VMA with a new set of attributes. - - Moving or expanding a different VMA into the current location, - via mremap(). - - Modifying a VMA via mmap(MAP_FIXED). - - Size expansion, via mremap(), does not appear to pose any - specific risks to sealed VMAs. It is included anyway because - the use case is unclear. In any case, users can rely on - merging to expand a sealed VMA. - - mprotect() and pkey_mprotect(). - - Some destructive madvice() behaviors (e.g. MADV_DONTNEED) - for anonymous memory, when users don't have write permission to the - memory. Those behaviors can alter region contents by discarding pages, - effectively a memset(0) for anonymous memory. - - Kernel will return -EPERM for blocked operations. - - For blocked operations, one can expect the given address is unmodified, - i.e. no partial update. Note, this is different from existing mm - system call behaviors, where partial updates are made till an error is - found and returned to userspace. To give an example: - - Assume following code sequence: - - - ptr = mmap(null, 8192, PROT_NONE); - - munmap(ptr + 4096, 4096); - - ret1 = mprotect(ptr, 8192, PROT_READ); - - mseal(ptr, 4096); - - ret2 = mprotect(ptr, 8192, PROT_NONE); - - ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ. - - ret2 will be -EPERM, the page remains to be PROT_READ. - -**Note**: - -- mseal() only works on 64-bit CPUs, not 32-bit CPU. - -- users can call mseal() multiple times, mseal() on an already sealed memory - is a no-action (not error). - -- munseal() is not supported. - -Use cases: -========== +SYSCALL +======= +mseal syscall signature +----------------------- + ``int mseal(void \* addr, size_t len, unsigned long flags)`` + + **addr**/**len**: virtual memory address range. + The address range set by **addr**/**len** must meet: + - The start address must be in an allocated VMA. + - The start address must be page aligned. + - The end address (**addr** + **len**) must be in an allocated VMA. + - no gap (unallocated memory) between start and end address. + + The ``len`` will be paged aligned implicitly by the kernel. + + **flags**: reserved for future use. + + **Return values**: + - **0**: Success. + - **-EINVAL**: + * Invalid input ``flags``. + * The start address (``addr``) is not page aligned. + * Address range (``addr`` + ``len``) overflow. + - **-ENOMEM**: + * The start address (``addr``) is not allocated. + * The end address (``addr`` + ``len``) is not allocated. + * A gap (unallocated memory) between start and end address. + - **-EPERM**: + * sealing is supported only on 64-bit CPUs, 32-bit is not supported. 
+ + **Note about error return**: + - For above error cases, users can expect the given memory range is + unmodified, i.e. no partial update. + - There might be other internal errors/cases not listed here, e.g. + error during merging/splitting VMAs, or the process reaching the maximum + number of supported VMAs. In those cases, partial updates to the given + memory range could happen. However, those cases should be rare. + + **Architecture support**: + mseal only works on 64-bit CPUs, not 32-bit CPUs. + + **Idempotent**: + users can call mseal multiple times. mseal on an already sealed memory + is a no-action (not error). + + **no munseal** + Once mapping is sealed, it can't be unsealed. The kernel should never + have munseal, this is consistent with other sealing feature, e.g. + F_SEAL_SEAL for file. + +Blocked mm syscall for sealed mapping +------------------------------------- + It might be important to note: **once the mapping is sealed, it will + stay in the process's memory until the process terminates**. + + Example:: + + *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + rc = mseal(ptr, 4096, 0); + /* munmap will fail */ + rc = munmap(ptr, 4096); + assert(rc < 0); + + Blocked mm syscall: + - munmap + - mmap + - mremap + - mprotect and pkey_mprotect + - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE, + MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK + + The first set of syscalls to block is munmap, mremap, mmap. They can + either leave an empty space in the address space, therefore allowing + replacement with a new mapping with new set of attributes, or can + overwrite the existing mapping with another mapping. + + mprotect and pkey_mprotect are blocked because they changes the + protection bits (RWX) of the mapping. + + Certain destructive madvise behaviors, specifically MADV_DONTNEED, + MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce + risks when applied to anonymous memory by threads lacking write + permissions. Consequently, these operations are prohibited under such + conditions. The aforementioned behaviors have the potential to modify + region contents by discarding pages, effectively performing a memset(0) + operation on the anonymous memory. + + Kernel will return -EPERM for blocked syscalls. + + When blocked syscall return -EPERM due to sealing, the memory regions may + or may not be changed, depends on the syscall being blocked: + + - munmap: munmap is atomic. If one of VMAs in the given range is + sealed, none of VMAs are updated. + - mprotect, pkey_mprotect, madvise: partial update might happen, e.g. + when mprotect over multiple VMAs, mprotect might update the beginning + VMAs before reaching the sealed VMA and return -EPERM. + - mmap and mremap: undefined behavior. + +Use cases +========= - glibc: The dynamic linker, during loading ELF executables, can apply sealing to - non-writable memory segments. - -- Chrome browser: protect some security sensitive data-structures. + mapping segments. -Notes on which memory to seal: -============================== +- Chrome browser: protect some security sensitive data structures. -It might be important to note that sealing changes the lifetime of a mapping, -i.e. the sealed mapping won’t be unmapped till the process terminates or the -exec system call is invoked. Applications can apply sealing to any virtual -memory region from userspace, but it is crucial to thoroughly analyze the -mapping's lifetime prior to apply the sealing. 
+When not to use mseal +===================== +Applications can apply sealing to any virtual memory region from userspace, +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to +apply the sealing. This is because the sealed mapping *won’t be unmapped* +until the process terminates or the exec system call is invoked. For example: + - aio/shm + aio/shm can call mmap and munmap on behalf of userspace, e.g. + ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to + the lifetime of the process. If those memories are sealed from userspace, + then munmap will fail, causing leaks in VMA address space during the + lifetime of the process. + + - ptr allocated by malloc (heap) + Don't use mseal on the memory ptr return from malloc(). + malloc() is implemented by allocator, e.g. by glibc. Heap manager might + allocate a ptr from brk or mapping created by mmap. + If an app calls mseal on a ptr returned from malloc(), this can affect + the heap manager's ability to manage the mappings; the outcome is + non-deterministic. + + Example:: + + ptr = malloc(size); + /* don't call mseal on ptr return from malloc. */ + mseal(ptr, size); + /* free will success, allocator can't shrink heap lower than ptr */ + free(ptr); + +mseal doesn't block +=================== +In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's +attributes, such as protection bits (RWX). Sealed mappings doesn't mean the +memory is immutable. -- aio/shm - - aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in - shm.c. The lifetime of those mapping are not tied to the lifetime of the - process. If those memories are sealed from userspace, then munmap() will fail, - causing leaks in VMA address space during the lifetime of the process. - -- Brk (heap) - - Currently, userspace applications can seal parts of the heap by calling - malloc() and mseal(). - let's assume following calls from user space: - - - ptr = malloc(size); - - mprotect(ptr, size, RO); - - mseal(ptr, size); - - free(ptr); - - Technically, before mseal() is added, the user can change the protection of - the heap by calling mprotect(RO). As long as the user changes the protection - back to RW before free(), the memory range can be reused. - - Adding mseal() into the picture, however, the heap is then sealed partially, - the user can still free it, but the memory remains to be RO. If the address - is re-used by the heap manager for another malloc, the process might crash - soon after. Therefore, it is important not to apply sealing to any memory - that might get recycled. - - Furthermore, even if the application never calls the free() for the ptr, - the heap manager may invoke the brk system call to shrink the size of the - heap. In the kernel, the brk-shrink will call munmap(). Consequently, - depending on the location of the ptr, the outcome of brk-shrink is - nondeterministic. - - -Additional notes: -================= As Jann Horn pointed out in [3], there are still a few ways to write -to RO memory, which is, in a way, by design. Those cases are not covered -by mseal(). If applications want to block such cases, sandbox tools (such as -seccomp, LSM, etc) might be considered. +to RO memory, which is, in a way, by design. And those could be blocked +by different security measures. Those cases are: -- Write to read-only memory through /proc/self/mem interface. -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT). -- userfaultfd. 
+ - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE). + - Write to read-only memory through ptrace (such as PTRACE_POKETEXT). + - userfaultfd. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [4]. Chrome browser in ChromeOS will be the first user of this API. -Reference: -========== -[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 - -[2] https://man.openbsd.org/mimmutable.2 - -[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com - -[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc +Reference +========= +- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 +- [2] https://man.openbsd.org/mimmutable.2 +- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com +- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc -- cgit From 01626a18230246efdcea322aa8f067e60ffe5ccd Mon Sep 17 00:00:00 2001 From: Barry Song Date: Fri, 27 Sep 2024 09:19:36 +1200 Subject: mm: avoid unconditional one-tick sleep when swapcache_prepare fails Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") introduced an unconditional one-tick sleep when `swapcache_prepare()` fails, which has led to reports of UI stuttering on latency-sensitive Android devices. To address this, we can use a waitqueue to wake up tasks that fail `swapcache_prepare()` sooner, instead of always sleeping for a full tick. While tasks may occasionally be woken by an unrelated `do_swap_page()`, this method is preferable to two scenarios: rapid re-entry into page faults, which can cause livelocks, and multiple millisecond sleeps, which visibly degrade user experience. Oven's testing shows that a single waitqueue resolves the UI stuttering issue. If a 'thundering herd' problem becomes apparent later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit locks can be introduced. [v-songbaohua@oppo.com: wake_up only when swapcache_wq waitqueue is active] Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") Signed-off-by: Barry Song Reported-by: Oven Liyang Tested-by: Oven Liyang Cc: Kairui Song Cc: "Huang, Ying" Cc: Yu Zhao Cc: David Hildenbrand Cc: Chris Li Cc: Hugh Dickins Cc: Johannes Weiner Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Minchan Kim Cc: Yosry Ahmed Cc: SeongJae Park Cc: Kalesh Singh Cc: Suren Baghdasaryan Cc: Signed-off-by: Andrew Morton --- mm/memory.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3ccee51adfbb..bdf77a3ec47b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4187,6 +4187,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq); + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. 
@@ -4199,6 +4201,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct folio *swapcache, *folio = NULL; + DECLARE_WAITQUEUE(wait, current); struct page *page; struct swap_info_struct *si = NULL; rmap_t rmap_flags = RMAP_NONE; @@ -4297,7 +4300,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * Relax a bit to prevent rapid * repeated page faults. */ + add_wait_queue(&swapcache_wq, &wait); schedule_timeout_uninterruptible(1); + remove_wait_queue(&swapcache_wq, &wait); goto out_page; } need_clear_cache = true; @@ -4604,8 +4609,11 @@ unlock: pte_unmap_unlock(vmf->pte, vmf->ptl); out: /* Clear the swap cache pin for direct swapin after PTL unlock */ - if (need_clear_cache) + if (need_clear_cache) { swapcache_clear(si, entry, nr_pages); + if (waitqueue_active(&swapcache_wq)) + wake_up(&swapcache_wq); + } if (si) put_swap_device(si); return ret; @@ -4620,8 +4628,11 @@ out_release: folio_unlock(swapcache); folio_put(swapcache); } - if (need_clear_cache) + if (need_clear_cache) { swapcache_clear(si, entry, nr_pages); + if (waitqueue_active(&swapcache_wq)) + wake_up(&swapcache_wq); + } if (si) put_swap_device(si); return ret; -- cgit From ef5fbdf732a158ec27eeba69d8be851351f29f73 Mon Sep 17 00:00:00 2001 From: Piyush Raj Chouhan Date: Mon, 28 Oct 2024 15:55:16 +0000 Subject: ALSA: hda/realtek: Add subwoofer quirk for Infinix ZERO BOOK 13 Infinix ZERO BOOK 13 has a 2+2 speaker system which isn't probed correctly. This patch adds a quirk with the proper pin connections. Also The mic in this laptop suffers too high gain resulting in mostly fan noise being recorded, This patch Also limit mic boost. HW Probe for device; https://linux-hardware.org/?probe=a2e892c47b Test: All 4 speaker works, Mic has low noise. Signed-off-by: Piyush Raj Chouhan Link: https://patch.msgid.link/20241028155516.15552-1-piyuschouhan1598@gmail.com Signed-off-by: Takashi Iwai --- sound/pci/hda/patch_realtek.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c index 784ac058418f..7f4926194e50 100644 --- a/sound/pci/hda/patch_realtek.c +++ b/sound/pci/hda/patch_realtek.c @@ -7552,6 +7552,7 @@ enum { ALC290_FIXUP_SUBWOOFER_HSJACK, ALC269_FIXUP_THINKPAD_ACPI, ALC269_FIXUP_DMIC_THINKPAD_ACPI, + ALC269VB_FIXUP_INFINIX_ZERO_BOOK_13, ALC269VB_FIXUP_CHUWI_COREBOOK_XPRO, ALC255_FIXUP_ACER_MIC_NO_PRESENCE, ALC255_FIXUP_ASUS_MIC_NO_PRESENCE, @@ -7998,6 +7999,16 @@ static const struct hda_fixup alc269_fixups[] = { .type = HDA_FIXUP_FUNC, .v.func = alc269_fixup_pincfg_U7x7_headset_mic, }, + [ALC269VB_FIXUP_INFINIX_ZERO_BOOK_13] = { + .type = HDA_FIXUP_PINS, + .v.pins = (const struct hda_pintbl[]) { + { 0x14, 0x90170151 }, /* use as internal speaker (LFE) */ + { 0x1b, 0x90170152 }, /* use as internal speaker (back) */ + { } + }, + .chained = true, + .chain_id = ALC269_FIXUP_LIMIT_INT_MIC_BOOST + }, [ALC269VB_FIXUP_CHUWI_COREBOOK_XPRO] = { .type = HDA_FIXUP_PINS, .v.pins = (const struct hda_pintbl[]) { @@ -11003,6 +11014,7 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = { SND_PCI_QUIRK(0x1d72, 0x1945, "Redmi G", ALC256_FIXUP_ASUS_HEADSET_MIC), SND_PCI_QUIRK(0x1d72, 0x1947, "RedmiBook Air", ALC255_FIXUP_XIAOMI_HEADSET_MIC), SND_PCI_QUIRK(0x2782, 0x0214, "VAIO VJFE-CL", ALC269_FIXUP_LIMIT_INT_MIC_BOOST), + SND_PCI_QUIRK(0x2782, 0x0228, "Infinix ZERO BOOK 13", ALC269VB_FIXUP_INFINIX_ZERO_BOOK_13), SND_PCI_QUIRK(0x2782, 0x0232, "CHUWI CoreBook XPro", ALC269VB_FIXUP_CHUWI_COREBOOK_XPRO), SND_PCI_QUIRK(0x2782, 0x1707, "Vaio 
VJFE-ADL", ALC298_FIXUP_SPK_VOLUME), SND_PCI_QUIRK(0x8086, 0x2074, "Intel NUC 8", ALC233_FIXUP_INTEL_NUC8_DMIC), -- cgit From 704573851b51808b45dae2d62059d1d8189138a2 Mon Sep 17 00:00:00 2001 From: Qun-Wei Lin Date: Fri, 25 Oct 2024 16:58:11 +0800 Subject: mm: krealloc: Fix MTE false alarm in __do_krealloc This patch addresses an issue introduced by commit 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") which causes MTE (Memory Tagging Extension) to falsely report a slab-out-of-bounds error. The problem occurs when zeroing out spare memory in __do_krealloc. The original code only considered software-based KASAN and did not account for MTE. It does not reset the KASAN tag before calling memset, leading to a mismatch between the pointer tag and the memory tag, resulting in a false positive. Example of the error: ================================================================== swapper/0: BUG: KASAN: slab-out-of-bounds in __memset+0x84/0x188 swapper/0: Write at addr f4ffff8005f0fdf0 by task swapper/0/1 swapper/0: Pointer tag: [f4], memory tag: [fe] swapper/0: swapper/0: CPU: 4 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12. swapper/0: Hardware name: MT6991(ENG) (DT) swapper/0: Call trace: swapper/0: dump_backtrace+0xfc/0x17c swapper/0: show_stack+0x18/0x28 swapper/0: dump_stack_lvl+0x40/0xa0 swapper/0: print_report+0x1b8/0x71c swapper/0: kasan_report+0xec/0x14c swapper/0: __do_kernel_fault+0x60/0x29c swapper/0: do_bad_area+0x30/0xdc swapper/0: do_tag_check_fault+0x20/0x34 swapper/0: do_mem_abort+0x58/0x104 swapper/0: el1_abort+0x3c/0x5c swapper/0: el1h_64_sync_handler+0x80/0xcc swapper/0: el1h_64_sync+0x68/0x6c swapper/0: __memset+0x84/0x188 swapper/0: btf_populate_kfunc_set+0x280/0x3d8 swapper/0: __register_btf_kfunc_id_set+0x43c/0x468 swapper/0: register_btf_kfunc_id_set+0x48/0x60 swapper/0: register_nf_nat_bpf+0x1c/0x40 swapper/0: nf_nat_init+0xc0/0x128 swapper/0: do_one_initcall+0x184/0x464 swapper/0: do_initcall_level+0xdc/0x1b0 swapper/0: do_initcalls+0x70/0xc0 swapper/0: do_basic_setup+0x1c/0x28 swapper/0: kernel_init_freeable+0x144/0x1b8 swapper/0: kernel_init+0x20/0x1a8 swapper/0: ret_from_fork+0x10/0x20 ================================================================== Fixes: 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") Signed-off-by: Qun-Wei Lin Acked-by: David Rientjes Signed-off-by: Vlastimil Babka --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 3d26c257ed8b..552b92dfdac7 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1209,7 +1209,7 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags) /* Zero out spare memory. */ if (want_init_on_alloc(flags)) { kasan_disable_current(); - memset((void *)p + new_size, 0, ks - new_size); + memset(kasan_reset_tag(p) + new_size, 0, ks - new_size); kasan_enable_current(); } -- cgit From f84ef58e553206b02d06e02158c98fbccba25d19 Mon Sep 17 00:00:00 2001 From: Ley Foon Tan Date: Mon, 21 Oct 2024 13:46:25 +0800 Subject: net: stmmac: dwmac4: Fix high address display by updating reg_space[] from register values The high address will display as 0 if the driver does not set the reg_space[]. To fix this, read the high address registers and update the reg_space[] accordingly. 
Fixes: fbf68229ffe7 ("net: stmmac: unify registers dumps methods") Signed-off-by: Ley Foon Tan Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241021054625.1791965-1-leyfoon.tan@starfivetech.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 8 ++++++++ drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h | 2 ++ 2 files changed, 10 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c index e0165358c4ac..77b35abc6f6f 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c @@ -203,8 +203,12 @@ static void _dwmac4_dump_dma_regs(struct stmmac_priv *priv, readl(ioaddr + DMA_CHAN_TX_CONTROL(dwmac4_addrs, channel)); reg_space[DMA_CHAN_RX_CONTROL(default_addrs, channel) / 4] = readl(ioaddr + DMA_CHAN_RX_CONTROL(dwmac4_addrs, channel)); + reg_space[DMA_CHAN_TX_BASE_ADDR_HI(default_addrs, channel) / 4] = + readl(ioaddr + DMA_CHAN_TX_BASE_ADDR_HI(dwmac4_addrs, channel)); reg_space[DMA_CHAN_TX_BASE_ADDR(default_addrs, channel) / 4] = readl(ioaddr + DMA_CHAN_TX_BASE_ADDR(dwmac4_addrs, channel)); + reg_space[DMA_CHAN_RX_BASE_ADDR_HI(default_addrs, channel) / 4] = + readl(ioaddr + DMA_CHAN_RX_BASE_ADDR_HI(dwmac4_addrs, channel)); reg_space[DMA_CHAN_RX_BASE_ADDR(default_addrs, channel) / 4] = readl(ioaddr + DMA_CHAN_RX_BASE_ADDR(dwmac4_addrs, channel)); reg_space[DMA_CHAN_TX_END_ADDR(default_addrs, channel) / 4] = @@ -225,8 +229,12 @@ static void _dwmac4_dump_dma_regs(struct stmmac_priv *priv, readl(ioaddr + DMA_CHAN_CUR_TX_DESC(dwmac4_addrs, channel)); reg_space[DMA_CHAN_CUR_RX_DESC(default_addrs, channel) / 4] = readl(ioaddr + DMA_CHAN_CUR_RX_DESC(dwmac4_addrs, channel)); + reg_space[DMA_CHAN_CUR_TX_BUF_ADDR_HI(default_addrs, channel) / 4] = + readl(ioaddr + DMA_CHAN_CUR_TX_BUF_ADDR_HI(dwmac4_addrs, channel)); reg_space[DMA_CHAN_CUR_TX_BUF_ADDR(default_addrs, channel) / 4] = readl(ioaddr + DMA_CHAN_CUR_TX_BUF_ADDR(dwmac4_addrs, channel)); + reg_space[DMA_CHAN_CUR_RX_BUF_ADDR_HI(default_addrs, channel) / 4] = + readl(ioaddr + DMA_CHAN_CUR_RX_BUF_ADDR_HI(dwmac4_addrs, channel)); reg_space[DMA_CHAN_CUR_RX_BUF_ADDR(default_addrs, channel) / 4] = readl(ioaddr + DMA_CHAN_CUR_RX_BUF_ADDR(dwmac4_addrs, channel)); reg_space[DMA_CHAN_STATUS(default_addrs, channel) / 4] = diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h index 17d9120db5fe..4f980dcd3958 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h @@ -127,7 +127,9 @@ static inline u32 dma_chanx_base_addr(const struct dwmac4_addrs *addrs, #define DMA_CHAN_SLOT_CTRL_STATUS(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x3c) #define DMA_CHAN_CUR_TX_DESC(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x44) #define DMA_CHAN_CUR_RX_DESC(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x4c) +#define DMA_CHAN_CUR_TX_BUF_ADDR_HI(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x50) #define DMA_CHAN_CUR_TX_BUF_ADDR(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x54) +#define DMA_CHAN_CUR_RX_BUF_ADDR_HI(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x58) #define DMA_CHAN_CUR_RX_BUF_ADDR(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x5c) #define DMA_CHAN_STATUS(addrs, x) (dma_chanx_base_addr(addrs, x) + 0x60) -- cgit From c1fa854acc72e783fa6a464d3e35766e06d18d83 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Fri, 25 Oct 2024 20:18:48 -0400 Subject: bcachefs: Fix unhandled transaction 
restart in fallocate This used to not matter, but now we're being more strict. Signed-off-by: Kent Overstreet --- fs/bcachefs/fs-io.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/fs/bcachefs/fs-io.c b/fs/bcachefs/fs-io.c index 15d3f073b824..2456c41b215e 100644 --- a/fs/bcachefs/fs-io.c +++ b/fs/bcachefs/fs-io.c @@ -587,7 +587,7 @@ static noinline int __bchfs_fallocate(struct bch_inode_info *inode, int mode, POS(inode->v.i_ino, start_sector), BTREE_ITER_slots|BTREE_ITER_intent); - while (!ret && bkey_lt(iter.pos, end_pos)) { + while (!ret) { s64 i_sectors_delta = 0; struct quota_res quota_res = { 0 }; struct bkey_s_c k; @@ -598,6 +598,9 @@ static noinline int __bchfs_fallocate(struct bch_inode_info *inode, int mode, bch2_trans_begin(trans); + if (bkey_ge(iter.pos, end_pos)) + break; + ret = bch2_subvolume_get_snapshot(trans, inode->ei_inum.subvol, &snapshot); if (ret) @@ -634,12 +637,15 @@ static noinline int __bchfs_fallocate(struct bch_inode_info *inode, int mode, if (bch2_clamp_data_hole(&inode->v, &hole_start, &hole_end, - opts.data_replicas, true)) + opts.data_replicas, true)) { ret = drop_locks_do(trans, (bch2_clamp_data_hole(&inode->v, &hole_start, &hole_end, opts.data_replicas, false), 0)); + if (ret) + goto bkey_err; + } bch2_btree_iter_set_pos(&iter, POS(iter.pos.inode, hole_start)); if (ret) @@ -667,10 +673,13 @@ static noinline int __bchfs_fallocate(struct bch_inode_info *inode, int mode, bch2_i_sectors_acct(c, inode, "a_res, i_sectors_delta); if (bch2_mark_pagecache_reserved(inode, &hole_start, - iter.pos.offset, true)) - drop_locks_do(trans, + iter.pos.offset, true)) { + ret = drop_locks_do(trans, bch2_mark_pagecache_reserved(inode, &hole_start, iter.pos.offset, false)); + if (ret) + goto bkey_err; + } bkey_err: bch2_quota_reservation_put(c, inode, "a_res); if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) -- cgit From 3fd27e9c57bf12c4eb1e41b87fc1aa579ec772da Mon Sep 17 00:00:00 2001 From: Piotr Zalewski Date: Sat, 26 Oct 2024 00:15:49 +0000 Subject: bcachefs: init freespace inited bits to 0 in bch2_fs_initialize Initialize freespace_initialized bits to 0 in member's flags and update member's cached version for each device in bch2_fs_initialize. It's possible for the bits to be set to 1 before fs is initialized and if call to bch2_trans_mark_dev_sbs (just before bch2_fs_freespace_init) fails bits remain to be 1 which can later indirectly trigger BUG condition in bch2_bucket_alloc_freelist during shutdown. 
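The fix follows a common superblock-editing pattern: force the per-member flag to a known state in the authoritative on-disk record, refresh the in-memory cached mirror from it in the same step, and write the superblock once afterwards, so the two views cannot diverge. A schematic sketch of that pattern (all structures and names are invented for illustration, not the bcachefs types):

#include <stdbool.h>

struct member_sb  { bool freespace_initialized; };  /* on-disk member record */
struct member_cpu { bool freespace_initialized; };  /* cached in-memory mirror */

struct dev {
    struct member_sb  sb;
    struct member_cpu mi;
};

static struct member_cpu mi_to_cpu(const struct member_sb *m)
{
    return (struct member_cpu){ .freespace_initialized = m->freespace_initialized };
}

/* During initial filesystem creation, force the flag to a known state and
 * refresh the cached copy in the same loop; a stale "initialized" bit left
 * behind after a failed later step is exactly what the fix guards against. */
static void init_member_flags(struct dev *devs, int nr)
{
    for (int i = 0; i < nr; i++) {
        devs[i].sb.freespace_initialized = false;
        devs[i].mi = mi_to_cpu(&devs[i].sb);
    }
    /* a single write of the superblock would follow here */
}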
Reported-by: syzbot+2b6a17991a6af64f9489@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2b6a17991a6af64f9489 Fixes: bbe682c76789 ("bcachefs: Ensure devices are always correctly initialized") Suggested-by: Kent Overstreet Signed-off-by: Piotr Zalewski Signed-off-by: Kent Overstreet --- fs/bcachefs/recovery.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c index 0ebc76dd7eb5..32d15aacc069 100644 --- a/fs/bcachefs/recovery.c +++ b/fs/bcachefs/recovery.c @@ -1001,6 +1001,7 @@ int bch2_fs_initialize(struct bch_fs *c) struct bch_inode_unpacked root_inode, lostfound_inode; struct bkey_inode_buf packed_inode; struct qstr lostfound = QSTR("lost+found"); + struct bch_member *m; int ret; bch_notice(c, "initializing new filesystem"); @@ -1017,6 +1018,14 @@ int bch2_fs_initialize(struct bch_fs *c) SET_BCH_SB_VERSION_UPGRADE_COMPLETE(c->disk_sb.sb, bcachefs_metadata_version_current); bch2_write_super(c); } + + for_each_member_device(c, ca) { + m = bch2_members_v2_get_mut(c->disk_sb.sb, ca->dev_idx); + SET_BCH_MEMBER_FREESPACE_INITIALIZED(m, false); + ca->mi = bch2_mi_to_cpu(m); + } + + bch2_write_super(c); mutex_unlock(&c->sb_lock); c->curr_recovery_pass = BCH_RECOVERY_PASS_NR; -- cgit From a34eef6dd179463e70a97bbf8453b7ca21d1e666 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sun, 20 Oct 2024 20:02:09 -0400 Subject: bcachefs: Don't keep tons of cached pointers around We had a bug report where the data update path was creating an extent that failed to validate because it had too many pointers; almost all of them were cached. To fix this, we have: - want_cached_ptr(), a new helper that checks if we even want a cached pointer (is on appropriate target, device is readable). - bch2_extent_set_ptr_cached() now only sets a pointer cached if we want it. - bch2_extent_normalize_by_opts() now ensures that we only have a single cached pointer that we want. While working on this, it was noticed that this doesn't work well with reflinked data and per-file options. Another patch series is coming that plumbs through additional io path options through bch_extent_rebalance, with improved option handling. 
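At its core the change enforces a filter invariant on an extent's pointers: keep at most one cached pointer, and only if it points at a device we would actually promote to and can still read from. A small standalone sketch of enforcing such an invariant over a flat array (the predicate and types are invented for illustration; the real code walks bkey pointer entries under RCU):

#include <stdbool.h>
#include <stddef.h>

struct ptr {
    bool cached;
    bool dev_readable;
    bool on_promote_target;
};

static bool want_cached(const struct ptr *p)
{
    return p->on_promote_target && p->dev_readable;
}

/* Drop every cached pointer we don't want, and all but the first one we do
 * want; non-cached pointers are always kept. Returns the new count. */
static size_t normalize_cached_ptrs(struct ptr *ptrs, size_t n)
{
    size_t out = 0;
    bool have_cached = false;

    for (size_t i = 0; i < n; i++) {
        if (ptrs[i].cached) {
            if (have_cached || !want_cached(&ptrs[i]))
                continue;   /* drop this pointer */
            have_cached = true;
        }
        ptrs[out++] = ptrs[i];
    }
    return out;
}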
Reported-by: Reed Riley Signed-off-by: Kent Overstreet --- fs/bcachefs/data_update.c | 21 +++++++----- fs/bcachefs/data_update.h | 3 +- fs/bcachefs/extents.c | 86 ++++++++++++++++++++++++++++++++++++++--------- fs/bcachefs/extents.h | 5 ++- fs/bcachefs/move.c | 2 +- 5 files changed, 89 insertions(+), 28 deletions(-) diff --git a/fs/bcachefs/data_update.c b/fs/bcachefs/data_update.c index a6ee0beee6b0..8e75a852b358 100644 --- a/fs/bcachefs/data_update.c +++ b/fs/bcachefs/data_update.c @@ -236,7 +236,8 @@ static int __bch2_data_update_index_update(struct btree_trans *trans, if (((1U << i) & m->data_opts.rewrite_ptrs) && (ptr = bch2_extent_has_ptr(old, p, bkey_i_to_s(insert))) && !ptr->cached) { - bch2_extent_ptr_set_cached(bkey_i_to_s(insert), ptr); + bch2_extent_ptr_set_cached(c, &m->op.opts, + bkey_i_to_s(insert), ptr); rewrites_found |= 1U << i; } i++; @@ -284,7 +285,8 @@ restart_drop_extra_replicas: durability - ptr_durability >= m->op.opts.data_replicas) { durability -= ptr_durability; - bch2_extent_ptr_set_cached(bkey_i_to_s(insert), &entry->ptr); + bch2_extent_ptr_set_cached(c, &m->op.opts, + bkey_i_to_s(insert), &entry->ptr); goto restart_drop_extra_replicas; } } @@ -295,7 +297,7 @@ restart_drop_extra_replicas: bch2_extent_ptr_decoded_append(insert, &p); bch2_bkey_narrow_crcs(insert, (struct bch_extent_crc_unpacked) { 0 }); - bch2_extent_normalize(c, bkey_i_to_s(insert)); + bch2_extent_normalize_by_opts(c, &m->op.opts, bkey_i_to_s(insert)); ret = bch2_sum_sector_overwrites(trans, &iter, insert, &should_check_enospc, @@ -558,7 +560,8 @@ void bch2_data_update_to_text(struct printbuf *out, struct data_update *m) int bch2_extent_drop_ptrs(struct btree_trans *trans, struct btree_iter *iter, struct bkey_s_c k, - struct data_update_opts data_opts) + struct bch_io_opts *io_opts, + struct data_update_opts *data_opts) { struct bch_fs *c = trans->c; struct bkey_i *n; @@ -569,11 +572,11 @@ int bch2_extent_drop_ptrs(struct btree_trans *trans, if (ret) return ret; - while (data_opts.kill_ptrs) { - unsigned i = 0, drop = __fls(data_opts.kill_ptrs); + while (data_opts->kill_ptrs) { + unsigned i = 0, drop = __fls(data_opts->kill_ptrs); bch2_bkey_drop_ptrs_noerror(bkey_i_to_s(n), ptr, i++ == drop); - data_opts.kill_ptrs ^= 1U << drop; + data_opts->kill_ptrs ^= 1U << drop; } /* @@ -581,7 +584,7 @@ int bch2_extent_drop_ptrs(struct btree_trans *trans, * will do the appropriate thing with it (turning it into a * KEY_TYPE_error key, or just a discard if it was a cached extent) */ - bch2_extent_normalize(c, bkey_i_to_s(n)); + bch2_extent_normalize_by_opts(c, io_opts, bkey_i_to_s(n)); /* * Since we're not inserting through an extent iterator @@ -720,7 +723,7 @@ int bch2_data_update_init(struct btree_trans *trans, m->data_opts.rewrite_ptrs = 0; /* if iter == NULL, it's just a promote */ if (iter) - ret = bch2_extent_drop_ptrs(trans, iter, k, m->data_opts); + ret = bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &m->data_opts); goto out; } diff --git a/fs/bcachefs/data_update.h b/fs/bcachefs/data_update.h index 8d36365bdea8..e4b50723428e 100644 --- a/fs/bcachefs/data_update.h +++ b/fs/bcachefs/data_update.h @@ -40,7 +40,8 @@ void bch2_data_update_read_done(struct data_update *, int bch2_extent_drop_ptrs(struct btree_trans *, struct btree_iter *, struct bkey_s_c, - struct data_update_opts); + struct bch_io_opts *, + struct data_update_opts *); void bch2_data_update_exit(struct data_update *); int bch2_data_update_init(struct btree_trans *, struct btree_iter *, diff --git a/fs/bcachefs/extents.c 
b/fs/bcachefs/extents.c index cc0d22085aef..c4e91d123849 100644 --- a/fs/bcachefs/extents.c +++ b/fs/bcachefs/extents.c @@ -978,31 +978,54 @@ bch2_extent_has_ptr(struct bkey_s_c k1, struct extent_ptr_decoded p1, struct bke return NULL; } -void bch2_extent_ptr_set_cached(struct bkey_s k, struct bch_extent_ptr *ptr) +static bool want_cached_ptr(struct bch_fs *c, struct bch_io_opts *opts, + struct bch_extent_ptr *ptr) +{ + if (!opts->promote_target || + !bch2_dev_in_target(c, ptr->dev, opts->promote_target)) + return false; + + struct bch_dev *ca = bch2_dev_rcu_noerror(c, ptr->dev); + + return ca && bch2_dev_is_readable(ca) && !dev_ptr_stale_rcu(ca, ptr); +} + +void bch2_extent_ptr_set_cached(struct bch_fs *c, + struct bch_io_opts *opts, + struct bkey_s k, + struct bch_extent_ptr *ptr) { struct bkey_ptrs ptrs = bch2_bkey_ptrs(k); union bch_extent_entry *entry; - union bch_extent_entry *ec = NULL; + struct extent_ptr_decoded p; - bkey_extent_entry_for_each(ptrs, entry) { + rcu_read_lock(); + if (!want_cached_ptr(c, opts, ptr)) { + bch2_bkey_drop_ptr_noerror(k, ptr); + goto out; + } + + /* + * Stripes can't contain cached data, for - reasons. + * + * Possibly something we can fix in the future? + */ + bkey_for_each_ptr_decode(k.k, ptrs, p, entry) if (&entry->ptr == ptr) { - ptr->cached = true; - if (ec) - extent_entry_drop(k, ec); - return; + if (p.has_ec) + bch2_bkey_drop_ptr_noerror(k, ptr); + else + ptr->cached = true; + goto out; } - if (extent_entry_is_stripe_ptr(entry)) - ec = entry; - else if (extent_entry_is_ptr(entry)) - ec = NULL; - } - BUG(); +out: + rcu_read_unlock(); } /* - * bch_extent_normalize - clean up an extent, dropping stale pointers etc. + * bch2_extent_normalize - clean up an extent, dropping stale pointers etc. * * Returns true if @k should be dropped entirely * @@ -1016,8 +1039,39 @@ bool bch2_extent_normalize(struct bch_fs *c, struct bkey_s k) rcu_read_lock(); bch2_bkey_drop_ptrs(k, ptr, ptr->cached && - (ca = bch2_dev_rcu(c, ptr->dev)) && - dev_ptr_stale_rcu(ca, ptr) > 0); + (!(ca = bch2_dev_rcu(c, ptr->dev)) || + dev_ptr_stale_rcu(ca, ptr) > 0)); + rcu_read_unlock(); + + return bkey_deleted(k.k); +} + +/* + * bch2_extent_normalize_by_opts - clean up an extent, dropping stale pointers etc. + * + * Like bch2_extent_normalize(), but also only keeps a single cached pointer on + * the promote target. 
+ */ +bool bch2_extent_normalize_by_opts(struct bch_fs *c, + struct bch_io_opts *opts, + struct bkey_s k) +{ + struct bkey_ptrs ptrs; + bool have_cached_ptr; + + rcu_read_lock(); +restart_drop_ptrs: + ptrs = bch2_bkey_ptrs(k); + have_cached_ptr = false; + + bkey_for_each_ptr(ptrs, ptr) + if (ptr->cached) { + if (have_cached_ptr || !want_cached_ptr(c, opts, ptr)) { + bch2_bkey_drop_ptr(k, ptr); + goto restart_drop_ptrs; + } + have_cached_ptr = true; + } rcu_read_unlock(); return bkey_deleted(k.k); diff --git a/fs/bcachefs/extents.h b/fs/bcachefs/extents.h index 923a5f1849a8..bcffcf60aaaf 100644 --- a/fs/bcachefs/extents.h +++ b/fs/bcachefs/extents.h @@ -686,9 +686,12 @@ bool bch2_extents_match(struct bkey_s_c, struct bkey_s_c); struct bch_extent_ptr * bch2_extent_has_ptr(struct bkey_s_c, struct extent_ptr_decoded, struct bkey_s); -void bch2_extent_ptr_set_cached(struct bkey_s, struct bch_extent_ptr *); +void bch2_extent_ptr_set_cached(struct bch_fs *, struct bch_io_opts *, + struct bkey_s, struct bch_extent_ptr *); +bool bch2_extent_normalize_by_opts(struct bch_fs *, struct bch_io_opts *, struct bkey_s); bool bch2_extent_normalize(struct bch_fs *, struct bkey_s); + void bch2_extent_ptr_to_text(struct printbuf *out, struct bch_fs *, const struct bch_extent_ptr *); void bch2_bkey_ptrs_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c); diff --git a/fs/bcachefs/move.c b/fs/bcachefs/move.c index 8c456d8b8b99..0ef4a86850bb 100644 --- a/fs/bcachefs/move.c +++ b/fs/bcachefs/move.c @@ -266,7 +266,7 @@ int bch2_move_extent(struct moving_context *ctxt, if (!data_opts.rewrite_ptrs && !data_opts.extra_replicas) { if (data_opts.kill_ptrs) - return bch2_extent_drop_ptrs(trans, iter, k, data_opts); + return bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &data_opts); return 0; } -- cgit From e0fafac5c4b61501f60c3841649170424eda641f Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Fri, 18 Oct 2024 02:26:59 -0400 Subject: bcachefs: Don't filter partial list buckets in open_buckets_to_text() these are an important source of stranded buckets we need to be able to watch Signed-off-by: Kent Overstreet --- fs/bcachefs/alloc_foreground.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/bcachefs/alloc_foreground.c b/fs/bcachefs/alloc_foreground.c index 5836870ab882..32f2a24104a6 100644 --- a/fs/bcachefs/alloc_foreground.c +++ b/fs/bcachefs/alloc_foreground.c @@ -1610,8 +1610,7 @@ void bch2_open_buckets_to_text(struct printbuf *out, struct bch_fs *c, ob < c->open_buckets + ARRAY_SIZE(c->open_buckets); ob++) { spin_lock(&ob->lock); - if (ob->valid && !ob->on_partial_list && - (!ca || ob->dev == ca->dev_idx)) + if (ob->valid && (!ca || ob->dev == ca->dev_idx)) bch2_open_bucket_to_text(out, c, ob); spin_unlock(&ob->lock); } -- cgit From 778ac324ccfad7b941bba604118e38a19800657b Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Sat, 26 Oct 2024 20:21:41 -0400 Subject: bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets Open buckets on the partial list should not count as allocated when we're trying to allocate from the partial list. 
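The shape of the fix is a per-device count of buckets parked on the partial list that is folded into the "can this device still serve an allocation?" check, so parked buckets are no longer mistaken for exhausted space. A rough standalone sketch of that bookkeeping (deliberately simplified, with no locking or RCU, and invented names):

#include <stdbool.h>

struct dev_alloc_state {
    unsigned long buckets_free;        /* buckets still free for allocation */
    unsigned long nr_partial_buckets;  /* open buckets parked on the partial list */
};

/* Parking an open bucket takes it out of the free accounting... */
static void park_on_partial_list(struct dev_alloc_state *d)
{
    d->buckets_free--;
    d->nr_partial_buckets++;
}

/* ...and reclaiming it from the partial list reverses that. */
static void take_from_partial_list(struct dev_alloc_state *d)
{
    d->nr_partial_buckets--;
}

/* When allocating from the partial list, parked buckets must still count as
 * available; otherwise the device looks exhausted and the allocator can wedge
 * on -ENOSPC even though reusable buckets exist. */
static bool dev_can_allocate(const struct dev_alloc_state *d)
{
    return d->buckets_free + d->nr_partial_buckets > 0;
}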
Signed-off-by: Kent Overstreet --- fs/bcachefs/alloc_foreground.c | 16 +++++++++++++++- fs/bcachefs/bcachefs.h | 1 + 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/fs/bcachefs/alloc_foreground.c b/fs/bcachefs/alloc_foreground.c index 32f2a24104a6..372178c8d416 100644 --- a/fs/bcachefs/alloc_foreground.c +++ b/fs/bcachefs/alloc_foreground.c @@ -162,6 +162,10 @@ static void open_bucket_free_unused(struct bch_fs *c, struct open_bucket *ob) ARRAY_SIZE(c->open_buckets_partial)); spin_lock(&c->freelist_lock); + rcu_read_lock(); + bch2_dev_rcu(c, ob->dev)->nr_partial_buckets++; + rcu_read_unlock(); + ob->on_partial_list = true; c->open_buckets_partial[c->open_buckets_partial_nr++] = ob - c->open_buckets; @@ -972,7 +976,7 @@ static int bucket_alloc_set_partial(struct bch_fs *c, u64 avail; bch2_dev_usage_read_fast(ca, &usage); - avail = dev_buckets_free(ca, usage, watermark); + avail = dev_buckets_free(ca, usage, watermark) + ca->nr_partial_buckets; if (!avail) continue; @@ -981,6 +985,10 @@ static int bucket_alloc_set_partial(struct bch_fs *c, i); ob->on_partial_list = false; + rcu_read_lock(); + bch2_dev_rcu(c, ob->dev)->nr_partial_buckets--; + rcu_read_unlock(); + ret = add_new_bucket(c, ptrs, devs_may_alloc, nr_replicas, nr_effective, have_cache, ob); @@ -1191,7 +1199,13 @@ void bch2_open_buckets_stop(struct bch_fs *c, struct bch_dev *ca, --c->open_buckets_partial_nr; swap(c->open_buckets_partial[i], c->open_buckets_partial[c->open_buckets_partial_nr]); + ob->on_partial_list = false; + + rcu_read_lock(); + bch2_dev_rcu(c, ob->dev)->nr_partial_buckets--; + rcu_read_unlock(); + spin_unlock(&c->freelist_lock); bch2_open_bucket_put(c, ob); spin_lock(&c->freelist_lock); diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h index f4151ee51b03..e94a83b8113e 100644 --- a/fs/bcachefs/bcachefs.h +++ b/fs/bcachefs/bcachefs.h @@ -555,6 +555,7 @@ struct bch_dev { u64 alloc_cursor[3]; unsigned nr_open_buckets; + unsigned nr_partial_buckets; unsigned nr_btree_reserve; size_t inc_gen_needs_gc; -- cgit From ca959e328b2243687aa0b95de01414d13e4f3ade Mon Sep 17 00:00:00 2001 From: Gaosheng Cui Date: Sat, 26 Oct 2024 18:26:58 +0800 Subject: bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get() The function ec_new_stripe_head_alloc() returns nullptr if kzalloc() fails. It is crucial to verify its return value before dereferencing it to avoid a potential nullptr dereference. 
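The defect class is the familiar one of dereferencing an allocation result before checking it; the fix applies the usual kernel idiom of turning a failed allocation into an encoded error pointer and bailing out before any field access. A minimal userspace illustration of that idiom (the ERR_PTR-style helpers are simplified stand-ins and the types are hypothetical):

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct stripe_head { int redundancy; };

/* Simplified stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
static inline void *err_ptr(long err) { return (void *)(intptr_t)err; }
static inline int is_err(const void *p) { return (uintptr_t)p >= (uintptr_t)-4095; }

static struct stripe_head *stripe_head_get(int redundancy)
{
    struct stripe_head *h = calloc(1, sizeof(*h));

    /* Check the allocation before touching any member... */
    if (!h)
        return err_ptr(-ENOMEM);

    /* ...only then is it safe to initialize and use it. */
    h->redundancy = redundancy;
    return h;
}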
Fixes: 035d72f72c91 ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices") Signed-off-by: Gaosheng Cui Signed-off-by: Kent Overstreet --- fs/bcachefs/ec.c | 4 ++++ fs/bcachefs/errcode.h | 1 + 2 files changed, 5 insertions(+) diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c index a0aa5bb467d9..749dcf368841 100644 --- a/fs/bcachefs/ec.c +++ b/fs/bcachefs/ec.c @@ -1870,6 +1870,10 @@ __bch2_ec_stripe_head_get(struct btree_trans *trans, } h = ec_new_stripe_head_alloc(c, disk_label, algo, redundancy, watermark); + if (!h) { + h = ERR_PTR(-BCH_ERR_ENOMEM_stripe_head_alloc); + goto err; + } found: if (h->rw_devs_change_count != c->rw_devs_change_count) ec_stripe_head_devs_update(c, h); diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h index b6cbd716000b..a1bc6c7a8ba0 100644 --- a/fs/bcachefs/errcode.h +++ b/fs/bcachefs/errcode.h @@ -83,6 +83,7 @@ x(ENOMEM, ENOMEM_fs_other_alloc) \ x(ENOMEM, ENOMEM_dev_alloc) \ x(ENOMEM, ENOMEM_disk_accounting) \ + x(ENOMEM, ENOMEM_stripe_head_alloc) \ x(ENOSPC, ENOSPC_disk_reservation) \ x(ENOSPC, ENOSPC_bucket_alloc) \ x(ENOSPC, ENOSPC_disk_label_add) \ -- cgit From 3726a1970bd72419aa7a54f574635f855b98d67a Mon Sep 17 00:00:00 2001 From: Piotr Zalewski Date: Sun, 27 Oct 2024 19:46:52 +0000 Subject: bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek Add NULL check for key returned from bch2_btree_and_journal_iter_peek in btree_node_iter_and_journal_peek to avoid NULL ptr dereference in bch2_bkey_buf_reassemble. When key returned from bch2_btree_and_journal_iter_peek is NULL it means that btree topology needs repair. Print topology error message with position at which node wasn't found, its parent node information and btree_id with level. Return error code returned by bch2_topology_error to ensure that topology error is handled properly by recovery. 
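The underlying pattern is defensive handling of a peek that can legitimately come back empty when the on-disk structure is inconsistent: rather than passing the missing result onward, build a message that says where the structure is broken and return an error code the recovery path understands. A compact userspace-flavoured sketch of that pattern (names and types are illustrative only):

#include <errno.h>
#include <stdio.h>

struct bkey { int valid; long inode, offset; };

/* Stand-in for the iterator peek; an invalid key here means the expected
 * child node simply is not there, i.e. the on-disk topology is broken. */
static struct bkey iter_peek(long inode, long offset)
{
    return (struct bkey){ .valid = 0, .inode = inode, .offset = offset };
}

static int lookup_child_node(long inode, long offset)
{
    struct bkey k = iter_peek(inode, offset);

    if (!k.valid) {
        /* Say where the structure is broken, then return an error the
         * recovery machinery can act on, instead of letting a later
         * helper dereference a key that was never found. */
        fprintf(stderr, "node not found at pos %ld:%ld\n", inode, offset);
        return -EIO;
    }

    /* the normal path would reassemble and use the key here */
    return 0;
}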
Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657 Fixes: 5222a4607cd8 ("bcachefs: BTREE_ITER_WITH_JOURNAL") Suggested-by: Alan Huang Suggested-by: Kent Overstreet Signed-off-by: Piotr Zalewski Signed-off-by: Kent Overstreet --- fs/bcachefs/btree_iter.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/fs/bcachefs/btree_iter.c b/fs/bcachefs/btree_iter.c index 0883cf6e1a3e..eef9b89c561d 100644 --- a/fs/bcachefs/btree_iter.c +++ b/fs/bcachefs/btree_iter.c @@ -882,6 +882,18 @@ static noinline int btree_node_iter_and_journal_peek(struct btree_trans *trans, __bch2_btree_and_journal_iter_init_node_iter(trans, &jiter, l->b, l->iter, path->pos); k = bch2_btree_and_journal_iter_peek(&jiter); + if (!k.k) { + struct printbuf buf = PRINTBUF; + + prt_str(&buf, "node not found at pos "); + bch2_bpos_to_text(&buf, path->pos); + prt_str(&buf, " at btree "); + bch2_btree_pos_to_text(&buf, c, l->b); + + ret = bch2_fs_topology_error(c, "%s", buf.buf); + printbuf_exit(&buf); + goto err; + } bch2_bkey_buf_reassemble(out, c, k); @@ -889,6 +901,7 @@ static noinline int btree_node_iter_and_journal_peek(struct btree_trans *trans, c->opts.btree_node_prefetch) ret = btree_path_prefetch_j(trans, path, &jiter); +err: bch2_btree_and_journal_iter_exit(&jiter); return ret; } -- cgit From 66600fac7a984dea4ae095411f644770b2561ede Mon Sep 17 00:00:00 2001 From: Furong Xu <0x1207@gmail.com> Date: Mon, 21 Oct 2024 14:10:23 +0800 Subject: net: stmmac: TSO: Fix unbalanced DMA map/unmap for non-paged SKB data In case the non-paged data of a SKB carries protocol header and protocol payload to be transmitted on a certain platform that the DMA AXI address width is configured to 40-bit/48-bit, or the size of the non-paged data is bigger than TSO_MAX_BUFF_SIZE on a certain platform that the DMA AXI address width is configured to 32-bit, then this SKB requires at least two DMA transmit descriptors to serve it. For example, three descriptors are allocated to split one DMA buffer mapped from one piece of non-paged data: dma_desc[N + 0], dma_desc[N + 1], dma_desc[N + 2]. Then three elements of tx_q->tx_skbuff_dma[] will be allocated to hold extra information to be reused in stmmac_tx_clean(): tx_q->tx_skbuff_dma[N + 0], tx_q->tx_skbuff_dma[N + 1], tx_q->tx_skbuff_dma[N + 2]. Now we focus on tx_q->tx_skbuff_dma[entry].buf, which is the DMA buffer address returned by DMA mapping call. stmmac_tx_clean() will try to unmap the DMA buffer _ONLY_IF_ tx_q->tx_skbuff_dma[entry].buf is a valid buffer address. The expected behavior that saves DMA buffer address of this non-paged data to tx_q->tx_skbuff_dma[entry].buf is: tx_q->tx_skbuff_dma[N + 0].buf = NULL; tx_q->tx_skbuff_dma[N + 1].buf = NULL; tx_q->tx_skbuff_dma[N + 2].buf = dma_map_single(); Unfortunately, the current code misbehaves like this: tx_q->tx_skbuff_dma[N + 0].buf = dma_map_single(); tx_q->tx_skbuff_dma[N + 1].buf = NULL; tx_q->tx_skbuff_dma[N + 2].buf = NULL; On the stmmac_tx_clean() side, when dma_desc[N + 0] is closed by the DMA engine, tx_q->tx_skbuff_dma[N + 0].buf is a valid buffer address obviously, then the DMA buffer will be unmapped immediately. There may be a rare case that the DMA engine does not finish the pending dma_desc[N + 1], dma_desc[N + 2] yet. 
Now things will go horribly wrong, DMA is going to access a unmapped/unreferenced memory region, corrupted data will be transmited or iommu fault will be triggered :( In contrast, the for-loop that maps SKB fragments behaves perfectly as expected, and that is how the driver should do for both non-paged data and paged frags actually. This patch corrects DMA map/unmap sequences by fixing the array index for tx_q->tx_skbuff_dma[entry].buf when assigning DMA buffer address. Tested and verified on DWXGMAC CORE 3.20a Reported-by: Suraj Jaiswal Fixes: f748be531d70 ("stmmac: support new GMAC4") Signed-off-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Hariprasad Kelam Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241021061023.2162701-1-0x1207@gmail.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index d3895d7eecfc..208dbc68aaf9 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -4304,11 +4304,6 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) if (dma_mapping_error(priv->device, des)) goto dma_map_err; - tx_q->tx_skbuff_dma[first_entry].buf = des; - tx_q->tx_skbuff_dma[first_entry].len = skb_headlen(skb); - tx_q->tx_skbuff_dma[first_entry].map_as_page = false; - tx_q->tx_skbuff_dma[first_entry].buf_type = STMMAC_TXBUF_T_SKB; - if (priv->dma_cap.addr64 <= 32) { first->des0 = cpu_to_le32(des); @@ -4327,6 +4322,23 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev) stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue); + /* In case two or more DMA transmit descriptors are allocated for this + * non-paged SKB data, the DMA buffer address should be saved to + * tx_q->tx_skbuff_dma[].buf corresponding to the last descriptor, + * and leave the other tx_q->tx_skbuff_dma[].buf as NULL to guarantee + * that stmmac_tx_clean() does not unmap the entire DMA buffer too early + * since the tail areas of the DMA buffer can be accessed by DMA engine + * sooner or later. + * By saving the DMA buffer address to tx_q->tx_skbuff_dma[].buf + * corresponding to the last descriptor, stmmac_tx_clean() will unmap + * this DMA buffer right after the DMA engine completely finishes the + * full buffer transmission. 
+ */ + tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = des; + tx_q->tx_skbuff_dma[tx_q->cur_tx].len = skb_headlen(skb); + tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = false; + tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB; + /* Prepare fragments */ for (i = 0; i < nfrags; i++) { const skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; -- cgit From 1c10941e34c5fdc0357e46a25bd130d9cf40b925 Mon Sep 17 00:00:00 2001 From: Pierre Gondois Date: Mon, 28 Oct 2024 13:56:56 +0100 Subject: ACPI: CPPC: Make rmw_lock a raw_spin_lock The following BUG was triggered: ============================= [ BUG: Invalid wait context ] 6.12.0-rc2-XXX #406 Not tainted ----------------------------- kworker/1:1/62 is trying to lock: ffffff8801593030 (&cpc_ptr->rmw_lock){+.+.}-{3:3}, at: cpc_write+0xcc/0x370 other info that might help us debug this: context-{5:5} 2 locks held by kworker/1:1/62: #0: ffffff897ef5ec98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2c/0x50 #1: ffffff880154e238 (&sg_policy->update_lock){....}-{2:2}, at: sugov_update_shared+0x3c/0x280 stack backtrace: CPU: 1 UID: 0 PID: 62 Comm: kworker/1:1 Not tainted 6.12.0-rc2-g9654bd3e8806 #406 Workqueue: 0x0 (events) Call trace: dump_backtrace+0xa4/0x130 show_stack+0x20/0x38 dump_stack_lvl+0x90/0xd0 dump_stack+0x18/0x28 __lock_acquire+0x480/0x1ad8 lock_acquire+0x114/0x310 _raw_spin_lock+0x50/0x70 cpc_write+0xcc/0x370 cppc_set_perf+0xa0/0x3a8 cppc_cpufreq_fast_switch+0x40/0xc0 cpufreq_driver_fast_switch+0x4c/0x218 sugov_update_shared+0x234/0x280 update_load_avg+0x6ec/0x7b8 dequeue_entities+0x108/0x830 dequeue_task_fair+0x58/0x408 __schedule+0x4f0/0x1070 schedule+0x54/0x130 worker_thread+0xc0/0x2e8 kthread+0x130/0x148 ret_from_fork+0x10/0x20 sugov_update_shared() locks a raw_spinlock while cpc_write() locks a spinlock. To have a correct wait-type order, update rmw_lock to a raw spinlock and ensure that interrupts will be disabled on the CPU holding it. Fixes: 60949b7b8054 ("ACPI: CPPC: Fix MASK_VAL() usage") Signed-off-by: Pierre Gondois Link: https://patch.msgid.link/20241028125657.1271512-1-pierre.gondois@arm.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. 
Wysocki --- drivers/acpi/cppc_acpi.c | 9 +++++---- include/acpi/cppc_acpi.h | 2 +- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c index c3fc2c05d868..1a40f0514eaa 100644 --- a/drivers/acpi/cppc_acpi.c +++ b/drivers/acpi/cppc_acpi.c @@ -867,7 +867,7 @@ int acpi_cppc_processor_probe(struct acpi_processor *pr) /* Store CPU Logical ID */ cpc_ptr->cpu_id = pr->id; - spin_lock_init(&cpc_ptr->rmw_lock); + raw_spin_lock_init(&cpc_ptr->rmw_lock); /* Parse PSD data for this CPU */ ret = acpi_get_psd(cpc_ptr, handle); @@ -1087,6 +1087,7 @@ static int cpc_write(int cpu, struct cpc_register_resource *reg_res, u64 val) int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu); struct cpc_reg *reg = ®_res->cpc_entry.reg; struct cpc_desc *cpc_desc; + unsigned long flags; size = GET_BIT_WIDTH(reg); @@ -1126,7 +1127,7 @@ static int cpc_write(int cpu, struct cpc_register_resource *reg_res, u64 val) return -ENODEV; } - spin_lock(&cpc_desc->rmw_lock); + raw_spin_lock_irqsave(&cpc_desc->rmw_lock, flags); switch (size) { case 8: prev_val = readb_relaxed(vaddr); @@ -1141,7 +1142,7 @@ static int cpc_write(int cpu, struct cpc_register_resource *reg_res, u64 val) prev_val = readq_relaxed(vaddr); break; default: - spin_unlock(&cpc_desc->rmw_lock); + raw_spin_unlock_irqrestore(&cpc_desc->rmw_lock, flags); return -EFAULT; } val = MASK_VAL_WRITE(reg, prev_val, val); @@ -1174,7 +1175,7 @@ static int cpc_write(int cpu, struct cpc_register_resource *reg_res, u64 val) } if (reg->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY) - spin_unlock(&cpc_desc->rmw_lock); + raw_spin_unlock_irqrestore(&cpc_desc->rmw_lock, flags); return ret_val; } diff --git a/include/acpi/cppc_acpi.h b/include/acpi/cppc_acpi.h index 76e44e102780..62d368bcd9ec 100644 --- a/include/acpi/cppc_acpi.h +++ b/include/acpi/cppc_acpi.h @@ -65,7 +65,7 @@ struct cpc_desc { int write_cmd_status; int write_cmd_id; /* Lock used for RMW operations in cpc_write() */ - spinlock_t rmw_lock; + raw_spinlock_t rmw_lock; struct cpc_register_resource cpc_regs[MAX_CPC_REG_ENT]; struct acpi_psd_package domain_info; struct kobject kobj; -- cgit From cc8475a07cf34891bf11a63025659d3537b638ef Mon Sep 17 00:00:00 2001 From: Dmitry Yashin Date: Tue, 29 Oct 2024 02:33:12 +0500 Subject: ASoC: dt-bindings: rockchip,rk3308-codec: add port property Fix DTB warnings when rk3308-codec used with audio-graph-card by documenting port property: codec@ff560000: 'port' does not match any of the regexes: 'pinctrl-[0-9]+' Signed-off-by: Dmitry Yashin Reviewed-by: Luca Ceresoli Link: https://patch.msgid.link/20241028213314.476776-2-dmt.yashin@gmail.com Signed-off-by: Mark Brown --- Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml b/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml index ecf3d7d968c8..2cf229a076f0 100644 --- a/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml +++ b/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml @@ -48,6 +48,10 @@ properties: - const: mclk_rx - const: hclk + port: + $ref: audio-graph-port.yaml# + unevaluatedProperties: false + resets: maxItems: 1 -- cgit From 5db91545ef8150c45a526675ef99e8998b648a41 Mon Sep 17 00:00:00 2001 From: Aboorva Devarajan Date: Sat, 26 Oct 2024 00:20:20 +0530 Subject: sched: Pass correct scheduling policy to __setscheduler_class Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs 
switched_from_fair()") overlooked that __setscheduler_prio(), now __setscheduler_class() relies on p->policy for task_should_scx(), and moved the call before __setscheduler_params() updates it, causing it to be using the old p->policy value. Resolve this by changing task_should_scx() to take the policy itself instead of a task pointer, such that __sched_setscheduler() can pass in the updated policy. Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") Signed-off-by: Aboorva Devarajan Signed-off-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo --- kernel/sched/core.c | 8 ++++---- kernel/sched/ext.c | 8 ++++---- kernel/sched/ext.h | 2 +- kernel/sched/sched.h | 2 +- kernel/sched/syscalls.c | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dbfb5717d6af..719e0ed1e976 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4711,7 +4711,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) if (rt_prio(p->prio)) { p->sched_class = &rt_sched_class; #ifdef CONFIG_SCHED_CLASS_EXT - } else if (task_should_scx(p)) { + } else if (task_should_scx(p->policy)) { p->sched_class = &ext_sched_class; #endif } else { @@ -7025,7 +7025,7 @@ int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flag } EXPORT_SYMBOL(default_wake_function); -const struct sched_class *__setscheduler_class(struct task_struct *p, int prio) +const struct sched_class *__setscheduler_class(int policy, int prio) { if (dl_prio(prio)) return &dl_sched_class; @@ -7034,7 +7034,7 @@ const struct sched_class *__setscheduler_class(struct task_struct *p, int prio) return &rt_sched_class; #ifdef CONFIG_SCHED_CLASS_EXT - if (task_should_scx(p)) + if (task_should_scx(policy)) return &ext_sched_class; #endif @@ -7142,7 +7142,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) queue_flag &= ~DEQUEUE_MOVE; prev_class = p->sched_class; - next_class = __setscheduler_class(p, prio); + next_class = __setscheduler_class(p->policy, prio); if (prev_class != next_class && p->se.sched_delayed) dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK); diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 5900b06fd036..40bdfe84e4f0 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4256,14 +4256,14 @@ static const struct kset_uevent_ops scx_uevent_ops = { * Used by sched_fork() and __setscheduler_prio() to pick the matching * sched_class. dl/rt are already handled. 
*/ -bool task_should_scx(struct task_struct *p) +bool task_should_scx(int policy) { if (!scx_enabled() || unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING)) return false; if (READ_ONCE(scx_switching_all)) return true; - return p->policy == SCHED_EXT; + return policy == SCHED_EXT; } /** @@ -4493,7 +4493,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work) sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); - p->sched_class = __setscheduler_class(p, p->prio); + p->sched_class = __setscheduler_class(p->policy, p->prio); check_class_changing(task_rq(p), p, old_class); sched_enq_and_set_task(&ctx); @@ -5204,7 +5204,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); p->scx.slice = SCX_SLICE_DFL; - p->sched_class = __setscheduler_class(p, p->prio); + p->sched_class = __setscheduler_class(p->policy, p->prio); check_class_changing(task_rq(p), p, old_class); sched_enq_and_set_task(&ctx); diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index 246019519231..b1675bb59fc4 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -18,7 +18,7 @@ bool scx_can_stop_tick(struct rq *rq); void scx_rq_activate(struct rq *rq); void scx_rq_deactivate(struct rq *rq); int scx_check_setscheduler(struct task_struct *p, int policy); -bool task_should_scx(struct task_struct *p); +bool task_should_scx(int policy); void init_sched_ext_class(void); static inline u32 scx_cpuperf_target(s32 cpu) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9f9d1cc390b1..6c54a57275cc 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3830,7 +3830,7 @@ static inline int rt_effective_prio(struct task_struct *p, int prio) extern int __sched_setscheduler(struct task_struct *p, const struct sched_attr *attr, bool user, bool pi); extern int __sched_setaffinity(struct task_struct *p, struct affinity_context *ctx); -extern const struct sched_class *__setscheduler_class(struct task_struct *p, int prio); +extern const struct sched_class *__setscheduler_class(int policy, int prio); extern void set_load_weight(struct task_struct *p, bool update_load); extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 0470bcc3d204..24f9f90b6574 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -707,7 +707,7 @@ change: } prev_class = p->sched_class; - next_class = __setscheduler_class(p, newprio); + next_class = __setscheduler_class(policy, newprio); if (prev_class != next_class && p->se.sched_delayed) dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK); -- cgit From 041db4bbe04e8e0b48350b3bbbd9a799794d5c1e Mon Sep 17 00:00:00 2001 From: Alexey Klimov Date: Tue, 22 Oct 2024 04:31:30 +0100 Subject: ASoC: codecs: wcd937x: add missing LO Switch control The wcd937x supports also AUX input but the control that sets correct soundwire port for this is missing. This control is required for audio playback, for instance, on qrb4210 RB2 board as well as on other SoCs. 
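For context, a SOC_SINGLE_EXT entry such as the new "LO Switch" simply binds an ALSA mixer control to a driver get/put pair. The sketch below shows the typical shape of such handlers for a SoundWire port switch; it is illustrative only: the driver's real callbacks are wcd937x_get_swr_port()/wcd937x_set_swr_port(), and struct example_priv with its port_enable[] bookkeeping is a simplified stand-in, not the wcd937x data layout.

#include <sound/soc.h>

struct example_priv {
	bool port_enable[8];	/* stand-in: one flag per SoundWire port */
};

static int example_get_swr_port(struct snd_kcontrol *kcontrol,
				struct snd_ctl_elem_value *ucontrol)
{
	struct snd_soc_component *comp = snd_soc_kcontrol_component(kcontrol);
	struct soc_mixer_control *mc =
		(struct soc_mixer_control *)kcontrol->private_value;
	struct example_priv *priv = snd_soc_component_get_drvdata(comp);

	/* mc->reg carries the port identifier (e.g. WCD937X_LO above) */
	ucontrol->value.integer.value[0] = priv->port_enable[mc->reg];
	return 0;
}

static int example_set_swr_port(struct snd_kcontrol *kcontrol,
				struct snd_ctl_elem_value *ucontrol)
{
	struct snd_soc_component *comp = snd_soc_kcontrol_component(kcontrol);
	struct soc_mixer_control *mc =
		(struct soc_mixer_control *)kcontrol->private_value;
	struct example_priv *priv = snd_soc_component_get_drvdata(comp);
	bool enable = !!ucontrol->value.integer.value[0];

	if (priv->port_enable[mc->reg] == enable)
		return 0;		/* no change */

	/* remember that this port must be part of the SoundWire stream */
	priv->port_enable[mc->reg] = enable;
	return 1;			/* value changed */
}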
Reported-by: Adam Skladowski Reported-by: Prasad Kumpatla Suggested-by: Adam Skladowski Suggested-by: Prasad Kumpatla Cc: Srinivas Kandagatla Cc: Mohammad Rafi Shaik Signed-off-by: Alexey Klimov Link: https://patch.msgid.link/20241022033132.787416-2-alexey.klimov@linaro.org Signed-off-by: Mark Brown --- sound/soc/codecs/wcd937x.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/sound/soc/codecs/wcd937x.c b/sound/soc/codecs/wcd937x.c index 45f32d281908..0f0d2537d322 100644 --- a/sound/soc/codecs/wcd937x.c +++ b/sound/soc/codecs/wcd937x.c @@ -2049,6 +2049,8 @@ static const struct snd_kcontrol_new wcd937x_snd_controls[] = { wcd937x_get_swr_port, wcd937x_set_swr_port), SOC_SINGLE_EXT("HPHR Switch", WCD937X_HPH_R, 0, 1, 0, wcd937x_get_swr_port, wcd937x_set_swr_port), + SOC_SINGLE_EXT("LO Switch", WCD937X_LO, 0, 1, 0, + wcd937x_get_swr_port, wcd937x_set_swr_port), SOC_SINGLE_EXT("ADC1 Switch", WCD937X_ADC1, 1, 1, 0, wcd937x_get_swr_port, wcd937x_set_swr_port), -- cgit From 107a5c853eef5336a9846e7dd2f9184b6e3c07c7 Mon Sep 17 00:00:00 2001 From: Alexey Klimov Date: Tue, 22 Oct 2024 04:31:31 +0100 Subject: ASoC: codecs: wcd937x: relax the AUX PDM watchdog On a system with wcd937x, rxmacro and Qualcomm audio DSP, which is pretty common set of devices on Qualcomm platforms, and due to the order of how DAPM widgets are powered on (they are sorted), there is a small time window when wcd937x chip is online and expects the flow of incoming data but rxmacro is not yet online. When wcd937x is programmed to receive data via AUX port then its AUX PDM watchdog is enabled in wcd937x_codec_enable_aux_pa(). If due to some reasons the rxmacro and soundwire machinery are delayed to start streaming data, then there is a chance for this AUX PDM watchdog to reset the wcd937x codec. Such event is not logged as a message and only wcd937x IRQ counter is increased however there could be a lot of other reasons for that IRQ. There is a similar opportunity for such delay during DAPM widgets power down sequence. If wcd937x codec reset happens on the start of the playback, then there will be no sound and if such reset happens at the end of a playback then it may generate additional clicks and pops noises. On qrb4210 RB2 board without any debugging bits the wcd937x resets are sometimes observed at the end of a playback though not always. With some debugging messages or with some tracing enabled the AUX PDM watchdog resets the wcd937x codec at the start of a playback and there is no sound output at all. In this patch: - TIMEOUT_SEL bit in PDM_WD_CTL2 register is set to increase the watchdog reset delay to 100ms which eliminates the AUX PDM watchdog IRQs on qrb4210 RB2 board completely and decreases the number of unwanted clicks noises; - HOLD_OFF bit postpones triggering such watchdog IRQ till wcd937x codec reset which usually happens at the end of a playback. This allows to actually output some sound in case of debugging. 
Cc: Adam Skladowski Cc: Mohammad Rafi Shaik Cc: Prasad Kumpatla Cc: Srinivas Kandagatla Signed-off-by: Alexey Klimov Link: https://patch.msgid.link/20241022033132.787416-3-alexey.klimov@linaro.org Signed-off-by: Mark Brown --- sound/soc/codecs/wcd937x.c | 10 ++++++++-- sound/soc/codecs/wcd937x.h | 4 ++++ 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/sound/soc/codecs/wcd937x.c b/sound/soc/codecs/wcd937x.c index 0f0d2537d322..08fb13a334a4 100644 --- a/sound/soc/codecs/wcd937x.c +++ b/sound/soc/codecs/wcd937x.c @@ -715,12 +715,17 @@ static int wcd937x_codec_enable_aux_pa(struct snd_soc_dapm_widget *w, struct snd_soc_component *component = snd_soc_dapm_to_component(w->dapm); struct wcd937x_priv *wcd937x = snd_soc_component_get_drvdata(component); int hph_mode = wcd937x->hph_mode; + u8 val; switch (event) { case SND_SOC_DAPM_PRE_PMU: + val = WCD937X_DIGITAL_PDM_WD_CTL2_EN | + WCD937X_DIGITAL_PDM_WD_CTL2_TIMEOUT_SEL | + WCD937X_DIGITAL_PDM_WD_CTL2_HOLD_OFF; snd_soc_component_update_bits(component, WCD937X_DIGITAL_PDM_WD_CTL2, - BIT(0), BIT(0)); + WCD937X_DIGITAL_PDM_WD_CTL2_MASK, + val); break; case SND_SOC_DAPM_POST_PMU: usleep_range(1000, 1010); @@ -741,7 +746,8 @@ static int wcd937x_codec_enable_aux_pa(struct snd_soc_dapm_widget *w, hph_mode); snd_soc_component_update_bits(component, WCD937X_DIGITAL_PDM_WD_CTL2, - BIT(0), 0x00); + WCD937X_DIGITAL_PDM_WD_CTL2_MASK, + 0x00); break; } diff --git a/sound/soc/codecs/wcd937x.h b/sound/soc/codecs/wcd937x.h index 35f3d48bd7dd..4afa48dcaf74 100644 --- a/sound/soc/codecs/wcd937x.h +++ b/sound/soc/codecs/wcd937x.h @@ -391,6 +391,10 @@ #define WCD937X_DIGITAL_PDM_WD_CTL0 0x3465 #define WCD937X_DIGITAL_PDM_WD_CTL1 0x3466 #define WCD937X_DIGITAL_PDM_WD_CTL2 0x3467 +#define WCD937X_DIGITAL_PDM_WD_CTL2_HOLD_OFF BIT(2) +#define WCD937X_DIGITAL_PDM_WD_CTL2_TIMEOUT_SEL BIT(1) +#define WCD937X_DIGITAL_PDM_WD_CTL2_EN BIT(0) +#define WCD937X_DIGITAL_PDM_WD_CTL2_MASK GENMASK(2, 0) #define WCD937X_DIGITAL_INTR_MODE 0x346A #define WCD937X_DIGITAL_INTR_MASK_0 0x346B #define WCD937X_DIGITAL_INTR_MASK_1 0x346C -- cgit From 25f2ff53838ccbd5ce558b5d23fac8a5d7f86655 Mon Sep 17 00:00:00 2001 From: Maarten Lankhorst Date: Thu, 5 Sep 2024 17:00:50 +0200 Subject: drm/xe: Remove runtime argument from display s/r functions The previous change ensures that pm_suspend is only called when suspending or resuming. This ensures no further bugs like those in the previous commit. Signed-off-by: Maarten Lankhorst Reviewed-by: Lucas De Marchi Reviewed-by: Vinod Govindapillai Link: https://patchwork.freedesktop.org/patch/msgid/20240905150052.174895-3-maarten.lankhorst@linux.intel.com (cherry picked from commit f90491d4b64e302e940133103d3d9908e70e454f) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/display/xe_display.c | 53 ++++++++++++++++++++------------- drivers/gpu/drm/xe/display/xe_display.h | 8 ++--- drivers/gpu/drm/xe/xe_pm.c | 6 ++-- 3 files changed, 39 insertions(+), 28 deletions(-) diff --git a/drivers/gpu/drm/xe/display/xe_display.c b/drivers/gpu/drm/xe/display/xe_display.c index 75736faf2a80..1c25c4f6a53b 100644 --- a/drivers/gpu/drm/xe/display/xe_display.c +++ b/drivers/gpu/drm/xe/display/xe_display.c @@ -309,18 +309,7 @@ static void xe_display_flush_cleanup_work(struct xe_device *xe) } /* TODO: System and runtime suspend/resume sequences will be sanitized as a follow-up. 
*/ -void xe_display_pm_runtime_suspend(struct xe_device *xe) -{ - if (!xe->info.probe_display) - return; - - if (xe->d3cold.allowed) - xe_display_pm_suspend(xe, true); - - intel_hpd_poll_enable(xe); -} - -void xe_display_pm_suspend(struct xe_device *xe, bool runtime) +static void __xe_display_pm_suspend(struct xe_device *xe, bool runtime) { struct intel_display *display = &xe->display; bool s2idle = suspend_to_idle(); @@ -355,26 +344,31 @@ void xe_display_pm_suspend(struct xe_device *xe, bool runtime) intel_dmc_suspend(xe); } -void xe_display_pm_suspend_late(struct xe_device *xe) +void xe_display_pm_suspend(struct xe_device *xe) +{ + __xe_display_pm_suspend(xe, false); +} + +void xe_display_pm_runtime_suspend(struct xe_device *xe) { - bool s2idle = suspend_to_idle(); if (!xe->info.probe_display) return; - intel_power_domains_suspend(xe, s2idle); + if (xe->d3cold.allowed) + __xe_display_pm_suspend(xe, true); - intel_display_power_suspend_late(xe); + intel_hpd_poll_enable(xe); } -void xe_display_pm_runtime_resume(struct xe_device *xe) +void xe_display_pm_suspend_late(struct xe_device *xe) { + bool s2idle = suspend_to_idle(); if (!xe->info.probe_display) return; - intel_hpd_poll_disable(xe); + intel_power_domains_suspend(xe, s2idle); - if (xe->d3cold.allowed) - xe_display_pm_resume(xe, true); + intel_display_power_suspend_late(xe); } void xe_display_pm_resume_early(struct xe_device *xe) @@ -387,7 +381,7 @@ void xe_display_pm_resume_early(struct xe_device *xe) intel_power_domains_resume(xe); } -void xe_display_pm_resume(struct xe_device *xe, bool runtime) +static void __xe_display_pm_resume(struct xe_device *xe, bool runtime) { struct intel_display *display = &xe->display; @@ -421,6 +415,23 @@ void xe_display_pm_resume(struct xe_device *xe, bool runtime) intel_power_domains_enable(xe); } +void xe_display_pm_resume(struct xe_device *xe) +{ + __xe_display_pm_resume(xe, false); +} + +void xe_display_pm_runtime_resume(struct xe_device *xe) +{ + if (!xe->info.probe_display) + return; + + intel_hpd_poll_disable(xe); + + if (xe->d3cold.allowed) + __xe_display_pm_resume(xe, true); +} + + static void display_device_remove(struct drm_device *dev, void *arg) { struct xe_device *xe = arg; diff --git a/drivers/gpu/drm/xe/display/xe_display.h b/drivers/gpu/drm/xe/display/xe_display.h index 53d727fd792b..bed55fd26f30 100644 --- a/drivers/gpu/drm/xe/display/xe_display.h +++ b/drivers/gpu/drm/xe/display/xe_display.h @@ -34,10 +34,10 @@ void xe_display_irq_enable(struct xe_device *xe, u32 gu_misc_iir); void xe_display_irq_reset(struct xe_device *xe); void xe_display_irq_postinstall(struct xe_device *xe, struct xe_gt *gt); -void xe_display_pm_suspend(struct xe_device *xe, bool runtime); +void xe_display_pm_suspend(struct xe_device *xe); void xe_display_pm_suspend_late(struct xe_device *xe); void xe_display_pm_resume_early(struct xe_device *xe); -void xe_display_pm_resume(struct xe_device *xe, bool runtime); +void xe_display_pm_resume(struct xe_device *xe); void xe_display_pm_runtime_suspend(struct xe_device *xe); void xe_display_pm_runtime_resume(struct xe_device *xe); @@ -65,10 +65,10 @@ static inline void xe_display_irq_enable(struct xe_device *xe, u32 gu_misc_iir) static inline void xe_display_irq_reset(struct xe_device *xe) {} static inline void xe_display_irq_postinstall(struct xe_device *xe, struct xe_gt *gt) {} -static inline void xe_display_pm_suspend(struct xe_device *xe, bool runtime) {} +static inline void xe_display_pm_suspend(struct xe_device *xe) {} static inline void 
xe_display_pm_suspend_late(struct xe_device *xe) {} static inline void xe_display_pm_resume_early(struct xe_device *xe) {} -static inline void xe_display_pm_resume(struct xe_device *xe, bool runtime) {} +static inline void xe_display_pm_resume(struct xe_device *xe) {} static inline void xe_display_pm_runtime_suspend(struct xe_device *xe) {} static inline void xe_display_pm_runtime_resume(struct xe_device *xe) {} diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c index 7cf2160fe040..33eb039053e4 100644 --- a/drivers/gpu/drm/xe/xe_pm.c +++ b/drivers/gpu/drm/xe/xe_pm.c @@ -123,7 +123,7 @@ int xe_pm_suspend(struct xe_device *xe) for_each_gt(gt, xe, id) xe_gt_suspend_prepare(gt); - xe_display_pm_suspend(xe, false); + xe_display_pm_suspend(xe); /* FIXME: Super racey... */ err = xe_bo_evict_all(xe); @@ -133,7 +133,7 @@ int xe_pm_suspend(struct xe_device *xe) for_each_gt(gt, xe, id) { err = xe_gt_suspend(gt); if (err) { - xe_display_pm_resume(xe, false); + xe_display_pm_resume(xe); goto err; } } @@ -187,7 +187,7 @@ int xe_pm_resume(struct xe_device *xe) for_each_gt(gt, xe, id) xe_gt_resume(gt); - xe_display_pm_resume(xe, false); + xe_display_pm_resume(xe); err = xe_bo_restore_user(xe); if (err) -- cgit From dcb6c1d071712186c213c26b245779f7859b9cec Mon Sep 17 00:00:00 2001 From: Imre Deak Date: Wed, 9 Oct 2024 22:43:57 +0300 Subject: drm/xe/display: Separate the d3cold and non-d3cold runtime PM handling For clarity separate the d3cold and non-d3cold runtime PM handling. The only change in behavior is disabling polling later during runtime resume. This shouldn't make a difference, since the poll disabling is handled from a work, which could run at any point wrt. the runtime resume handler. The work will also require a runtime PM reference, syncing it with the resume handler. Cc: Rodrigo Vivi Reviewed-by: Jonathan Cavitt Signed-off-by: Imre Deak Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-4-imre.deak@intel.com (cherry picked from commit a4de6beb83fc5adee788518350247c629568901e) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/display/xe_display.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/xe/display/xe_display.c b/drivers/gpu/drm/xe/display/xe_display.c index 1c25c4f6a53b..fa44fe802858 100644 --- a/drivers/gpu/drm/xe/display/xe_display.c +++ b/drivers/gpu/drm/xe/display/xe_display.c @@ -342,6 +342,9 @@ static void __xe_display_pm_suspend(struct xe_device *xe, bool runtime) intel_opregion_suspend(display, s2idle ? 
PCI_D1 : PCI_D3cold); intel_dmc_suspend(xe); + + if (runtime && has_display(xe)) + intel_hpd_poll_enable(xe); } void xe_display_pm_suspend(struct xe_device *xe) @@ -354,8 +357,10 @@ void xe_display_pm_runtime_suspend(struct xe_device *xe) if (!xe->info.probe_display) return; - if (xe->d3cold.allowed) + if (xe->d3cold.allowed) { __xe_display_pm_suspend(xe, true); + return; + } intel_hpd_poll_enable(xe); } @@ -405,9 +410,11 @@ static void __xe_display_pm_resume(struct xe_device *xe, bool runtime) intel_display_driver_resume(xe); drm_kms_helper_poll_enable(&xe->drm); intel_display_driver_enable_user_access(xe); - intel_hpd_poll_disable(xe); } + if (has_display(xe)) + intel_hpd_poll_disable(xe); + intel_opregion_resume(display); intel_fbdev_set_suspend(&xe->drm, FBINFO_STATE_RUNNING, false); @@ -425,10 +432,12 @@ void xe_display_pm_runtime_resume(struct xe_device *xe) if (!xe->info.probe_display) return; - intel_hpd_poll_disable(xe); - - if (xe->d3cold.allowed) + if (xe->d3cold.allowed) { __xe_display_pm_resume(xe, true); + return; + } + + intel_hpd_poll_disable(xe); } -- cgit From 6a9d2e2988fa3ef9b03ddd9ba9aaa54dc23635e6 Mon Sep 17 00:00:00 2001 From: Imre Deak Date: Wed, 9 Oct 2024 22:43:58 +0300 Subject: drm/xe/display: Add missing HPD interrupt enabling during non-d3cold RPM resume Atm the display HPD interrupts that got disabled during runtime suspend, are re-enabled only if d3cold is enabled. Fix things by also re-enabling the interrupts if d3cold is disabled. Cc: Rodrigo Vivi Reviewed-by: Jonathan Cavitt Signed-off-by: Imre Deak Link: https://patchwork.freedesktop.org/patch/msgid/20241009194358.1321200-5-imre.deak@intel.com (cherry picked from commit bbc4a30de095f0349d3c278500345a1b620d495e) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/display/xe_display.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/xe/display/xe_display.c b/drivers/gpu/drm/xe/display/xe_display.c index fa44fe802858..c6e0c8d77a70 100644 --- a/drivers/gpu/drm/xe/display/xe_display.c +++ b/drivers/gpu/drm/xe/display/xe_display.c @@ -437,6 +437,7 @@ void xe_display_pm_runtime_resume(struct xe_device *xe) return; } + intel_hpd_init(xe); intel_hpd_poll_disable(xe); } -- cgit From 338c4d3902feb5be49bfda530a72c7ab860e2c9f Mon Sep 17 00:00:00 2001 From: Wander Lairson Costa Date: Mon, 21 Oct 2024 16:26:24 -0700 Subject: igb: Disable threaded IRQ for igb_msix_other During testing of SR-IOV, Red Hat QE encountered an issue where the ip link up command intermittently fails for the igbvf interfaces when using the PREEMPT_RT variant. Investigation revealed that e1000_write_posted_mbx returns an error due to the lack of an ACK from e1000_poll_for_ack. The underlying issue arises from the fact that IRQs are threaded by default under PREEMPT_RT. While the exact hardware details are not available, it appears that the IRQ handled by igb_msix_other must be processed before e1000_poll_for_ack times out. However, e1000_write_posted_mbx is called with preemption disabled, leading to a scenario where the IRQ is serviced only after the failure of e1000_write_posted_mbx. To resolve this, we set IRQF_NO_THREAD for the affected interrupt, ensuring that the kernel handles it immediately, thereby preventing the aforementioned error. 
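As a general illustration (not igb-specific code), the pattern below shows what the one-line change amounts to: an interrupt requested with IRQF_NO_THREAD keeps its handler in hard-IRQ context even when PREEMPT_RT or the threadirqs option force-threads interrupts, so it cannot be starved by a preemption-disabled poll loop. Such a handler must stay short and avoid sleeping primitives. The names below are placeholders.

#include <linux/interrupt.h>

static irqreturn_t example_other_handler(int irq, void *data)
{
	/* runs in hard-IRQ context: keep it short, no sleeping locks */
	return IRQ_HANDLED;
}

static int example_request_other_irq(unsigned int irq, void *priv)
{
	/* IRQF_NO_THREAD: never defer this handler to an irq thread */
	return request_irq(irq, example_other_handler, IRQF_NO_THREAD,
			   "example-other", priv);
}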
Reproducer: #!/bin/bash # echo 2 > /sys/class/net/ens14f0/device/sriov_numvfs ipaddr_vlan=3 nic_test=ens14f0 vf=${nic_test}v0 while true; do ip link set ${nic_test} mtu 1500 ip link set ${vf} mtu 1500 ip link set $vf up ip link set ${nic_test} vf 0 vlan ${ipaddr_vlan} ip addr add 172.30.${ipaddr_vlan}.1/24 dev ${vf} ip addr add 2021:db8:${ipaddr_vlan}::1/64 dev ${vf} if ! ip link show $vf | grep 'state UP'; then echo 'Error found' break fi ip link set $vf down done Signed-off-by: Wander Lairson Costa Fixes: 9d5c824399de ("igb: PCI-Express 82575 Gigabit Ethernet driver") Reported-by: Yuying Ma Reviewed-by: Przemek Kitszel Tested-by: Rafal Romanowski Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index f1d088168723..b83df5f94b1f 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -907,7 +907,7 @@ static int igb_request_msix(struct igb_adapter *adapter) int i, err = 0, vector = 0, free_vector = 0; err = request_irq(adapter->msix_entries[vector].vector, - igb_msix_other, 0, netdev->name, adapter); + igb_msix_other, IRQF_NO_THREAD, netdev->name, adapter); if (err) goto err_out; -- cgit From 3e13a8c0a5263827380c5090d822a92cb13767dd Mon Sep 17 00:00:00 2001 From: Michal Swiatkowski Date: Mon, 21 Oct 2024 16:26:25 -0700 Subject: ice: block SF port creation in legacy mode There is no support for SF in legacy mode. Reflect it in the code. Reviewed-by: Przemek Kitszel Fixes: eda69d654c7e ("ice: add basic devlink subfunctions support") Signed-off-by: Michal Swiatkowski Reviewed-by: Kalesh AP Tested-by: Rafal Romanowski Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/intel/ice/devlink/devlink_port.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/devlink/devlink_port.c b/drivers/net/ethernet/intel/ice/devlink/devlink_port.c index 928c8bdb6649..c6779d9dffff 100644 --- a/drivers/net/ethernet/intel/ice/devlink/devlink_port.c +++ b/drivers/net/ethernet/intel/ice/devlink/devlink_port.c @@ -989,5 +989,11 @@ ice_devlink_port_new(struct devlink *devlink, if (err) return err; + if (!ice_is_eswitch_mode_switchdev(pf)) { + NL_SET_ERR_MSG_MOD(extack, + "SF ports are only supported in eswitch switchdev mode"); + return -EOPNOTSUPP; + } + return ice_alloc_dynamic_port(pf, new_attr, extack, devlink_port); } -- cgit From 6e58c33106220c6c0c8fbee9ab63eae76ad8f260 Mon Sep 17 00:00:00 2001 From: Arkadiusz Kubalewski Date: Mon, 21 Oct 2024 16:26:26 -0700 Subject: ice: fix crash on probe for DPLL enabled E810 LOM The E810 Lan On Motherboard (LOM) design is vendor specific. Intel provides the reference design, but it is up to vendor on the final product design. For some cases, like Linux DPLL support, the static values defined in the driver does not reflect the actual LOM design. Current implementation of dpll pins is causing the crash on probe of the ice driver for such DPLL enabled E810 LOM designs: WARNING: (...) at drivers/dpll/dpll_core.c:495 dpll_pin_get+0x2c4/0x330 ... Call Trace: ? __warn+0x83/0x130 ? dpll_pin_get+0x2c4/0x330 ? report_bug+0x1b7/0x1d0 ? handle_bug+0x42/0x70 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? dpll_pin_get+0x117/0x330 ? dpll_pin_get+0x2c4/0x330 ? 
dpll_pin_get+0x117/0x330 ice_dpll_get_pins.isra.0+0x52/0xe0 [ice] ... The number of dpll pins enabled by LOM vendor is greater than expected and defined in the driver for Intel designed NICs, which causes the crash. Prevent the crash and allow generic pin initialization within Linux DPLL subsystem for DPLL enabled E810 LOM designs. Newly designed solution for described issue will be based on "per HW design" pin initialization. It requires pin information dynamically acquired from the firmware and is already in progress, planned for next-tree only. Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Reviewed-by: Karol Kolacinski Signed-off-by: Arkadiusz Kubalewski Tested-by: Pucha Himasekhar Reddy Signed-off-by: Jacob Keller Reviewed-by: Simon Horman Signed-off-by: Paolo Abeni --- drivers/net/ethernet/intel/ice/ice_dpll.c | 70 +++++++++++++++++++++++++++++ drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 21 ++++++++- drivers/net/ethernet/intel/ice/ice_ptp_hw.h | 1 + 3 files changed, 90 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_dpll.c b/drivers/net/ethernet/intel/ice/ice_dpll.c index 74c0e7319a4c..d5ad6d84007c 100644 --- a/drivers/net/ethernet/intel/ice/ice_dpll.c +++ b/drivers/net/ethernet/intel/ice/ice_dpll.c @@ -10,6 +10,7 @@ #define ICE_DPLL_PIN_IDX_INVALID 0xff #define ICE_DPLL_RCLK_NUM_PER_PF 1 #define ICE_DPLL_PIN_ESYNC_PULSE_HIGH_PERCENT 25 +#define ICE_DPLL_PIN_GEN_RCLK_FREQ 1953125 /** * enum ice_dpll_pin_type - enumerate ice pin types: @@ -2063,6 +2064,73 @@ static int ice_dpll_init_worker(struct ice_pf *pf) return 0; } +/** + * ice_dpll_init_info_pins_generic - initializes generic pins info + * @pf: board private structure + * @input: if input pins initialized + * + * Init information for generic pins, cache them in PF's pins structures. 
+ * + * Return: + * * 0 - success + * * negative - init failure reason + */ +static int ice_dpll_init_info_pins_generic(struct ice_pf *pf, bool input) +{ + struct ice_dpll *de = &pf->dplls.eec, *dp = &pf->dplls.pps; + static const char labels[][sizeof("99")] = { + "0", "1", "2", "3", "4", "5", "6", "7", "8", + "9", "10", "11", "12", "13", "14", "15" }; + u32 cap = DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE; + enum ice_dpll_pin_type pin_type; + int i, pin_num, ret = -EINVAL; + struct ice_dpll_pin *pins; + u32 phase_adj_max; + + if (input) { + pin_num = pf->dplls.num_inputs; + pins = pf->dplls.inputs; + phase_adj_max = pf->dplls.input_phase_adj_max; + pin_type = ICE_DPLL_PIN_TYPE_INPUT; + cap |= DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE; + } else { + pin_num = pf->dplls.num_outputs; + pins = pf->dplls.outputs; + phase_adj_max = pf->dplls.output_phase_adj_max; + pin_type = ICE_DPLL_PIN_TYPE_OUTPUT; + } + if (pin_num > ARRAY_SIZE(labels)) + return ret; + + for (i = 0; i < pin_num; i++) { + pins[i].idx = i; + pins[i].prop.board_label = labels[i]; + pins[i].prop.phase_range.min = phase_adj_max; + pins[i].prop.phase_range.max = -phase_adj_max; + pins[i].prop.capabilities = cap; + pins[i].pf = pf; + ret = ice_dpll_pin_state_update(pf, &pins[i], pin_type, NULL); + if (ret) + break; + if (input && pins[i].freq == ICE_DPLL_PIN_GEN_RCLK_FREQ) + pins[i].prop.type = DPLL_PIN_TYPE_MUX; + else + pins[i].prop.type = DPLL_PIN_TYPE_EXT; + if (!input) + continue; + ret = ice_aq_get_cgu_ref_prio(&pf->hw, de->dpll_idx, i, + &de->input_prio[i]); + if (ret) + break; + ret = ice_aq_get_cgu_ref_prio(&pf->hw, dp->dpll_idx, i, + &dp->input_prio[i]); + if (ret) + break; + } + + return ret; +} + /** * ice_dpll_init_info_direct_pins - initializes direct pins info * @pf: board private structure @@ -2101,6 +2169,8 @@ ice_dpll_init_info_direct_pins(struct ice_pf *pf, default: return -EINVAL; } + if (num_pins != ice_cgu_get_num_pins(hw, input)) + return ice_dpll_init_info_pins_generic(pf, input); for (i = 0; i < num_pins; i++) { caps = 0; diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index 3a33e6b9b313..ec8db830ac73 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -34,7 +34,6 @@ static const struct ice_cgu_pin_desc ice_e810t_sfp_cgu_inputs[] = { ARRAY_SIZE(ice_cgu_pin_freq_common), ice_cgu_pin_freq_common }, { "GNSS-1PPS", ZL_REF4P, DPLL_PIN_TYPE_GNSS, ARRAY_SIZE(ice_cgu_pin_freq_1_hz), ice_cgu_pin_freq_1_hz }, - { "OCXO", ZL_REF4N, DPLL_PIN_TYPE_INT_OSCILLATOR, 0, }, }; static const struct ice_cgu_pin_desc ice_e810t_qsfp_cgu_inputs[] = { @@ -52,7 +51,6 @@ static const struct ice_cgu_pin_desc ice_e810t_qsfp_cgu_inputs[] = { ARRAY_SIZE(ice_cgu_pin_freq_common), ice_cgu_pin_freq_common }, { "GNSS-1PPS", ZL_REF4P, DPLL_PIN_TYPE_GNSS, ARRAY_SIZE(ice_cgu_pin_freq_1_hz), ice_cgu_pin_freq_1_hz }, - { "OCXO", ZL_REF4N, DPLL_PIN_TYPE_INT_OSCILLATOR, }, }; static const struct ice_cgu_pin_desc ice_e810t_sfp_cgu_outputs[] = { @@ -5964,6 +5962,25 @@ ice_cgu_get_pin_desc(struct ice_hw *hw, bool input, int *size) return t; } +/** + * ice_cgu_get_num_pins - get pin description array size + * @hw: pointer to the hw struct + * @input: if request is done against input or output pins + * + * Return: size of pin description array for given hw. 
+ */ +int ice_cgu_get_num_pins(struct ice_hw *hw, bool input) +{ + const struct ice_cgu_pin_desc *t; + int size; + + t = ice_cgu_get_pin_desc(hw, input, &size); + if (t) + return size; + + return 0; +} + /** * ice_cgu_get_pin_type - get pin's type * @hw: pointer to the hw struct diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h index 0852a34ade91..6cedc1a906af 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h @@ -404,6 +404,7 @@ int ice_read_sma_ctrl_e810t(struct ice_hw *hw, u8 *data); int ice_write_sma_ctrl_e810t(struct ice_hw *hw, u8 data); int ice_read_pca9575_reg_e810t(struct ice_hw *hw, u8 offset, u8 *data); bool ice_is_pca9575_present(struct ice_hw *hw); +int ice_cgu_get_num_pins(struct ice_hw *hw, bool input); enum dpll_pin_type ice_cgu_get_pin_type(struct ice_hw *hw, u8 pin, bool input); struct dpll_pin_frequency * ice_cgu_get_pin_freq_supp(struct ice_hw *hw, u8 pin, bool input, u8 *num); -- cgit From f3c3ccc4fe49dbc560b01d16bebd1b116c46c2b4 Mon Sep 17 00:00:00 2001 From: Jason Gunthorpe Date: Wed, 16 Oct 2024 20:52:33 -0300 Subject: PCI: Fix pci_enable_acs() support for the ACS quirks There are ACS quirks that hijack the normal ACS processing and deliver to to special quirk code. The enable path needs to call pci_dev_specific_enable_acs() and then pci_dev_specific_acs_enabled() will report the hidden ACS state controlled by the quirk. The recent rework got this out of order and we should try to call pci_dev_specific_enable_acs() regardless of any actual ACS support in the device. As before command line parameters that effect standard PCI ACS don't interact with the quirk versions, including the new config_acs= option. Link: https://lore.kernel.org/r/0-v1-f96b686c625b+124-pci_acs_quirk_fix_jgg@nvidia.com Fixes: 47c8846a49ba ("PCI: Extend ACS configurability") Reported-by: Jiri Slaby Closes: https://lore.kernel.org/all/e89107da-ac99-4d3a-9527-a4df9986e120@kernel.org Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229019 Tested-by: Steffen Dirkwinkel Signed-off-by: Jason Gunthorpe Signed-off-by: Bjorn Helgaas --- drivers/pci/pci.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 7d85c04fbba2..225a6cd2e9ca 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1067,8 +1067,15 @@ static void pci_std_enable_acs(struct pci_dev *dev, struct pci_acs *caps) static void pci_enable_acs(struct pci_dev *dev) { struct pci_acs caps; + bool enable_acs = false; int pos; + /* If an iommu is present we start with kernel default caps */ + if (pci_acs_enable) { + if (pci_dev_specific_enable_acs(dev)) + enable_acs = true; + } + pos = dev->acs_cap; if (!pos) return; @@ -1077,11 +1084,8 @@ static void pci_enable_acs(struct pci_dev *dev) pci_read_config_word(dev, pos + PCI_ACS_CTRL, &caps.ctrl); caps.fw_ctrl = caps.ctrl; - /* If an iommu is present we start with kernel default caps */ - if (pci_acs_enable) { - if (pci_dev_specific_enable_acs(dev)) - pci_std_enable_acs(dev, &caps); - } + if (enable_acs) + pci_std_enable_acs(dev, &caps); /* * Always apply caps from the command line, even if there is no iommu. 
-- cgit From fce9642c765a18abd1db0339a7d832c29b68456a Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Tue, 29 Oct 2024 09:23:20 +0000 Subject: x86/amd_nb: Fix compile-testing without CONFIG_AMD_NB MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit node_to_amd_nb() is defined to NULL in non-AMD configs: drivers/platform/x86/amd/hsmp/plat.c: In function 'init_platform_device': drivers/platform/x86/amd/hsmp/plat.c:165:68: error: dereferencing 'void *' pointer [-Werror] 165 | sock->root = node_to_amd_nb(i)->root; | ^~ drivers/platform/x86/amd/hsmp/plat.c:165:68: error: request for member 'root' in something not a structure or union Users of the interface who also allow COMPILE_TEST will cause the above build error so provide an inline stub to fix that. [ bp: Massage commit message. ] Signed-off-by: Arnd Bergmann Signed-off-by: Borislav Petkov (AMD) Reviewed-by: Ilpo Järvinen Link: https://lore.kernel.org/r/20241029092329.3857004-1-arnd@kernel.org --- arch/x86/include/asm/amd_nb.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h index 6f3b6aef47ba..d0caac26533f 100644 --- a/arch/x86/include/asm/amd_nb.h +++ b/arch/x86/include/asm/amd_nb.h @@ -116,7 +116,10 @@ static inline bool amd_gart_present(void) #define amd_nb_num(x) 0 #define amd_nb_has_feature(x) false -#define node_to_amd_nb(x) NULL +static inline struct amd_northbridge *node_to_amd_nb(int node) +{ + return NULL; +} #define amd_gart_present(x) false #endif -- cgit From a32aee8f0d987a7cba7fcc28002553361a392048 Mon Sep 17 00:00:00 2001 From: Jiayuan Chen Date: Mon, 28 Oct 2024 14:52:26 +0800 Subject: bpf: fix filed access without lock The tcp_bpf_recvmsg_parser() function, running in user context, retrieves seq_copied from tcp_sk without holding the socket lock, and stores it in a local variable seq. However, the softirq context can modify tcp_sk->seq_copied concurrently, for example, n tcp_read_sock(). As a result, the seq value is stale when it is assigned back to tcp_sk->copied_seq at the end of tcp_bpf_recvmsg_parser(), leading to incorrect behavior. Due to concurrency, the copied_seq field in tcp_bpf_recvmsg_parser() might be set to an incorrect value (less than the actual copied_seq) at the end of function: 'WRITE_ONCE(tcp->copied_seq, seq)'. This causes the 'offset' to be negative in tcp_read_sock()->tcp_recv_skb() when processing new incoming packets (sk->copied_seq - skb->seq becomes less than 0), and all subsequent packets will be dropped. 
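A minimal sketch of the locking rule the fix enforces is shown below (identifiers simplified; this is not the tcp_bpf code itself): the copied_seq snapshot is only trustworthy if it is taken after lock_sock(), because other contexts such as tcp_read_sock() may advance it until the lock is owned, and writing back a stale snapshot would move copied_seq backwards.

#include <net/sock.h>
#include <linux/tcp.h>

static void example_consume(struct sock *sk)
{
	struct tcp_sock *tp;
	u32 seq;

	lock_sock(sk);
	tp = tcp_sk(sk);
	seq = tp->copied_seq;	/* snapshot only once the lock is held */

	/* ... copy data to the caller, advancing 'seq' locally ... */

	/* the value written back can no longer be stale */
	WRITE_ONCE(tp->copied_seq, seq);
	release_sock(sk);
}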
Signed-off-by: Jiayuan Chen Link: https://lore.kernel.org/r/20241028065226.35568-1-mrpre@163.com Signed-off-by: Martin KaFai Lau --- net/ipv4/tcp_bpf.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index e7658c5d6b79..370993c03d31 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -221,11 +221,11 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, int flags, int *addr_len) { - struct tcp_sock *tcp = tcp_sk(sk); int peek = flags & MSG_PEEK; - u32 seq = tcp->copied_seq; struct sk_psock *psock; + struct tcp_sock *tcp; int copied = 0; + u32 seq; if (unlikely(flags & MSG_ERRQUEUE)) return inet_recv_error(sk, msg, len, addr_len); @@ -238,7 +238,8 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, return tcp_recvmsg(sk, msg, len, flags, addr_len); lock_sock(sk); - + tcp = tcp_sk(sk); + seq = tcp->copied_seq; /* We may have received data on the sk_receive_queue pre-accept and * then we can not use read_skb in this context because we haven't * assigned a sk_socket yet so have no link to the ops. The work-around -- cgit From 2e8a1acea8597ff42189ea94f0a63fa58640223d Mon Sep 17 00:00:00 2001 From: Kevin Brodsky Date: Tue, 29 Oct 2024 14:45:35 +0000 Subject: arm64: signal: Improve POR_EL0 handling to avoid uaccess failures Reset POR_EL0 to "allow all" before writing the signal frame, preventing spurious uaccess failures. When POE is supported, the POR_EL0 register constrains memory accesses based on the target page's POIndex (pkey). This raises the question: what constraints should apply to a signal handler? The current answer is that POR_EL0 is reset to POR_EL0_INIT when invoking the handler, giving it full access to POIndex 0. This is in line with x86's MPK support and remains unchanged. This is only part of the story, though. POR_EL0 constrains all unprivileged memory accesses, meaning that uaccess routines such as put_user() are also impacted. As a result POR_EL0 may prevent the signal frame from being written to the signal stack (ultimately causing a SIGSEGV). This is especially concerning when an alternate signal stack is used, because userspace may want to prevent access to it outside of signal handlers. There is currently no provision for that: POR_EL0 is reset after writing to the stack, and POR_EL0_INIT only enables access to POIndex 0. This patch ensures that POR_EL0 is reset to its most permissive state before the signal stack is accessed. Once the signal frame has been fully written, POR_EL0 is still set to POR_EL0_INIT - it is up to the signal handler to enable access to additional pkeys if needed. As to sigreturn(), it expects having access to the stack like any other syscall; we only need to ensure that POR_EL0 is restored from the signal frame after all uaccess calls. This approach is in line with the recent x86/pkeys series [1]. Resetting POR_EL0 early introduces some complications, in that we can no longer read the register directly in preserve_poe_context(). This is addressed by introducing a struct (user_access_state) and helpers to manage any such register impacting user accesses (uaccess and accesses in userspace). Things look like this on signal delivery: 1. Save original POR_EL0 into struct [save_reset_user_access_state()] 2. Set POR_EL0 to "allow all" [save_reset_user_access_state()] 3. Create signal frame 4. Write saved POR_EL0 value to the signal frame [preserve_poe_context()] 5. Finalise signal frame 6. If all operations succeeded: a. 
Set POR_EL0 to POR_EL0_INIT [set_handler_user_access_state()] b. Else reset POR_EL0 to its original value [restore_user_access_state()] If any step fails when setting up the signal frame, the process will be sent a SIGSEGV, which it may be able to handle. Step 6.b ensures that the original POR_EL0 is saved in the signal frame when delivering that SIGSEGV (so that the original value is restored by sigreturn). The return path (sys_rt_sigreturn) doesn't strictly require any change since restore_poe_context() is already called last. However, to avoid uaccess calls being accidentally added after that point, we use the same approach as in the delivery path, i.e. separating uaccess from writing to the register: 1. Read saved POR_EL0 value from the signal frame [restore_poe_context()] 2. Set POR_EL0 to the saved value [restore_user_access_state()] [1] https://lore.kernel.org/lkml/20240802061318.2140081-1-aruna.ramakrishna@oracle.com/ Fixes: 9160f7e909e1 ("arm64: add POE signal support") Reviewed-by: Catalin Marinas Signed-off-by: Kevin Brodsky Link: https://lore.kernel.org/r/20241029144539.111155-2-kevin.brodsky@arm.com Signed-off-by: Will Deacon --- arch/arm64/kernel/signal.c | 92 +++++++++++++++++++++++++++++++++++++++------- 1 file changed, 78 insertions(+), 14 deletions(-) diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 561986947530..c7d311d8b92a 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -66,10 +67,63 @@ struct rt_sigframe_user_layout { unsigned long end_offset; }; +/* + * Holds any EL0-controlled state that influences unprivileged memory accesses. + * This includes both accesses done in userspace and uaccess done in the kernel. + * + * This state needs to be carefully managed to ensure that it doesn't cause + * uaccess to fail when setting up the signal frame, and the signal handler + * itself also expects a well-defined state when entered. + */ +struct user_access_state { + u64 por_el0; +}; + #define BASE_SIGFRAME_SIZE round_up(sizeof(struct rt_sigframe), 16) #define TERMINATOR_SIZE round_up(sizeof(struct _aarch64_ctx), 16) #define EXTRA_CONTEXT_SIZE round_up(sizeof(struct extra_context), 16) +/* + * Save the user access state into ua_state and reset it to disable any + * restrictions. + */ +static void save_reset_user_access_state(struct user_access_state *ua_state) +{ + if (system_supports_poe()) { + u64 por_enable_all = 0; + + for (int pkey = 0; pkey < arch_max_pkey(); pkey++) + por_enable_all |= POE_RXW << (pkey * POR_BITS_PER_PKEY); + + ua_state->por_el0 = read_sysreg_s(SYS_POR_EL0); + write_sysreg_s(por_enable_all, SYS_POR_EL0); + /* Ensure that any subsequent uaccess observes the updated value */ + isb(); + } +} + +/* + * Set the user access state for invoking the signal handler. + * + * No uaccess should be done after that function is called. + */ +static void set_handler_user_access_state(void) +{ + if (system_supports_poe()) + write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0); +} + +/* + * Restore the user access state to the values saved in ua_state. + * + * No uaccess should be done after that function is called. 
+ */ +static void restore_user_access_state(const struct user_access_state *ua_state) +{ + if (system_supports_poe()) + write_sysreg_s(ua_state->por_el0, SYS_POR_EL0); +} + static void init_user_layout(struct rt_sigframe_user_layout *user) { const size_t reserved_size = @@ -261,18 +315,20 @@ static int restore_fpmr_context(struct user_ctxs *user) return err; } -static int preserve_poe_context(struct poe_context __user *ctx) +static int preserve_poe_context(struct poe_context __user *ctx, + const struct user_access_state *ua_state) { int err = 0; __put_user_error(POE_MAGIC, &ctx->head.magic, err); __put_user_error(sizeof(*ctx), &ctx->head.size, err); - __put_user_error(read_sysreg_s(SYS_POR_EL0), &ctx->por_el0, err); + __put_user_error(ua_state->por_el0, &ctx->por_el0, err); return err; } -static int restore_poe_context(struct user_ctxs *user) +static int restore_poe_context(struct user_ctxs *user, + struct user_access_state *ua_state) { u64 por_el0; int err = 0; @@ -282,7 +338,7 @@ static int restore_poe_context(struct user_ctxs *user) __get_user_error(por_el0, &(user->poe->por_el0), err); if (!err) - write_sysreg_s(por_el0, SYS_POR_EL0); + ua_state->por_el0 = por_el0; return err; } @@ -850,7 +906,8 @@ invalid: } static int restore_sigframe(struct pt_regs *regs, - struct rt_sigframe __user *sf) + struct rt_sigframe __user *sf, + struct user_access_state *ua_state) { sigset_t set; int i, err; @@ -899,7 +956,7 @@ static int restore_sigframe(struct pt_regs *regs, err = restore_zt_context(&user); if (err == 0 && system_supports_poe() && user.poe) - err = restore_poe_context(&user); + err = restore_poe_context(&user, ua_state); return err; } @@ -908,6 +965,7 @@ SYSCALL_DEFINE0(rt_sigreturn) { struct pt_regs *regs = current_pt_regs(); struct rt_sigframe __user *frame; + struct user_access_state ua_state; /* Always make any pending restarted system calls return -EINTR */ current->restart_block.fn = do_no_restart_syscall; @@ -924,12 +982,14 @@ SYSCALL_DEFINE0(rt_sigreturn) if (!access_ok(frame, sizeof (*frame))) goto badframe; - if (restore_sigframe(regs, frame)) + if (restore_sigframe(regs, frame, &ua_state)) goto badframe; if (restore_altstack(&frame->uc.uc_stack)) goto badframe; + restore_user_access_state(&ua_state); + return regs->regs[0]; badframe: @@ -1035,7 +1095,8 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user, } static int setup_sigframe(struct rt_sigframe_user_layout *user, - struct pt_regs *regs, sigset_t *set) + struct pt_regs *regs, sigset_t *set, + const struct user_access_state *ua_state) { int i, err = 0; struct rt_sigframe __user *sf = user->sigframe; @@ -1097,10 +1158,9 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user, struct poe_context __user *poe_ctx = apply_user_offset(user, user->poe_offset); - err |= preserve_poe_context(poe_ctx); + err |= preserve_poe_context(poe_ctx, ua_state); } - /* ZA state if present */ if (system_supports_sme() && err == 0 && user->za_offset) { struct za_context __user *za_ctx = @@ -1237,9 +1297,6 @@ static void setup_return(struct pt_regs *regs, struct k_sigaction *ka, sme_smstop(); } - if (system_supports_poe()) - write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0); - if (ka->sa.sa_flags & SA_RESTORER) sigtramp = ka->sa.sa_restorer; else @@ -1253,6 +1310,7 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set, { struct rt_sigframe_user_layout user; struct rt_sigframe __user *frame; + struct user_access_state ua_state; int err = 0; fpsimd_signal_preserve_current_state(); @@ -1260,13 +1318,14 @@ 
static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set, if (get_sigframe(&user, ksig, regs)) return 1; + save_reset_user_access_state(&ua_state); frame = user.sigframe; __put_user_error(0, &frame->uc.uc_flags, err); __put_user_error(NULL, &frame->uc.uc_link, err); err |= __save_altstack(&frame->uc.uc_stack, regs->sp); - err |= setup_sigframe(&user, regs, set); + err |= setup_sigframe(&user, regs, set, &ua_state); if (err == 0) { setup_return(regs, &ksig->ka, &user, usig); if (ksig->ka.sa.sa_flags & SA_SIGINFO) { @@ -1276,6 +1335,11 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set, } } + if (err == 0) + set_handler_user_access_state(); + else + restore_user_access_state(&ua_state); + return err; } -- cgit From ad4a3ca6a8e886f6491910a3ae5d53595e40597d Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Tue, 22 Oct 2024 09:38:22 +0300 Subject: ipv4: ip_tunnel: Fix suspicious RCU usage warning in ip_tunnel_init_flow() There are code paths from which the function is called without holding the RCU read lock, resulting in a suspicious RCU usage warning [1]. Fix by using l3mdev_master_upper_ifindex_by_index() which will acquire the RCU read lock before calling l3mdev_master_upper_ifindex_by_index_rcu(). [1] WARNING: suspicious RCU usage 6.12.0-rc3-custom-gac8f72681cf2 #141 Not tainted ----------------------------- net/core/dev.c:876 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ip/361: #0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60 stack backtrace: CPU: 3 UID: 0 PID: 361 Comm: ip Not tainted 6.12.0-rc3-custom-gac8f72681cf2 #141 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack_lvl+0xba/0x110 lockdep_rcu_suspicious.cold+0x4f/0xd6 dev_get_by_index_rcu+0x1d3/0x210 l3mdev_master_upper_ifindex_by_index_rcu+0x2b/0xf0 ip_tunnel_bind_dev+0x72f/0xa00 ip_tunnel_newlink+0x368/0x7a0 ipgre_newlink+0x14c/0x170 __rtnl_newlink+0x1173/0x19c0 rtnl_newlink+0x6c/0xa0 rtnetlink_rcv_msg+0x3cc/0xf60 netlink_rcv_skb+0x171/0x450 netlink_unicast+0x539/0x7f0 netlink_sendmsg+0x8c1/0xd80 ____sys_sendmsg+0x8f9/0xc20 ___sys_sendmsg+0x197/0x1e0 __sys_sendmsg+0x122/0x1f0 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: db53cd3d88dc ("net: Handle l3mdev in ip_tunnel_init_flow") Signed-off-by: Ido Schimmel Reviewed-by: David Ahern Link: https://patch.msgid.link/20241022063822.462057-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski --- include/net/ip_tunnels.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index 6194fbb564c6..6a070478254d 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -354,7 +354,7 @@ static inline void ip_tunnel_init_flow(struct flowi4 *fl4, memset(fl4, 0, sizeof(*fl4)); if (oif) { - fl4->flowi4_l3mdev = l3mdev_master_upper_ifindex_by_index_rcu(net, oif); + fl4->flowi4_l3mdev = l3mdev_master_upper_ifindex_by_index(net, oif); /* Legacy VRF/l3mdev use case */ fl4->flowi4_oif = fl4->flowi4_l3mdev ? 0 : oif; } -- cgit From 90e0569dd3d32f4f4d2ca691d3fa5a8a14a13c12 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Wed, 23 Oct 2024 15:30:09 +0300 Subject: ipv4: ip_tunnel: Fix suspicious RCU usage warning in ip_tunnel_find() The per-netns IP tunnel hash table is protected by the RTNL mutex and ip_tunnel_find() is only called from the control path where the mutex is taken. 
Add a lockdep expression to hlist_for_each_entry_rcu() in ip_tunnel_find() in order to validate that the mutex is held and to silence the suspicious RCU usage warning [1]. [1] WARNING: suspicious RCU usage 6.12.0-rc3-custom-gd95d9a31aceb #139 Not tainted ----------------------------- net/ipv4/ip_tunnel.c:221 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by ip/362: #0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60 stack backtrace: CPU: 12 UID: 0 PID: 362 Comm: ip Not tainted 6.12.0-rc3-custom-gd95d9a31aceb #139 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack_lvl+0xba/0x110 lockdep_rcu_suspicious.cold+0x4f/0xd6 ip_tunnel_find+0x435/0x4d0 ip_tunnel_newlink+0x517/0x7a0 ipgre_newlink+0x14c/0x170 __rtnl_newlink+0x1173/0x19c0 rtnl_newlink+0x6c/0xa0 rtnetlink_rcv_msg+0x3cc/0xf60 netlink_rcv_skb+0x171/0x450 netlink_unicast+0x539/0x7f0 netlink_sendmsg+0x8c1/0xd80 ____sys_sendmsg+0x8f9/0xc20 ___sys_sendmsg+0x197/0x1e0 __sys_sendmsg+0x122/0x1f0 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Suggested-by: Eric Dumazet Signed-off-by: Ido Schimmel Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20241023123009.749764-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski --- net/ipv4/ip_tunnel.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index d591c73e2c0e..25505f9b724c 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -218,7 +218,7 @@ static struct ip_tunnel *ip_tunnel_find(struct ip_tunnel_net *itn, ip_tunnel_flags_copy(flags, parms->i_flags); - hlist_for_each_entry_rcu(t, head, hash_node) { + hlist_for_each_entry_rcu(t, head, hash_node, lockdep_rtnl_is_held()) { if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr && link == READ_ONCE(t->parms.link) && -- cgit From 01e215975fd80af81b5b79f009d49ddd35976c13 Mon Sep 17 00:00:00 2001 From: Matt Johnston Date: Tue, 22 Oct 2024 18:25:14 +0800 Subject: mctp i2c: handle NULL header address daddr can be NULL if there is no neighbour table entry present, in that case the tx packet should be dropped. saddr will usually be set by MCTP core, but check for NULL in case a packet is transmitted by a different protocol. Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver") Cc: stable@vger.kernel.org Reported-by: Dung Cao Signed-off-by: Matt Johnston Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241022-mctp-i2c-null-dest-v3-1-e929709956c5@codeconstruct.com.au Signed-off-by: Jakub Kicinski --- drivers/net/mctp/mctp-i2c.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/mctp/mctp-i2c.c b/drivers/net/mctp/mctp-i2c.c index 4dc057c121f5..e70fb6687994 100644 --- a/drivers/net/mctp/mctp-i2c.c +++ b/drivers/net/mctp/mctp-i2c.c @@ -588,6 +588,9 @@ static int mctp_i2c_header_create(struct sk_buff *skb, struct net_device *dev, if (len > MCTP_I2C_MAXMTU) return -EMSGSIZE; + if (!daddr || !saddr) + return -EINVAL; + lldst = *((u8 *)daddr); llsrc = *((u8 *)saddr); -- cgit From 7515e37bce5c428a56a9b04ea7e96b3f53f17150 Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Tue, 22 Oct 2024 16:48:25 +0200 Subject: gtp: allow -1 to be specified as file description from userspace Existing user space applications maintained by the Osmocom project are breaking since a recent fix that addresses incorrect error checking. 
Restore operation for user space programs that specify -1 as file descriptor to skip GTPv0 or GTPv1 only sockets. Fixes: defd8b3c37b0 ("gtp: fix a potential NULL pointer dereference") Reported-by: Pau Espin Pedrol Signed-off-by: Pablo Neira Ayuso Tested-by: Oliver Smith Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241022144825.66740-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski --- drivers/net/gtp.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index a60bfb1abb7f..70f981887518 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -1702,20 +1702,24 @@ static int gtp_encap_enable(struct gtp_dev *gtp, struct nlattr *data[]) return -EINVAL; if (data[IFLA_GTP_FD0]) { - u32 fd0 = nla_get_u32(data[IFLA_GTP_FD0]); + int fd0 = nla_get_u32(data[IFLA_GTP_FD0]); - sk0 = gtp_encap_enable_socket(fd0, UDP_ENCAP_GTP0, gtp); - if (IS_ERR(sk0)) - return PTR_ERR(sk0); + if (fd0 >= 0) { + sk0 = gtp_encap_enable_socket(fd0, UDP_ENCAP_GTP0, gtp); + if (IS_ERR(sk0)) + return PTR_ERR(sk0); + } } if (data[IFLA_GTP_FD1]) { - u32 fd1 = nla_get_u32(data[IFLA_GTP_FD1]); + int fd1 = nla_get_u32(data[IFLA_GTP_FD1]); - sk1u = gtp_encap_enable_socket(fd1, UDP_ENCAP_GTP1U, gtp); - if (IS_ERR(sk1u)) { - gtp_encap_disable_sock(sk0); - return PTR_ERR(sk1u); + if (fd1 >= 0) { + sk1u = gtp_encap_enable_socket(fd1, UDP_ENCAP_GTP1U, gtp); + if (IS_ERR(sk1u)) { + gtp_encap_disable_sock(sk0); + return PTR_ERR(sk1u); + } } } -- cgit From c59d72d0a4fbaa5fd7a04b2d13cfc101d01310db Mon Sep 17 00:00:00 2001 From: Florian Westphal Date: Tue, 22 Oct 2024 17:23:18 +0200 Subject: selftests: netfilter: nft_flowtable.sh: make first pass deterministic The CI occasionaly encounters a failing test run. Example: # PASS: ipsec tunnel mode for ns1/ns2 # re-run with random mtus: -o 10966 -l 19499 -r 31322 # PASS: flow offloaded for ns1/ns2 [..] # FAIL: ipsec tunnel ... counter 1157059 exceeds expected value 878489 This script will re-exec itself, on the second run, random MTUs are chosen for the involved links. This is done so we can cover different combinations (large mtu on client, small on server, link has lowest mtu, etc). Furthermore, file size is random, even for the first run. Rework this script and always use the same file size on initial run so that at least the first round can be expected to have reproducible behavior. Second round will use random mtu/filesize. Raise the failure limit to that of the file size, this should avoid all errneous test errors. Currently, first fin will remove the offload, so if one peer is already closing remaining data is handled by classic path, which result in larger-than-expected counter and a test failure. Given packet path also counts tcp/ip headers, in case offload is completely broken this test will still fail (as expected). The test counter limit could be made more strict again in the future once flowtable can keep a connection in offloaded state until FINs in both directions were seen. 
Signed-off-by: Florian Westphal Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241022152324.13554-1-fw@strlen.de Signed-off-by: Jakub Kicinski --- .../selftests/net/netfilter/nft_flowtable.sh | 39 ++++++++++++---------- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/tools/testing/selftests/net/netfilter/nft_flowtable.sh b/tools/testing/selftests/net/netfilter/nft_flowtable.sh index b3995550856a..a4ee5496f2a1 100755 --- a/tools/testing/selftests/net/netfilter/nft_flowtable.sh +++ b/tools/testing/selftests/net/netfilter/nft_flowtable.sh @@ -71,6 +71,8 @@ omtu=9000 lmtu=1500 rmtu=2000 +filesize=$((2 * 1024 * 1024)) + usage(){ echo "nft_flowtable.sh [OPTIONS]" echo @@ -81,12 +83,13 @@ usage(){ exit 1 } -while getopts "o:l:r:" o +while getopts "o:l:r:s:" o do case $o in o) omtu=$OPTARG;; l) lmtu=$OPTARG;; r) rmtu=$OPTARG;; + s) filesize=$OPTARG;; *) usage;; esac done @@ -217,18 +220,10 @@ ns2out=$(mktemp) make_file() { - name=$1 - - SIZE=$((RANDOM % (1024 * 128))) - SIZE=$((SIZE + (1024 * 8))) - TSIZE=$((SIZE * 1024)) - - dd if=/dev/urandom of="$name" bs=1024 count=$SIZE 2> /dev/null + name="$1" + sz="$2" - SIZE=$((RANDOM % 1024)) - SIZE=$((SIZE + 128)) - TSIZE=$((TSIZE + SIZE)) - dd if=/dev/urandom conf=notrunc of="$name" bs=1 count=$SIZE 2> /dev/null + head -c "$sz" < /dev/urandom > "$name" } check_counters() @@ -246,18 +241,18 @@ check_counters() local fs fs=$(du -sb "$nsin") local max_orig=${fs%%/*} - local max_repl=$((max_orig/4)) + local max_repl=$((max_orig)) # flowtable fastpath should bypass normal routing one, i.e. the counters in forward hook # should always be lower than the size of the transmitted file (max_orig). if [ "$orig_cnt" -gt "$max_orig" ];then - echo "FAIL: $what: original counter $orig_cnt exceeds expected value $max_orig" 1>&2 + echo "FAIL: $what: original counter $orig_cnt exceeds expected value $max_orig, reply counter $repl_cnt" 1>&2 ret=1 ok=0 fi if [ "$repl_cnt" -gt $max_repl ];then - echo "FAIL: $what: reply counter $repl_cnt exceeds expected value $max_repl" 1>&2 + echo "FAIL: $what: reply counter $repl_cnt exceeds expected value $max_repl, original counter $orig_cnt" 1>&2 ret=1 ok=0 fi @@ -455,7 +450,7 @@ test_tcp_forwarding_nat() return $lret } -make_file "$nsin" +make_file "$nsin" "$filesize" # First test: # No PMTU discovery, nsr1 is expected to fragment packets from ns1 to ns2 as needed. @@ -664,8 +659,16 @@ if [ "$1" = "" ]; then l=$(((RANDOM%mtu) + low)) r=$(((RANDOM%mtu) + low)) - echo "re-run with random mtus: -o $o -l $l -r $r" - $0 -o "$o" -l "$l" -r "$r" + MINSIZE=$((2 * 1000 * 1000)) + MAXSIZE=$((64 * 1000 * 1000)) + + filesize=$(((RANDOM * RANDOM) % MAXSIZE)) + if [ "$filesize" -lt "$MINSIZE" ]; then + filesize=$((filesize+MINSIZE)) + fi + + echo "re-run with random mtus and file size: -o $o -l $l -r $r -s $filesize" + $0 -o "$o" -l "$l" -r "$r" -s "$filesize" fi exit $ret -- cgit From 2e95c4384438adeaa772caa560244b1a2efef816 Mon Sep 17 00:00:00 2001 From: Pedro Tammela Date: Thu, 24 Oct 2024 12:55:47 -0400 Subject: net/sched: stop qdisc_tree_reduce_backlog on TC_H_ROOT In qdisc_tree_reduce_backlog, Qdiscs with major handle ffff: are assumed to be either root or ingress. This assumption is bogus since it's valid to create egress qdiscs with major handle ffff: Budimir Markovic found that for qdiscs like DRR that maintain an active class list, it will cause a UAF with a dangling class pointer. In 066a3b5b2346, the concern was to avoid iterating over the ingress qdisc since its parent is itself. 
The proper fix is to stop when parent TC_H_ROOT is reached because the only way to retrieve ingress is when a hierarchy which does not contain a ffff: major handle call into qdisc_lookup with TC_H_MAJ(TC_H_ROOT). In the scenario where major ffff: is an egress qdisc in any of the tree levels, the updates will also propagate to TC_H_ROOT, which then the iteration must stop. Fixes: 066a3b5b2346 ("[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop") Reported-by: Budimir Markovic Suggested-by: Jamal Hadi Salim Tested-by: Victor Nogueira Signed-off-by: Pedro Tammela Signed-off-by: Jamal Hadi Salim net/sched/sch_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241024165547.418570-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski --- net/sched/sch_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 2eefa4783879..a1d27bc039a3 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -791,7 +791,7 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len) drops = max_t(int, n, 0); rcu_read_lock(); while ((parentid = sch->parent)) { - if (TC_H_MAJ(parentid) == TC_H_MAJ(TC_H_INGRESS)) + if (parentid == TC_H_ROOT) break; if (sch->flags & TCQ_F_NOPARENT) -- cgit From aa30eb3260b2dea3a68d3c42a39f9a09c5e99cee Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Tue, 29 Oct 2024 10:26:40 -0700 Subject: bpf: Force checkpoint when jmp history is too long A specifically crafted program might trick verifier into growing very long jump history within a single bpf_verifier_state instance. Very long jump history makes mark_chain_precision() unreasonably slow, especially in case if verifier processes a loop. Mitigate this by forcing new state in is_state_visited() in case if current state's jump history is too long. Use same constant as in `skip_inf_loop_check`, but multiply it by arbitrarily chosen value 2 to account for jump history containing not only information about jumps, but also information about stack access. For an example of problematic program consider the code below, w/o this patch the example is processed by verifier for ~15 minutes, before failing to allocate big-enough chunk for jmp_history. 0: r7 = *(u16 *)(r1 +0);" 1: r7 += 0x1ab064b9;" 2: if r7 & 0x702000 goto 1b; 3: r7 &= 0x1ee60e;" 4: r7 += r1;" 5: if r7 s> 0x37d2 goto +0;" 6: r0 = 0;" 7: exit;" Perf profiling shows that most of the time is spent in mark_chain_precision() ~95%. The easiest way to explain why this program causes problems is to apply the following patch: diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0c216e71cec7..4b4823961abe 100644 \--- a/include/linux/bpf.h \+++ b/include/linux/bpf.h \@@ -1926,7 +1926,7 @@ struct bpf_array { }; }; -#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */ +#define BPF_COMPLEXITY_LIMIT_INSNS 256 /* yes. 1M insns */ #define MAX_TAIL_CALL_CNT 33 /* Maximum number of loops for bpf_loop and bpf_iter_num. 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f514247ba8ba..75e88be3bb3e 100644 \--- a/kernel/bpf/verifier.c \+++ b/kernel/bpf/verifier.c \@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) skip_inf_loop_check: if (!force_new_state && env->jmps_processed - env->prev_jmps_processed < 20 && - env->insn_processed - env->prev_insn_processed < 100) + env->insn_processed - env->prev_insn_processed < 100) { + verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n", + env->insn_idx, + env->jmps_processed - env->prev_jmps_processed, + cur->jmp_history_cnt); add_new_state = false; + } goto miss; } /* If sl->state is a part of a loop and this loop's entry is a part of \@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) if (!add_new_state) return 0; + verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n", + env->insn_idx); + /* There were no equivalent states, remember the current one. * Technically the current state is not proven to be safe yet, * but it will either reach outer most bpf_exit (which means it's safe) And observe verification log: ... is_state_visited: new checkpoint at 5, resetting env->jmps_processed 5: R1=ctx() R7=ctx(...) 5: (65) if r7 s> 0x37d2 goto pc+0 ; R7=ctx(...) 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0 6: R1=ctx() R7=ctx(...) R10=fp0 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74 from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... mark_precise 152 steps for r7 ... 2: R7_w=scalar(...) is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... BPF program is too large. Processed 257 insn The log output shows that checkpoint at label (1) is never created, because it is suppressed by `skip_inf_loop_check` logic: a. When 'if' at (2) is processed it pushes a state with insn_idx (1) onto stack and proceeds to (3); b. At (5) checkpoint is created, and this resets env->{jmps,insns}_processed. c. Verification proceeds and reaches `exit`; d. State saved at step (a) is popped from stack and is_state_visited() considers if checkpoint needs to be added, but because env->{jmps,insns}_processed had been just reset at step (b) the `skip_inf_loop_check` logic forces `add_new_state` to false. e. Verifier proceeds with current state, which slowly accumulates more and more entries in the jump history. The accumulation of entries in the jump history is a problem because of two factors: - it eventually exhausts memory available for kmalloc() allocation; - mark_chain_precision() traverses the jump history of a state, meaning that if `r7` is marked precise, verifier would iterate ever growing jump history until parent state boundary is reached. (note: the log also shows a REG INVARIANTS VIOLATION warning upon jset processing, but that's another bug to fix). With this patch applied, the example above is rejected by verifier under 1s of time, reaching 1M instructions limit. 
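The mitigation boils down to a bounded-history rule that can be sketched in ordinary C, independent of the verifier (all names below are invented for the illustration): keep counting history entries per state, and once the count passes a fixed limit, force a snapshot and start a fresh history so that any later backwards walk over the history stays short.

#include <stdio.h>

#define HISTORY_LIMIT 40        /* the bound the patch picks for jmp_history_cnt */

struct sim_state {
        int history_cnt;        /* entries accumulated since the last snapshot */
        int snapshots;          /* forced checkpoints */
};

static void record_jump(struct sim_state *st)
{
        st->history_cnt++;
        if (st->history_cnt > HISTORY_LIMIT) {
                st->snapshots++;        /* force a new state ... */
                st->history_cnt = 0;    /* ... so backtracking never walks a long tail */
        }
}

int main(void)
{
        struct sim_state st = { 0, 0 };

        for (int i = 0; i < 100000; i++)        /* a tight loop that keeps jumping */
                record_jump(&st);

        printf("forced snapshots: %d, pending history: %d\n",
               st.snapshots, st.history_cnt);
        return 0;
}

The constant 40 mirrors the commit's choice of the skip_inf_loop_check threshold (20) multiplied by two to account for stack-access entries in the history.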
The program is a simplified reproducer from syzbot report. Previous discussion could be found at [1]. The patch does not cause any changes in verification performance, when tested on selftests from veristat.cfg and cilium programs taken from [2]. [1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/ [2] https://github.com/anakryiko/cilium Changelog: - v1 -> v2: - moved patch to bpf tree; - moved force_new_state variable initialization after declaration and shortened the comment. v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/ Fixes: 2589726d12a1 ("bpf: introduce bounded loops") Reported-by: syzbot+7e46cdef14bf496a3ab4@syzkaller.appspotmail.com Signed-off-by: Eduard Zingerman Signed-off-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com Closes: https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/ --- kernel/bpf/verifier.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 587a6c76e564..3254c27085b8 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -17886,9 +17886,11 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) struct bpf_verifier_state_list *sl, **pprev; struct bpf_verifier_state *cur = env->cur_state, *new, *loop_entry; int i, j, n, err, states_cnt = 0; - bool force_new_state = env->test_state_freq || is_force_checkpoint(env, insn_idx); - bool add_new_state = force_new_state; - bool force_exact; + bool force_new_state, add_new_state, force_exact; + + force_new_state = env->test_state_freq || is_force_checkpoint(env, insn_idx) || + /* Avoid accumulating infinitely long jmp history */ + cur->jmp_history_cnt > 40; /* bpf progs typically have pruning point every 4 instructions * http://vger.kernel.org/bpfconf2019.html#session-1 @@ -17898,6 +17900,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) * In tests that amounts to up to 50% reduction into total verifier * memory consumption and 20% verifier time speedup. */ + add_new_state = force_new_state; if (env->jmps_processed - env->prev_jmps_processed >= 2 && env->insn_processed - env->prev_insn_processed >= 8) add_new_state = true; -- cgit From 1fb315892d8395cec2dae04b0cb5558731aefb37 Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Tue, 29 Oct 2024 10:26:41 -0700 Subject: selftests/bpf: Test with a very short loop The test added is a simplified reproducer from syzbot report [1]. If verifier does not insert checkpoint somewhere inside the loop, verification of the program would take a very long time. This would happen because mark_chain_precision() for register r7 would constantly trace jump history of the loop back, processing many iterations for each mark_chain_precision() call. 
[1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/ Signed-off-by: Eduard Zingerman Signed-off-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com --- .../selftests/bpf/progs/verifier_search_pruning.c | 23 ++++++++++++++++++++++ tools/testing/selftests/bpf/veristat.cfg | 1 + 2 files changed, 24 insertions(+) diff --git a/tools/testing/selftests/bpf/progs/verifier_search_pruning.c b/tools/testing/selftests/bpf/progs/verifier_search_pruning.c index 5a14498d352f..f40e57251e94 100644 --- a/tools/testing/selftests/bpf/progs/verifier_search_pruning.c +++ b/tools/testing/selftests/bpf/progs/verifier_search_pruning.c @@ -2,6 +2,7 @@ /* Converted from tools/testing/selftests/bpf/verifier/search_pruning.c */ #include +#include <../../../include/linux/filter.h> #include #include "bpf_misc.h" @@ -336,4 +337,26 @@ l0_%=: r1 = 42; \ : __clobber_all); } +/* Without checkpoint forcibly inserted at the back-edge a loop this + * test would take a very long time to verify. + */ +SEC("kprobe") +__failure __log_level(4) +__msg("BPF program is too large.") +__naked void short_loop1(void) +{ + asm volatile ( + " r7 = *(u16 *)(r1 +0);" + "1: r7 += 0x1ab064b9;" + " .8byte %[jset];" /* same as 'if r7 & 0x702000 goto 1b;' */ + " r7 &= 0x1ee60e;" + " r7 += r1;" + " if r7 s> 0x37d2 goto +0;" + " r0 = 0;" + " exit;" + : + : __imm_insn(jset, BPF_JMP_IMM(BPF_JSET, BPF_REG_7, 0x702000, -2)) + : __clobber_all); +} + char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/veristat.cfg b/tools/testing/selftests/bpf/veristat.cfg index 1a385061618d..e661ffdcaadf 100644 --- a/tools/testing/selftests/bpf/veristat.cfg +++ b/tools/testing/selftests/bpf/veristat.cfg @@ -15,3 +15,4 @@ test_usdt* test_verif_scale* test_xdp_noinline* xdp_synproxy* +verifier_search_pruning* -- cgit From 4ce1f56a1eaced2523329bef800d004e30f2f76c Mon Sep 17 00:00:00 2001 From: Zichen Xie Date: Tue, 22 Oct 2024 12:19:08 -0500 Subject: netdevsim: Add trailing zero to terminate the string in nsim_nexthop_bucket_activity_write() This was found by a static analyzer. We should not forget the trailing zero after copy_from_user() if we will further do some string operations, sscanf() in this case. Adding a trailing zero will ensure that the function performs properly. 
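Outside the kernel the same bug class reproduces with a few lines of C. The sketch below uses made-up buffer sizes and only stands in for the driver loosely, but it shows why one byte must be reserved for the terminator before sscanf() sees the buffer.

#include <stdio.h>
#include <string.h>

static int parse_activity(const char *user_data, size_t size)
{
        char buf[32];
        unsigned int nhid;
        unsigned short bucket_index;

        if (size > sizeof(buf) - 1)     /* leave room for the terminator */
                return -1;
        memcpy(buf, user_data, size);   /* stands in for copy_from_user() */
        buf[size] = 0;                  /* without this, sscanf() may scan past 'size' */

        if (sscanf(buf, "%u %hu", &nhid, &bucket_index) != 2)
                return -1;
        return 0;
}

int main(void)
{
        const char raw[4] = { '1', '0', ' ', '3' };     /* note: no trailing NUL */

        return parse_activity(raw, sizeof(raw)) == 0 ? 0 : 1;
}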
Fixes: c6385c0b67c5 ("netdevsim: Allow reporting activity on nexthop buckets") Signed-off-by: Zichen Xie Reviewed-by: Petr Machata Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/20241022171907.8606-1-zichenxie0106@gmail.com Signed-off-by: Jakub Kicinski --- drivers/net/netdevsim/fib.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c index 41e80f78b316..16c382c42227 100644 --- a/drivers/net/netdevsim/fib.c +++ b/drivers/net/netdevsim/fib.c @@ -1377,10 +1377,12 @@ static ssize_t nsim_nexthop_bucket_activity_write(struct file *file, if (pos != 0) return -EINVAL; - if (size > sizeof(buf)) + if (size > sizeof(buf) - 1) return -EINVAL; if (copy_from_user(buf, user_buf, size)) return -EFAULT; + buf[size] = 0; + if (sscanf(buf, "%u %hu", &nhid, &bucket_index) != 2) return -EINVAL; -- cgit From a13e690191eafc154b3f60afe9ce35aa9b9128b4 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean Date: Wed, 23 Oct 2024 13:05:41 +0300 Subject: net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext() This command: $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact Error: block dev insert failed: -EBUSY. fails because user space requests the same block index to be set for both ingress and egress. [ side note, I don't think it even failed prior to commit 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra"), because this is a command from an old set of notes of mine which used to work, but alas, I did not scientifically bisect this ] The problem is not that it fails, but rather, that the second time around, it fails differently (and irrecoverably): $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact Error: dsa_core: Flow block cb is busy. [ another note: the extack is added by me for illustration purposes. the context of the problem is that clsact_init() obtains the same &q->ingress_block pointer as &q->egress_block, and since we call tcf_block_get_ext() on both of them, "dev" will be added to the block->ports xarray twice, thus failing the operation: once through the ingress block pointer, and once again through the egress block pointer. the problem itself is that when xa_insert() fails, we have emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the offload never sees a corresponding FLOW_BLOCK_UNBIND. ] Even correcting the bad user input, we still cannot recover: $ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact Error: dsa_core: Flow block cb is busy. Basically the only way to recover is to reboot the system, or unbind and rebind the net device driver. To fix the bug, we need to fill the correct error teardown path which was missed during code movement, and call tcf_block_offload_unbind() when xa_insert() fails. [ last note, fundamentally I blame the label naming convention in tcf_block_get_ext() for the bug. The labels should be named after what they do, not after the error path that jumps to them. 
This way, it is obviously wrong that two labels pointing to the same code mean something is wrong, and checking the code correctness at the goto site is also easier ] Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") Signed-off-by: Vladimir Oltean Reviewed-by: Simon Horman Acked-by: Jamal Hadi Salim Link: https://patch.msgid.link/20241023100541.974362-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski --- net/sched/cls_api.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 17d97bbe890f..bbc778c233c8 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -1518,6 +1518,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q, return 0; err_dev_insert: + tcf_block_offload_unbind(block, q, ei); err_block_offload_bind: tcf_chain0_head_change_cb_del(block, ei); err_chain0_head_change_cb_add: -- cgit From 6b3f18a76be6bbd237c7594cf0bf2912b68084fe Mon Sep 17 00:00:00 2001 From: Benoît Monin Date: Thu, 24 Oct 2024 17:11:13 +0200 Subject: net: usb: qmi_wwan: add Quectel RG650V MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add support for Quectel RG650V which is based on Qualcomm SDX65 chip. The composition is DIAG / NMEA / AT / AT / QMI. T: Bus=02 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 4 Spd=5000 MxCh= 0 D: Ver= 3.20 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 9 #Cfgs= 1 P: Vendor=2c7c ProdID=0122 Rev=05.15 S: Manufacturer=Quectel S: Product=RG650V-EU S: SerialNumber=xxxxxxx C: #Ifs= 5 Cfg#= 1 Atr=a0 MxPwr=896mA I: If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option E: Ad=01(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms I: If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=02(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=82(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms I: If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=03(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=83(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=84(I) Atr=03(Int.) MxPS= 10 Ivl=9ms I: If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=00 Prot=00 Driver=option E: Ad=04(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=85(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=86(I) Atr=03(Int.) MxPS= 10 Ivl=9ms I: If#= 4 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan E: Ad=05(O) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=87(I) Atr=02(Bulk) MxPS=1024 Ivl=0ms E: Ad=88(I) Atr=03(Int.) 
MxPS= 8 Ivl=9ms Signed-off-by: Benoît Monin Reviewed-by: Simon Horman Link: https://patch.msgid.link/20241024151113.53203-1-benoit.monin@gmx.fr Signed-off-by: Jakub Kicinski --- drivers/net/usb/qmi_wwan.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index f137c82f1c0f..0c011d8f5d4d 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1076,6 +1076,7 @@ static const struct usb_device_id products[] = { USB_DEVICE_AND_INTERFACE_INFO(0x03f0, 0x581d, USB_CLASS_VENDOR_SPEC, 1, 7), .driver_info = (unsigned long)&qmi_wwan_info, }, + {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0122)}, /* Quectel RG650V */ {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0125)}, /* Quectel EC25, EC20 R2.0 Mini PCIe */ {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0306)}, /* Quectel EP06/EG06/EM06 */ {QMI_MATCH_FF_FF_FF(0x2c7c, 0x0512)}, /* Quectel EG12/EM12 */ -- cgit From 63fab04cbd0f96191b6e5beedc3b643b01c15889 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Sat, 26 Oct 2024 12:02:38 -0400 Subject: NFSD: Initialize struct nfsd4_copy earlier Ensure the refcount and async_copies fields are initialized early. cleanup_async_copy() will reference these fields if an error occurs in nfsd4_copy(). If they are not correctly initialized, at the very least, a refcount underflow occurs. Reported-by: Olga Kornievskaia Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations") Reviewed-by: Jeff Layton Tested-by: Olga Kornievskaia Signed-off-by: Chuck Lever --- fs/nfsd/nfs4proc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index b5a6bf4f459f..5fd1ce3fc8fb 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -1841,14 +1841,14 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, if (!async_copy) goto out_err; async_copy->cp_nn = nn; + INIT_LIST_HEAD(&async_copy->copies); + refcount_set(&async_copy->refcount, 1); /* Arbitrary cap on number of pending async copy operations */ if (atomic_inc_return(&nn->pending_async_copies) > (int)rqstp->rq_pool->sp_nrthreads) { atomic_dec(&nn->pending_async_copies); goto out_err; } - INIT_LIST_HEAD(&async_copy->copies); - refcount_set(&async_copy->refcount, 1); async_copy->cp_src = kmalloc(sizeof(*async_copy->cp_src), GFP_KERNEL); if (!async_copy->cp_src) goto out_err; -- cgit From 13400ac8fb80c57c2bfb12ebd35ee121ce9b4d21 Mon Sep 17 00:00:00 2001 From: Byeonguk Jeong Date: Sat, 26 Oct 2024 14:02:43 +0900 Subject: bpf: Fix out-of-bounds write in trie_get_next_key() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit trie_get_next_key() allocates a node stack with size trie->max_prefixlen, while it writes (trie->max_prefixlen + 1) nodes to the stack when it has full paths from the root to leaves. For example, consider a trie with max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ... 0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with .prefixlen = 8 make 9 nodes be written on the node stack with size 8. 
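Spelling the counting out makes the off-by-one easier to see. This is only an illustration of the sizing argument, not the trie implementation: a full root-to-leaf chain holds one node per prefix length from /0 up to /max_prefixlen, which is max_prefixlen + 1 entries.

#include <stdio.h>

#define MAX_PREFIXLEN 8                 /* matches the commit's example */

int main(void)
{
        unsigned int capacity = MAX_PREFIXLEN;  /* old allocation */
        unsigned int pushed = 0;

        /* one node per prefix length on a full chain: /0, /1, ..., /8 */
        for (unsigned int plen = 0; plen <= MAX_PREFIXLEN; plen++) {
                pushed++;
                if (pushed > capacity)
                        printf("push %u overflows a stack of %u entries\n",
                               pushed, capacity);
        }
        printf("a full path needs %u entries (max_prefixlen + 1)\n", pushed);
        return 0;
}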
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Signed-off-by: Byeonguk Jeong Reviewed-by: Toke Høiland-Jørgensen Tested-by: Hou Tao Acked-by: Hou Tao Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain Signed-off-by: Alexei Starovoitov --- kernel/bpf/lpm_trie.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 0218a5132ab5..9b60eda0f727 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -655,7 +655,7 @@ static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key) if (!key || key->prefixlen > trie->max_prefixlen) goto find_leftmost; - node_stack = kmalloc_array(trie->max_prefixlen, + node_stack = kmalloc_array(trie->max_prefixlen + 1, sizeof(struct lpm_trie_node *), GFP_ATOMIC | __GFP_NOWARN); if (!node_stack) -- cgit From d7f214aeacb984b9d42da0146e789f595eb09068 Mon Sep 17 00:00:00 2001 From: Byeonguk Jeong Date: Sat, 26 Oct 2024 14:04:58 +0900 Subject: selftests/bpf: Add test for trie_get_next_key() Add a test for out-of-bounds write in trie_get_next_key() when a full path from root to leaf exists and bpf_map_get_next_key() is called with the leaf node. It may crashes the kernel on failure, so please run in a VM. Signed-off-by: Byeonguk Jeong Acked-by: Hou Tao Link: https://lore.kernel.org/r/Zxx4ep78tsbeWPVM@localhost.localdomain Signed-off-by: Alexei Starovoitov --- .../bpf/map_tests/lpm_trie_map_get_next_key.c | 109 +++++++++++++++++++++ 1 file changed, 109 insertions(+) create mode 100644 tools/testing/selftests/bpf/map_tests/lpm_trie_map_get_next_key.c diff --git a/tools/testing/selftests/bpf/map_tests/lpm_trie_map_get_next_key.c b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_get_next_key.c new file mode 100644 index 000000000000..0ba015686492 --- /dev/null +++ b/tools/testing/selftests/bpf/map_tests/lpm_trie_map_get_next_key.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include + +struct test_lpm_key { + __u32 prefix; + __u32 data; +}; + +struct get_next_key_ctx { + struct test_lpm_key key; + bool start; + bool stop; + int map_fd; + int loop; +}; + +static void *get_next_key_fn(void *arg) +{ + struct get_next_key_ctx *ctx = arg; + struct test_lpm_key next_key; + int i = 0; + + while (!ctx->start) + usleep(1); + + while (!ctx->stop && i++ < ctx->loop) + bpf_map_get_next_key(ctx->map_fd, &ctx->key, &next_key); + + return NULL; +} + +static void abort_get_next_key(struct get_next_key_ctx *ctx, pthread_t *tids, + unsigned int nr) +{ + unsigned int i; + + ctx->stop = true; + ctx->start = true; + for (i = 0; i < nr; i++) + pthread_join(tids[i], NULL); +} + +/* This test aims to prevent regression of future. As long as the kernel does + * not panic, it is considered as success. 
+ */ +void test_lpm_trie_map_get_next_key(void) +{ +#define MAX_NR_THREADS 8 + LIBBPF_OPTS(bpf_map_create_opts, create_opts, + .map_flags = BPF_F_NO_PREALLOC); + struct test_lpm_key key = {}; + __u32 val = 0; + int map_fd; + const __u32 max_prefixlen = 8 * (sizeof(key) - sizeof(key.prefix)); + const __u32 max_entries = max_prefixlen + 1; + unsigned int i, nr = MAX_NR_THREADS, loop = 65536; + pthread_t tids[MAX_NR_THREADS]; + struct get_next_key_ctx ctx; + int err; + + map_fd = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, "lpm_trie_map", + sizeof(struct test_lpm_key), sizeof(__u32), + max_entries, &create_opts); + CHECK(map_fd == -1, "bpf_map_create()", "error:%s\n", + strerror(errno)); + + for (i = 0; i <= max_prefixlen; i++) { + key.prefix = i; + err = bpf_map_update_elem(map_fd, &key, &val, BPF_ANY); + CHECK(err, "bpf_map_update_elem()", "error:%s\n", + strerror(errno)); + } + + ctx.start = false; + ctx.stop = false; + ctx.map_fd = map_fd; + ctx.loop = loop; + memcpy(&ctx.key, &key, sizeof(key)); + + for (i = 0; i < nr; i++) { + err = pthread_create(&tids[i], NULL, get_next_key_fn, &ctx); + if (err) { + abort_get_next_key(&ctx, tids, i); + CHECK(err, "pthread_create", "error %d\n", err); + } + } + + ctx.start = true; + for (i = 0; i < nr; i++) + pthread_join(tids[i], NULL); + + printf("%s:PASS\n", __func__); + + close(map_fd); +} -- cgit From aec8e6bf839101784f3ef037dcdb9432c3f32343 Mon Sep 17 00:00:00 2001 From: Zhihao Cheng Date: Mon, 21 Oct 2024 22:02:15 +0800 Subject: btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mounting btrfs from two images (which have the same one fsid and two different dev_uuids) in certain executing order may trigger an UAF for variable 'device->bdev_file' in __btrfs_free_extra_devids(). And following are the details: 1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs devices by ioctl(BTRFS_IOC_SCAN_DEV): / btrfs_device_1 → loop0 fs_device \ btrfs_device_2 → loop1 2. mount /dev/loop0 /mnt btrfs_open_devices btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0) btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree fail: btrfs_close_devices // -ENOMEM btrfs_close_bdev(btrfs_device_1) fput(btrfs_device_1->bdev_file) // btrfs_device_1->bdev_file is freed btrfs_close_bdev(btrfs_device_2) fput(btrfs_device_2->bdev_file) 3. mount /dev/loop1 /mnt btrfs_open_devices btrfs_get_bdev_and_sb(&bdev_file) // EIO, btrfs_device_1->bdev_file is not assigned, // which points to a freed memory area btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1) btrfs_fill_super open_ctree btrfs_free_extra_devids if (btrfs_device_1->bdev_file) fput(btrfs_device_1->bdev_file) // UAF ! Fix it by setting 'device->bdev_file' as 'NULL' after closing the btrfs_device in btrfs_close_one_device(). 
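The pattern behind the fix can be shown in isolation with a user-space sketch (invented structure, not the btrfs code): when a cached pointer doubles as an "is this still open?" flag, it has to be cleared at the same point the reference is dropped, otherwise a later conditional release touches freed memory.

#include <stdio.h>
#include <stdlib.h>

struct dev_state {
        FILE *bdev_file;        /* doubles as the "device is open" flag */
};

static void close_one_device(struct dev_state *dev)
{
        if (dev->bdev_file) {
                fclose(dev->bdev_file); /* drop the reference */
                dev->bdev_file = NULL;  /* the fix: leave no stale pointer behind */
        }
}

static void free_extra_devids(struct dev_state *dev)
{
        /* safe only because close_one_device() cleared the pointer */
        if (dev->bdev_file) {
                fclose(dev->bdev_file);
                dev->bdev_file = NULL;
        }
}

int main(void)
{
        struct dev_state dev = { .bdev_file = fopen("/dev/null", "r") };

        close_one_device(&dev);
        free_extra_devids(&dev);        /* no double fclose(), no use-after-free */
        return 0;
}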
Fixes: 142388194191 ("btrfs: do not background blkdev_put()") CC: stable@vger.kernel.org # 4.19+ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408 Signed-off-by: Zhihao Cheng Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/volumes.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 8f340ad1d938..eb51b609190f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1105,6 +1105,7 @@ static void btrfs_close_one_device(struct btrfs_device *device) if (device->bdev) { fs_devices->open_devices--; device->bdev = NULL; + device->bdev_file = NULL; } clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state); btrfs_destroy_dev_zone_info(device); -- cgit From 9ab5cf19fb0e4680f95e506d6c544259bf1111c4 Mon Sep 17 00:00:00 2001 From: Wang Liang Date: Wed, 23 Oct 2024 11:52:13 +0800 Subject: net: fix crash when config small gso_max_size/gso_ipv4_max_size Config a small gso_max_size/gso_ipv4_max_size will lead to an underflow in sk_dst_gso_max_size(), which may trigger a BUG_ON crash, because sk->sk_gso_max_size would be much bigger than device limits. Call Trace: tcp_write_xmit tso_segs = tcp_init_tso_segs(skb, mss_now); tcp_set_skb_tso_segs tcp_skb_pcount_set // skb->len = 524288, mss_now = 8 // u16 tso_segs = 524288/8 = 65535 -> 0 tso_segs = DIV_ROUND_UP(skb->len, mss_now) BUG_ON(!tso_segs) Add check for the minimum value of gso_max_size and gso_ipv4_max_size. Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device") Signed-off-by: Wang Liang Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20241023035213.517386-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski --- net/core/rtnetlink.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index e30e7ea0207d..2ba5cd965d3f 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -2032,7 +2032,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = { [IFLA_NUM_TX_QUEUES] = { .type = NLA_U32 }, [IFLA_NUM_RX_QUEUES] = { .type = NLA_U32 }, [IFLA_GSO_MAX_SEGS] = { .type = NLA_U32 }, - [IFLA_GSO_MAX_SIZE] = { .type = NLA_U32 }, + [IFLA_GSO_MAX_SIZE] = NLA_POLICY_MIN(NLA_U32, MAX_TCP_HEADER + 1), [IFLA_PHYS_PORT_ID] = { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN }, [IFLA_CARRIER_CHANGES] = { .type = NLA_U32 }, /* ignored */ [IFLA_PHYS_SWITCH_ID] = { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN }, @@ -2057,7 +2057,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = { [IFLA_TSO_MAX_SIZE] = { .type = NLA_REJECT }, [IFLA_TSO_MAX_SEGS] = { .type = NLA_REJECT }, [IFLA_ALLMULTI] = { .type = NLA_REJECT }, - [IFLA_GSO_IPV4_MAX_SIZE] = { .type = NLA_U32 }, + [IFLA_GSO_IPV4_MAX_SIZE] = NLA_POLICY_MIN(NLA_U32, MAX_TCP_HEADER + 1), [IFLA_GRO_IPV4_MAX_SIZE] = { .type = NLA_U32 }, }; -- cgit From d0b98f6a17a5cb336121302bce0c97eb5fe32d16 Mon Sep 17 00:00:00 2001 From: Eduard Zingerman Date: Tue, 29 Oct 2024 12:39:11 -0700 Subject: bpf: disallow 40-bytes extra stack for bpf_fastcall patterns Hou Tao reported an issue with bpf_fastcall patterns allowing extra stack space above MAX_BPF_STACK limit. This extra stack allowance is not integrated properly with the following verifier parts: - backtracking logic still assumes that stack can't exceed MAX_BPF_STACK; - bpf_verifier_env->scratched_stack_slots assumes only 64 slots are available. 
Here is an example of an issue with precision tracking (note stack slot -8 tracked as precise instead of -520): 0: (b7) r1 = 42 ; R1_w=42 1: (b7) r2 = 42 ; R2_w=42 2: (7b) *(u64 *)(r10 -512) = r1 ; R1_w=42 R10=fp0 fp-512_w=42 3: (7b) *(u64 *)(r10 -520) = r2 ; R2_w=42 R10=fp0 fp-520_w=42 4: (85) call bpf_get_smp_processor_id#8 ; R0_w=scalar(...) 5: (79) r2 = *(u64 *)(r10 -520) ; R2_w=42 R10=fp0 fp-520_w=42 6: (79) r1 = *(u64 *)(r10 -512) ; R1_w=42 R10=fp0 fp-512_w=42 7: (bf) r3 = r10 ; R3_w=fp0 R10=fp0 8: (0f) r3 += r2 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 7: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 6: (79) r1 = *(u64 *)(r10 -512) mark_precise: frame0: regs=r2 stack= before 5: (79) r2 = *(u64 *)(r10 -520) mark_precise: frame0: regs= stack=-8 before 4: (85) call bpf_get_smp_processor_id#8 mark_precise: frame0: regs= stack=-8 before 3: (7b) *(u64 *)(r10 -520) = r2 mark_precise: frame0: regs=r2 stack= before 2: (7b) *(u64 *)(r10 -512) = r1 mark_precise: frame0: regs=r2 stack= before 1: (b7) r2 = 42 9: R2_w=42 R3_w=fp42 9: (95) exit This patch disables the additional allowance for the moment. Also, two test cases are removed: - bpf_fastcall_max_stack_ok: it fails w/o additional stack allowance; - bpf_fastcall_max_stack_fail: this test is no longer necessary, stack size follows regular rules, pattern invalidation is checked by other test cases. Reported-by: Hou Tao Closes: https://lore.kernel.org/bpf/20241023022752.172005-1-houtao@huaweicloud.com/ Fixes: 5b5f51bff1b6 ("bpf: no_caller_saved_registers attribute for helper calls") Signed-off-by: Eduard Zingerman Acked-by: Andrii Nakryiko Tested-by: Hou Tao Link: https://lore.kernel.org/r/20241029193911.1575719-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 14 +----- .../selftests/bpf/progs/verifier_bpf_fastcall.c | 55 ---------------------- 2 files changed, 2 insertions(+), 67 deletions(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 3254c27085b8..bb99bada7e2e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -6804,20 +6804,10 @@ static int check_stack_slot_within_bounds(struct bpf_verifier_env *env, struct bpf_func_state *state, enum bpf_access_type t) { - struct bpf_insn_aux_data *aux = &env->insn_aux_data[env->insn_idx]; - int min_valid_off, max_bpf_stack; - - /* If accessing instruction is a spill/fill from bpf_fastcall pattern, - * add room for all caller saved registers below MAX_BPF_STACK. - * In case if bpf_fastcall rewrite won't happen maximal stack depth - * would be checked by check_max_stack_depth_subprog(). 
- */ - max_bpf_stack = MAX_BPF_STACK; - if (aux->fastcall_pattern) - max_bpf_stack += CALLER_SAVED_REGS * BPF_REG_SIZE; + int min_valid_off; if (t == BPF_WRITE || env->allow_uninit_stack) - min_valid_off = -max_bpf_stack; + min_valid_off = -MAX_BPF_STACK; else min_valid_off = -state->allocated_stack; diff --git a/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c b/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c index 9da97d2efcd9..5094c288cfd7 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c +++ b/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c @@ -790,61 +790,6 @@ __naked static void cumulative_stack_depth_subprog(void) :: __imm(bpf_get_smp_processor_id) : __clobber_all); } -SEC("raw_tp") -__arch_x86_64 -__log_level(4) -__msg("stack depth 512") -__xlated("0: r1 = 42") -__xlated("1: *(u64 *)(r10 -512) = r1") -__xlated("2: w0 = ") -__xlated("3: r0 = &(void __percpu *)(r0)") -__xlated("4: r0 = *(u32 *)(r0 +0)") -__xlated("5: exit") -__success -__naked int bpf_fastcall_max_stack_ok(void) -{ - asm volatile( - "r1 = 42;" - "*(u64 *)(r10 - %[max_bpf_stack]) = r1;" - "*(u64 *)(r10 - %[max_bpf_stack_8]) = r1;" - "call %[bpf_get_smp_processor_id];" - "r1 = *(u64 *)(r10 - %[max_bpf_stack_8]);" - "exit;" - : - : __imm_const(max_bpf_stack, MAX_BPF_STACK), - __imm_const(max_bpf_stack_8, MAX_BPF_STACK + 8), - __imm(bpf_get_smp_processor_id) - : __clobber_all - ); -} - -SEC("raw_tp") -__arch_x86_64 -__log_level(4) -__msg("stack depth 520") -__failure -__naked int bpf_fastcall_max_stack_fail(void) -{ - asm volatile( - "r1 = 42;" - "*(u64 *)(r10 - %[max_bpf_stack]) = r1;" - "*(u64 *)(r10 - %[max_bpf_stack_8]) = r1;" - "call %[bpf_get_smp_processor_id];" - "r1 = *(u64 *)(r10 - %[max_bpf_stack_8]);" - /* call to prandom blocks bpf_fastcall rewrite */ - "*(u64 *)(r10 - %[max_bpf_stack_8]) = r1;" - "call %[bpf_get_prandom_u32];" - "r1 = *(u64 *)(r10 - %[max_bpf_stack_8]);" - "exit;" - : - : __imm_const(max_bpf_stack, MAX_BPF_STACK), - __imm_const(max_bpf_stack_8, MAX_BPF_STACK + 8), - __imm(bpf_get_smp_processor_id), - __imm(bpf_get_prandom_u32) - : __clobber_all - ); -} - SEC("cgroup/getsockname_unix") __xlated("0: r2 = 1") /* bpf_cast_to_kern_ctx is replaced by a single assignment */ -- cgit From 72f7e16eccddde99386a10eb2d08833e805917c6 Mon Sep 17 00:00:00 2001 From: Andrzej Kacprowski Date: Thu, 17 Oct 2024 16:49:58 +0200 Subject: accel/ivpu: Fix NOC firewall interrupt handling The NOC firewall interrupt means that the HW prevented unauthorized access to a protected resource, so there is no need to trigger device reset in such case. To facilitate security testing add firewall_irq_counter debugfs file that tracks firewall interrupts. 
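As a rough user-space analogy (invented names, not the driver's interfaces), the new handling is simply an atomic event counter behind a read-only debug hook instead of a recovery trigger:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int firewall_irq_counter;

static void noc_firewall_irq(void)
{
        /* previously: trigger a full device recovery here */
        atomic_fetch_add(&firewall_irq_counter, 1);
}

static void debugfs_show(FILE *out)
{
        fprintf(out, "%d\n", atomic_load(&firewall_irq_counter));
}

int main(void)
{
        for (int i = 0; i < 3; i++)
                noc_firewall_irq();     /* three blocked accesses, no reset */
        debugfs_show(stdout);           /* prints 3 */
        return 0;
}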
Fixes: 8a27ad81f7d3 ("accel/ivpu: Split IP and buttress code") Cc: stable@vger.kernel.org # v6.11+ Signed-off-by: Andrzej Kacprowski Reviewed-by: Jacek Lawrynowicz Reviewed-by: Jeffrey Hugo Signed-off-by: Jacek Lawrynowicz Link: https://patchwork.freedesktop.org/patch/msgid/20241017144958.79327-1-jacek.lawrynowicz@linux.intel.com --- drivers/accel/ivpu/ivpu_debugfs.c | 9 +++++++++ drivers/accel/ivpu/ivpu_hw.c | 1 + drivers/accel/ivpu/ivpu_hw.h | 1 + drivers/accel/ivpu/ivpu_hw_ip.c | 5 ++++- 4 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/accel/ivpu/ivpu_debugfs.c b/drivers/accel/ivpu/ivpu_debugfs.c index 6f86f8df30db..8d50981594d1 100644 --- a/drivers/accel/ivpu/ivpu_debugfs.c +++ b/drivers/accel/ivpu/ivpu_debugfs.c @@ -108,6 +108,14 @@ static int reset_pending_show(struct seq_file *s, void *v) return 0; } +static int firewall_irq_counter_show(struct seq_file *s, void *v) +{ + struct ivpu_device *vdev = seq_to_ivpu(s); + + seq_printf(s, "%d\n", atomic_read(&vdev->hw->firewall_irq_counter)); + return 0; +} + static const struct drm_debugfs_info vdev_debugfs_list[] = { {"bo_list", bo_list_show, 0}, {"fw_name", fw_name_show, 0}, @@ -116,6 +124,7 @@ static const struct drm_debugfs_info vdev_debugfs_list[] = { {"last_bootmode", last_bootmode_show, 0}, {"reset_counter", reset_counter_show, 0}, {"reset_pending", reset_pending_show, 0}, + {"firewall_irq_counter", firewall_irq_counter_show, 0}, }; static ssize_t diff --git a/drivers/accel/ivpu/ivpu_hw.c b/drivers/accel/ivpu/ivpu_hw.c index 27f0fe4d54e0..e69c0613513f 100644 --- a/drivers/accel/ivpu/ivpu_hw.c +++ b/drivers/accel/ivpu/ivpu_hw.c @@ -249,6 +249,7 @@ int ivpu_hw_init(struct ivpu_device *vdev) platform_init(vdev); wa_init(vdev); timeouts_init(vdev); + atomic_set(&vdev->hw->firewall_irq_counter, 0); return 0; } diff --git a/drivers/accel/ivpu/ivpu_hw.h b/drivers/accel/ivpu/ivpu_hw.h index 1c0c98e3afb8..a96a05b2acda 100644 --- a/drivers/accel/ivpu/ivpu_hw.h +++ b/drivers/accel/ivpu/ivpu_hw.h @@ -52,6 +52,7 @@ struct ivpu_hw_info { int dma_bits; ktime_t d0i3_entry_host_ts; u64 d0i3_entry_vpu_ts; + atomic_t firewall_irq_counter; }; int ivpu_hw_init(struct ivpu_device *vdev); diff --git a/drivers/accel/ivpu/ivpu_hw_ip.c b/drivers/accel/ivpu/ivpu_hw_ip.c index dfd2f4a5b526..60b33fc59d96 100644 --- a/drivers/accel/ivpu/ivpu_hw_ip.c +++ b/drivers/accel/ivpu/ivpu_hw_ip.c @@ -1062,7 +1062,10 @@ static void irq_wdt_mss_handler(struct ivpu_device *vdev) static void irq_noc_firewall_handler(struct ivpu_device *vdev) { - ivpu_pm_trigger_recovery(vdev, "NOC Firewall IRQ"); + atomic_inc(&vdev->hw->firewall_irq_counter); + + ivpu_dbg(vdev, IRQ, "NOC Firewall interrupt detected, counter %d\n", + atomic_read(&vdev->hw->firewall_irq_counter)); } /* Handler for IRQs from NPU core */ -- cgit From 2a492ff66673c38a77d0815d67b9a8cce2ef57f8 Mon Sep 17 00:00:00 2001 From: Ojaswin Mujoo Date: Tue, 15 Oct 2024 15:15:09 +0530 Subject: xfs: Check for delayed allocations before setting extsize Extsize should only be allowed to be set on files with no data in it. For this, we check if the files have extents but miss to check if delayed extents are present. This patch adds that check. 
While we are at it, also refactor this check into a helper since it's used in some other places as well like xfs_inactive() or xfs_ioctl_setattr_xflags() **Without the patch (SUCCEEDS)** $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) **With the patch (FAILS as expected)** $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument Fixes: e94af02a9cd7 ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt") Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Reviewed-by: John Garry Signed-off-by: Ojaswin Mujoo Signed-off-by: Carlos Maiolino --- fs/xfs/xfs_inode.c | 2 +- fs/xfs/xfs_inode.h | 5 +++++ fs/xfs/xfs_ioctl.c | 4 ++-- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index bcc277fc0a83..19dcb569a3e7 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1409,7 +1409,7 @@ xfs_inactive( if (S_ISREG(VFS_I(ip)->i_mode) && (ip->i_disk_size != 0 || XFS_ISIZE(ip) != 0 || - ip->i_df.if_nextents > 0 || ip->i_delayed_blks > 0)) + xfs_inode_has_filedata(ip))) truncate = 1; if (xfs_iflags_test(ip, XFS_IQUOTAUNCHECKED)) { diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 97ed912306fd..03944b6c5fba 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -292,6 +292,11 @@ static inline bool xfs_is_cow_inode(struct xfs_inode *ip) return xfs_is_reflink_inode(ip) || xfs_is_always_cow_inode(ip); } +static inline bool xfs_inode_has_filedata(const struct xfs_inode *ip) +{ + return ip->i_df.if_nextents > 0 || ip->i_delayed_blks > 0; +} + /* * Check if an inode has any data in the COW fork. This might be often false * even for inodes with the reflink flag when there is no pending COW operation. diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index a20d426ef021..2567fd2a0994 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -481,7 +481,7 @@ xfs_ioctl_setattr_xflags( if (rtflag != XFS_IS_REALTIME_INODE(ip)) { /* Can't change realtime flag if any extents are allocated. */ - if (ip->i_df.if_nextents || ip->i_delayed_blks) + if (xfs_inode_has_filedata(ip)) return -EINVAL; /* @@ -602,7 +602,7 @@ xfs_ioctl_setattr_check_extsize( if (!fa->fsx_valid) return 0; - if (S_ISREG(VFS_I(ip)->i_mode) && ip->i_df.if_nextents && + if (S_ISREG(VFS_I(ip)->i_mode) && xfs_inode_has_filedata(ip) && XFS_FSB_TO_B(mp, ip->i_extsize) != fa->fsx_extsize) return -EINVAL; -- cgit From 3ef22684038aa577c10972ee9c6a2455f5fac941 Mon Sep 17 00:00:00 2001 From: Chi Zhiling Date: Fri, 25 Oct 2024 10:33:20 +0800 Subject: xfs: Reduce unnecessary searches when searching for the best extents Recently, we found that the CPU spent a lot of time in xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented spaces. The reason is that we conducted much extra searching for extents that could not yield a better result, and these searches would cost a lot of time when there were millions of extents to search through. Even if we get the same result length, we don't switch our choice to the new one, so we can definitely terminate the search early. Since the result length cannot exceed the found length, when the found length equals the best result length we already have, we can conclude the search. 
We did a test in that filesystem: [root@localhost ~]# xfs_db -c freesp /dev/vdb from to extents blocks pct 1 1 215 215 0.01 2 3 994476 1988952 99.99 Before this patch: 0) | xfs_alloc_ag_vextent_size [xfs]() { 0) * 15597.94 us | } After this patch: 0) | xfs_alloc_ag_vextent_size [xfs]() { 0) 19.176 us | } Signed-off-by: Chi Zhiling Reviewed-by: Dave Chinner Signed-off-by: Carlos Maiolino --- fs/xfs/libxfs/xfs_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 04f64cf9777e..22bdbb3e9980 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -1923,7 +1923,7 @@ restart: error = -EFSCORRUPTED; goto error0; } - if (flen < bestrlen) + if (flen <= bestrlen) break; busy = xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen, &busy_gen); -- cgit From dc60992ce76fbc2f71c2674f435ff6bde2108028 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 23 Oct 2024 15:37:22 +0200 Subject: xfs: fix finding a last resort AG in xfs_filestream_pick_ag When the main loop in xfs_filestream_pick_ag fails to find a suitable AG it tries to just pick the online AG. But the loop for that uses args->pag as loop iterator while the later code expects pag to be set. Fix this by reusing the max_pag case for this last resort, and also add a check for impossible case of no AG just to make sure that the uninitialized pag doesn't even escape in theory. Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Fixes: f8f1ed1ab3baba ("xfs: return a referenced perag from filestreams allocator") Cc: # v6.3 Reviewed-by: Darrick J. Wong Signed-off-by: Carlos Maiolino --- fs/xfs/xfs_filestream.c | 23 ++++++++++++----------- fs/xfs/xfs_trace.h | 15 +++++---------- 2 files changed, 17 insertions(+), 21 deletions(-) diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c index e3aaa0555597..88bd23ce74cd 100644 --- a/fs/xfs/xfs_filestream.c +++ b/fs/xfs/xfs_filestream.c @@ -64,7 +64,7 @@ xfs_filestream_pick_ag( struct xfs_perag *pag; struct xfs_perag *max_pag = NULL; xfs_extlen_t minlen = *longest; - xfs_extlen_t free = 0, minfree, maxfree = 0; + xfs_extlen_t minfree, maxfree = 0; xfs_agnumber_t agno; bool first_pass = true; int err; @@ -107,7 +107,6 @@ restart: !(flags & XFS_PICK_USERDATA) || (flags & XFS_PICK_LOWSPACE))) { /* Break out, retaining the reference on the AG. */ - free = pag->pagf_freeblks; break; } } @@ -150,23 +149,25 @@ restart: * grab. */ if (!max_pag) { - for_each_perag_wrap(args->mp, 0, start_agno, args->pag) + for_each_perag_wrap(args->mp, 0, start_agno, pag) { + max_pag = pag; break; - atomic_inc(&args->pag->pagf_fstrms); - *longest = 0; - } else { - pag = max_pag; - free = maxfree; - atomic_inc(&pag->pagf_fstrms); + } + + /* Bail if there are no AGs at all to select from. 
*/ + if (!max_pag) + return -ENOSPC; } + + pag = max_pag; + atomic_inc(&pag->pagf_fstrms); } else if (max_pag) { xfs_perag_rele(max_pag); } - trace_xfs_filestream_pick(pag, pino, free); + trace_xfs_filestream_pick(pag, pino); args->pag = pag; return 0; - } static struct xfs_inode * diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index ee9f0b1f548d..fcb2bad4f76e 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -691,8 +691,8 @@ DEFINE_FILESTREAM_EVENT(xfs_filestream_lookup); DEFINE_FILESTREAM_EVENT(xfs_filestream_scan); TRACE_EVENT(xfs_filestream_pick, - TP_PROTO(struct xfs_perag *pag, xfs_ino_t ino, xfs_extlen_t free), - TP_ARGS(pag, ino, free), + TP_PROTO(struct xfs_perag *pag, xfs_ino_t ino), + TP_ARGS(pag, ino), TP_STRUCT__entry( __field(dev_t, dev) __field(xfs_ino_t, ino) @@ -703,14 +703,9 @@ TRACE_EVENT(xfs_filestream_pick, TP_fast_assign( __entry->dev = pag->pag_mount->m_super->s_dev; __entry->ino = ino; - if (pag) { - __entry->agno = pag->pag_agno; - __entry->streams = atomic_read(&pag->pagf_fstrms); - } else { - __entry->agno = NULLAGNUMBER; - __entry->streams = 0; - } - __entry->free = free; + __entry->agno = pag->pag_agno; + __entry->streams = atomic_read(&pag->pagf_fstrms); + __entry->free = pag->pagf_freeblks; ), TP_printk("dev %d:%d ino 0x%llx agno 0x%x streams %d free %d", MAJOR(__entry->dev), MINOR(__entry->dev), -- cgit From 81a1e1c32ef474c20ccb9f730afe1ac25b1c62a4 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 23 Oct 2024 15:37:23 +0200 Subject: xfs: streamline xfs_filestream_pick_ag Directly return the error from xfs_bmap_longest_free_extent instead of breaking from the loop and handling it there, and use a done label to directly jump to the exist when we found a suitable perag structure to reduce the indentation level and pag/max_pag check complexity in the tail of the function. Signed-off-by: Christoph Hellwig Reviewed-by: Darrick J. Wong Signed-off-by: Carlos Maiolino --- fs/xfs/xfs_filestream.c | 96 ++++++++++++++++++++++++------------------------- 1 file changed, 46 insertions(+), 50 deletions(-) diff --git a/fs/xfs/xfs_filestream.c b/fs/xfs/xfs_filestream.c index 88bd23ce74cd..290ba8887d29 100644 --- a/fs/xfs/xfs_filestream.c +++ b/fs/xfs/xfs_filestream.c @@ -67,22 +67,28 @@ xfs_filestream_pick_ag( xfs_extlen_t minfree, maxfree = 0; xfs_agnumber_t agno; bool first_pass = true; - int err; /* 2% of an AG's blocks must be free for it to be chosen. */ minfree = mp->m_sb.sb_agblocks / 50; restart: for_each_perag_wrap(mp, start_agno, agno, pag) { + int err; + trace_xfs_filestream_scan(pag, pino); + *longest = 0; err = xfs_bmap_longest_free_extent(pag, NULL, longest); if (err) { - if (err != -EAGAIN) - break; - /* Couldn't lock the AGF, skip this AG. */ - err = 0; - continue; + if (err == -EAGAIN) { + /* Couldn't lock the AGF, skip this AG. */ + err = 0; + continue; + } + xfs_perag_rele(pag); + if (max_pag) + xfs_perag_rele(max_pag); + return err; } /* Keep track of the AG with the most free blocks. */ @@ -107,7 +113,9 @@ restart: !(flags & XFS_PICK_USERDATA) || (flags & XFS_PICK_LOWSPACE))) { /* Break out, retaining the reference on the AG. */ - break; + if (max_pag) + xfs_perag_rele(max_pag); + goto done; } } @@ -115,56 +123,44 @@ restart: atomic_dec(&pag->pagf_fstrms); } - if (err) { - xfs_perag_rele(pag); - if (max_pag) - xfs_perag_rele(max_pag); - return err; + /* + * Allow a second pass to give xfs_bmap_longest_free_extent() another + * attempt at locking AGFs that it might have skipped over before we + * fail. 
+ */ + if (first_pass) { + first_pass = false; + goto restart; } - if (!pag) { - /* - * Allow a second pass to give xfs_bmap_longest_free_extent() - * another attempt at locking AGFs that it might have skipped - * over before we fail. - */ - if (first_pass) { - first_pass = false; - goto restart; - } - - /* - * We must be low on data space, so run a final lowspace - * optimised selection pass if we haven't already. - */ - if (!(flags & XFS_PICK_LOWSPACE)) { - flags |= XFS_PICK_LOWSPACE; - goto restart; - } - - /* - * No unassociated AGs are available, so select the AG with the - * most free space, regardless of whether it's already in use by - * another filestream. It none suit, just use whatever AG we can - * grab. - */ - if (!max_pag) { - for_each_perag_wrap(args->mp, 0, start_agno, pag) { - max_pag = pag; - break; - } + /* + * We must be low on data space, so run a final lowspace optimised + * selection pass if we haven't already. + */ + if (!(flags & XFS_PICK_LOWSPACE)) { + flags |= XFS_PICK_LOWSPACE; + goto restart; + } - /* Bail if there are no AGs at all to select from. */ - if (!max_pag) - return -ENOSPC; + /* + * No unassociated AGs are available, so select the AG with the most + * free space, regardless of whether it's already in use by another + * filestream. It none suit, just use whatever AG we can grab. + */ + if (!max_pag) { + for_each_perag_wrap(args->mp, 0, start_agno, pag) { + max_pag = pag; + break; } - pag = max_pag; - atomic_inc(&pag->pagf_fstrms); - } else if (max_pag) { - xfs_perag_rele(max_pag); + /* Bail if there are no AGs at all to select from. */ + if (!max_pag) + return -ENOSPC; } + pag = max_pag; + atomic_inc(&pag->pagf_fstrms); +done: trace_xfs_filestream_pick(pag, pino); args->pag = pag; return 0; -- cgit From 76342e84258771e0ef1da7f7de071069f33f9900 Mon Sep 17 00:00:00 2001 From: Liu Jing Date: Mon, 21 Oct 2024 16:04:47 +0800 Subject: selftests: netfilter: remove unused parameter err is never used, remove it. Signed-off-by: Liu Jing Signed-off-by: Pablo Neira Ayuso --- tools/testing/selftests/net/netfilter/conntrack_dump_flush.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/net/netfilter/conntrack_dump_flush.c b/tools/testing/selftests/net/netfilter/conntrack_dump_flush.c index dc056fec993b..254ff03297f0 100644 --- a/tools/testing/selftests/net/netfilter/conntrack_dump_flush.c +++ b/tools/testing/selftests/net/netfilter/conntrack_dump_flush.c @@ -98,7 +98,7 @@ static int conntrack_data_insert(struct mnl_socket *sock, struct nlmsghdr *nlh, char buf[MNL_SOCKET_BUFFER_SIZE]; struct nlmsghdr *rplnlh; unsigned int portid; - int err, ret; + int ret; portid = mnl_socket_get_portid(sock); @@ -217,7 +217,7 @@ static int conntracK_count_zone(struct mnl_socket *sock, uint16_t zone) struct nfgenmsg *nfh; struct nlattr *nest; unsigned int portid; - int err, ret; + int ret; portid = mnl_socket_get_portid(sock); @@ -264,7 +264,7 @@ static int conntrack_flush_zone(struct mnl_socket *sock, uint16_t zone) struct nfgenmsg *nfh; struct nlattr *nest; unsigned int portid; - int err, ret; + int ret; portid = mnl_socket_get_portid(sock); -- cgit From f48d258f0ac540f00fa617dac496c4c18b5dc2fa Mon Sep 17 00:00:00 2001 From: Dong Chenchen Date: Thu, 24 Oct 2024 09:47:01 +0800 Subject: netfilter: Fix use-after-free in get_info() ip6table_nat module unload has refcnt warning for UAF. 
call trace is: WARNING: CPU: 1 PID: 379 at kernel/module/main.c:853 module_put+0x6f/0x80 Modules linked in: ip6table_nat(-) CPU: 1 UID: 0 PID: 379 Comm: ip6tables Not tainted 6.12.0-rc4-00047-gc2ee9f594da8-dirty #205 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:module_put+0x6f/0x80 Call Trace: get_info+0x128/0x180 do_ip6t_get_ctl+0x6a/0x430 nf_getsockopt+0x46/0x80 ipv6_getsockopt+0xb9/0x100 rawv6_getsockopt+0x42/0x190 do_sock_getsockopt+0xaa/0x180 __sys_getsockopt+0x70/0xc0 __x64_sys_getsockopt+0x20/0x30 do_syscall_64+0xa2/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Concurrent execution of module unload and get_info() trigered the warning. The root cause is as follows: cpu0 cpu1 module_exit //mod->state = MODULE_STATE_GOING ip6table_nat_exit xt_unregister_template kfree(t) //removed from templ_list getinfo() t = xt_find_table_lock list_for_each_entry(tmpl, &xt_templates[af]...) if (strcmp(tmpl->name, name)) continue; //table not found try_module_get list_for_each_entry(t, &xt_net->tables[af]...) return t; //not get refcnt module_put(t->me) //uaf unregister_pernet_subsys //remove table from xt_net list While xt_table module was going away and has been removed from xt_templates list, we couldnt get refcnt of xt_table->me. Check module in xt_net->tables list re-traversal to fix it. Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Signed-off-by: Dong Chenchen Reviewed-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/x_tables.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index da5d929c7c85..709840612f0d 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1269,7 +1269,7 @@ struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af, /* and once again: */ list_for_each_entry(t, &xt_net->tables[af], list) - if (strcmp(t->name, name) == 0) + if (strcmp(t->name, name) == 0 && owner == t->me) return t; module_put(owner); -- cgit From 4ed234fe793f27a3b151c43d2106df2ff0d81aac Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Fri, 25 Oct 2024 08:02:29 +0000 Subject: netfilter: nf_reject_ipv6: fix potential crash in nf_send_reset6() I got a syzbot report without a repro [1] crashing in nf_send_reset6() I think the issue is that dev->hard_header_len is zero, and we attempt later to push an Ethernet header. Use LL_MAX_HEADER, as other functions in net/ipv6/netfilter/nf_reject_ipv6.c. [1] skbuff: skb_under_panic: text:ffffffff89b1d008 len:74 put:14 head:ffff88803123aa00 data:ffff88803123a9f2 tail:0x3c end:0x140 dev:syz_tun kernel BUG at net/core/skbuff.c:206 ! 
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 0 UID: 0 PID: 7373 Comm: syz.1.568 Not tainted 6.12.0-rc2-syzkaller-00631-g6d858708d465 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:skb_panic net/core/skbuff.c:206 [inline] RIP: 0010:skb_under_panic+0x14b/0x150 net/core/skbuff.c:216 Code: 0d 8d 48 c7 c6 60 a6 29 8e 48 8b 54 24 08 8b 0c 24 44 8b 44 24 04 4d 89 e9 50 41 54 41 57 41 56 e8 ba 30 38 02 48 83 c4 20 90 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 RSP: 0018:ffffc900045269b0 EFLAGS: 00010282 RAX: 0000000000000088 RBX: dffffc0000000000 RCX: cd66dacdc5d8e800 RDX: 0000000000000000 RSI: 0000000000000200 RDI: 0000000000000000 RBP: ffff88802d39a3d0 R08: ffffffff8174afec R09: 1ffff920008a4ccc R10: dffffc0000000000 R11: fffff520008a4ccd R12: 0000000000000140 R13: ffff88803123aa00 R14: ffff88803123a9f2 R15: 000000000000003c FS: 00007fdbee5ff6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000005d322000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_push+0xe5/0x100 net/core/skbuff.c:2636 eth_header+0x38/0x1f0 net/ethernet/eth.c:83 dev_hard_header include/linux/netdevice.h:3208 [inline] nf_send_reset6+0xce6/0x1270 net/ipv6/netfilter/nf_reject_ipv6.c:358 nft_reject_inet_eval+0x3b9/0x690 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain+0x4ad/0x1da0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet+0x418/0x6b0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xc3/0x220 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] br_nf_pre_routing_ipv6+0x63e/0x770 net/bridge/br_netfilter_ipv6.c:184 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_bridge_pre net/bridge/br_input.c:277 [inline] br_handle_frame+0x9fd/0x1530 net/bridge/br_input.c:424 __netif_receive_skb_core+0x13e8/0x4570 net/core/dev.c:5562 __netif_receive_skb_one_core net/core/dev.c:5666 [inline] __netif_receive_skb+0x12f/0x650 net/core/dev.c:5781 netif_receive_skb_internal net/core/dev.c:5867 [inline] netif_receive_skb+0x1e8/0x890 net/core/dev.c:5926 tun_rx_batched+0x1b7/0x8f0 drivers/net/tun.c:1550 tun_get_user+0x3056/0x47e0 drivers/net/tun.c:2007 tun_chr_write_iter+0x10d/0x1f0 drivers/net/tun.c:2053 new_sync_write fs/read_write.c:590 [inline] vfs_write+0xa6d/0xc90 fs/read_write.c:683 ksys_write+0x183/0x2b0 fs/read_write.c:736 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fdbeeb7d1ff Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 8d 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 1c 8e 02 00 48 RSP: 002b:00007fdbee5ff000 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fdbeed36058 RCX: 00007fdbeeb7d1ff RDX: 000000000000008e RSI: 0000000020000040 RDI: 00000000000000c8 RBP: 00007fdbeebf12be R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000008e R11: 0000000000000293 R12: 0000000000000000 R13: 0000000000000000 R14: 00007fdbeed36058 R15: 00007ffc38de06e8 Fixes: c8d7b98bec43 ("netfilter: move nf_send_resetX() 
code to nf_reject_ipvX modules") Reported-by: syzbot Signed-off-by: Eric Dumazet Signed-off-by: Pablo Neira Ayuso --- net/ipv6/netfilter/nf_reject_ipv6.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c index 7db0437140bf..9ae2b2725bf9 100644 --- a/net/ipv6/netfilter/nf_reject_ipv6.c +++ b/net/ipv6/netfilter/nf_reject_ipv6.c @@ -268,12 +268,12 @@ static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in) void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, int hook) { - struct sk_buff *nskb; - struct tcphdr _otcph; - const struct tcphdr *otcph; - unsigned int otcplen, hh_len; const struct ipv6hdr *oip6h = ipv6_hdr(oldskb); struct dst_entry *dst = NULL; + const struct tcphdr *otcph; + struct sk_buff *nskb; + struct tcphdr _otcph; + unsigned int otcplen; struct flowi6 fl6; if ((!(ipv6_addr_type(&oip6h->saddr) & IPV6_ADDR_UNICAST)) || @@ -312,9 +312,8 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, if (IS_ERR(dst)) return; - hh_len = (dst->dev->hard_header_len + 15)&~15; - nskb = alloc_skb(hh_len + 15 + dst->header_len + sizeof(struct ipv6hdr) - + sizeof(struct tcphdr) + dst->trailer_len, + nskb = alloc_skb(LL_MAX_HEADER + sizeof(struct ipv6hdr) + + sizeof(struct tcphdr) + dst->trailer_len, GFP_ATOMIC); if (!nskb) { @@ -327,7 +326,7 @@ void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb, nskb->mark = fl6.flowi6_mark; - skb_reserve(nskb, hh_len + dst->header_len); + skb_reserve(nskb, LL_MAX_HEADER); nf_reject_ip6hdr_put(nskb, oldskb, IPPROTO_TCP, ip6_dst_hoplimit(dst)); nf_reject_ip6_tcphdr_put(nskb, oldskb, otcph, otcplen); -- cgit From 4413665dd6c528b31284119e3571c25f371e1c36 Mon Sep 17 00:00:00 2001 From: Jan Schär Date: Tue, 29 Oct 2024 23:12:49 +0100 Subject: ALSA: usb-audio: Add quirks for Dell WD19 dock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The WD19 family of docks has the same audio chipset as the WD15. This change enables jack detection on the WD19. We don't need the dell_dock_mixer_init quirk for the WD19. It is only needed because of the dell_alc4020_map quirk for the WD15 in mixer_maps.c, which disables the volume controls. Even for the WD15, this quirk was apparently only needed when the dock firmware was not updated. Signed-off-by: Jan Schär Cc: Link: https://patch.msgid.link/20241029221249.15661-1-jan@jschaer.ch Signed-off-by: Takashi Iwai --- sound/usb/mixer_quirks.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/sound/usb/mixer_quirks.c b/sound/usb/mixer_quirks.c index 2a9594f34dac..6456e87e2f39 100644 --- a/sound/usb/mixer_quirks.c +++ b/sound/usb/mixer_quirks.c @@ -4042,6 +4042,9 @@ int snd_usb_mixer_apply_create_quirk(struct usb_mixer_interface *mixer) break; err = dell_dock_mixer_init(mixer); break; + case USB_ID(0x0bda, 0x402e): /* Dell WD19 dock */ + err = dell_dock_mixer_create(mixer); + break; case USB_ID(0x2a39, 0x3fd2): /* RME ADI-2 Pro */ case USB_ID(0x2a39, 0x3fd3): /* RME ADI-2 DAC */ -- cgit From 0b04fbe886b4274c8e5855011233aaa69fec6e75 Mon Sep 17 00:00:00 2001 From: Christoffer Sandberg Date: Tue, 29 Oct 2024 16:16:52 +0100 Subject: ALSA: hda/realtek: Fix headset mic on TUXEDO Gemini 17 Gen3 Quirk is needed to enable headset microphone on missing pin 0x19. 
Signed-off-by: Christoffer Sandberg Signed-off-by: Werner Sembach Cc: Link: https://patch.msgid.link/20241029151653.80726-1-wse@tuxedocomputers.com Signed-off-by: Takashi Iwai --- sound/pci/hda/patch_realtek.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c index 7f4926194e50..e06a6fdc0bab 100644 --- a/sound/pci/hda/patch_realtek.c +++ b/sound/pci/hda/patch_realtek.c @@ -10750,6 +10750,7 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = { SND_PCI_QUIRK(0x1558, 0x1404, "Clevo N150CU", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE), SND_PCI_QUIRK(0x1558, 0x14a1, "Clevo L141MU", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE), SND_PCI_QUIRK(0x1558, 0x2624, "Clevo L240TU", ALC256_FIXUP_SYSTEM76_MIC_NO_PRESENCE), + SND_PCI_QUIRK(0x1558, 0x28c1, "Clevo V370VND", ALC2XX_FIXUP_HEADSET_MIC), SND_PCI_QUIRK(0x1558, 0x4018, "Clevo NV40M[BE]", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE), SND_PCI_QUIRK(0x1558, 0x4019, "Clevo NV40MZ", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE), SND_PCI_QUIRK(0x1558, 0x4020, "Clevo NV40MB", ALC293_FIXUP_SYSTEM76_MIC_NO_PRESENCE), -- cgit From e49370d769e71456db3fbd982e95bab8c69f73e8 Mon Sep 17 00:00:00 2001 From: Christoffer Sandberg Date: Tue, 29 Oct 2024 16:16:53 +0100 Subject: ALSA: hda/realtek: Fix headset mic on TUXEDO Stellaris 16 Gen6 mb1 Quirk is needed to enable headset microphone on missing pin 0x19. Signed-off-by: Christoffer Sandberg Signed-off-by: Werner Sembach Cc: Link: https://patch.msgid.link/20241029151653.80726-2-wse@tuxedocomputers.com Signed-off-by: Takashi Iwai --- sound/pci/hda/patch_realtek.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c index e06a6fdc0bab..571fa8a6c9e1 100644 --- a/sound/pci/hda/patch_realtek.c +++ b/sound/pci/hda/patch_realtek.c @@ -11008,6 +11008,7 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = { SND_PCI_QUIRK(0x1d05, 0x115c, "TongFang GMxTGxx", ALC269_FIXUP_NO_SHUTUP), SND_PCI_QUIRK(0x1d05, 0x121b, "TongFang GMxAGxx", ALC269_FIXUP_NO_SHUTUP), SND_PCI_QUIRK(0x1d05, 0x1387, "TongFang GMxIXxx", ALC2XX_FIXUP_HEADSET_MIC), + SND_PCI_QUIRK(0x1d05, 0x1409, "TongFang GMxIXxx", ALC2XX_FIXUP_HEADSET_MIC), SND_PCI_QUIRK(0x1d17, 0x3288, "Haier Boyue G42", ALC269VC_FIXUP_ACER_VCOPPERBOX_PINS), SND_PCI_QUIRK(0x1d72, 0x1602, "RedmiBook", ALC255_FIXUP_XIAOMI_HEADSET_MIC), SND_PCI_QUIRK(0x1d72, 0x1701, "XiaomiNotebook Pro", ALC298_FIXUP_DELL1_MIC_NO_PRESENCE), -- cgit From 42ab37eaad17aee458489c553a367621ee04e0bc Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Wed, 23 Oct 2024 08:40:26 -0700 Subject: nvme: module parameter to disable pi with offsets A recent commit enables integrity checks for formats the previous kernel versions registered with the "nop" integrity profile. This means namespaces using that format become unreadable when upgrading the kernel past that commit. Introduce a module parameter to restore the "nop" integrity profile so that storage can be readable once again. This could be a boot device, so the setting needs to happen at module load time. 
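For illustration, and assuming the usual nvme-core module naming (not spelled out in the patch itself): because the parameter is registered read-only (0444), it can only take effect at load time, e.g.

        nvme_core.disable_pi_offsets=1            on the kernel command line (nvme-core built in)
        options nvme_core disable_pi_offsets=1    in a modprobe.d file (nvme-core loaded as a module)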
Fixes: 921e81db524d17 ("nvme: allow integrity when PI is not in first bytes") Reported-by: David Wei Reviewed-by: Christoph Hellwig Reviewed-by: Kanchan Joshi Reviewed-by: Chaitanya Kulkarni Signed-off-by: Keith Busch --- drivers/nvme/host/core.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 09466e208729..ce20b916301a 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -91,6 +91,17 @@ module_param(apst_secondary_latency_tol_us, ulong, 0644); MODULE_PARM_DESC(apst_secondary_latency_tol_us, "secondary APST latency tolerance in us"); +/* + * Older kernels didn't enable protection information if it was at an offset. + * Newer kernels do, so it breaks reads on the upgrade if such formats were + * used in prior kernels since the metadata written did not contain a valid + * checksum. + */ +static bool disable_pi_offsets = false; +module_param(disable_pi_offsets, bool, 0444); +MODULE_PARM_DESC(disable_pi_offsets, + "disable protection information if it has an offset"); + /* * nvme_wq - hosts nvme related works that are not reset or delete * nvme_reset_wq - hosts nvme reset works @@ -1926,8 +1937,12 @@ static void nvme_configure_metadata(struct nvme_ctrl *ctrl, if (head->pi_size && head->ms >= head->pi_size) head->pi_type = id->dps & NVME_NS_DPS_PI_MASK; - if (!(id->dps & NVME_NS_DPS_PI_FIRST)) - info->pi_offset = head->ms - head->pi_size; + if (!(id->dps & NVME_NS_DPS_PI_FIRST)) { + if (disable_pi_offsets) + head->pi_type = 0; + else + info->pi_offset = head->ms - head->pi_size; + } if (ctrl->ops->flags & NVME_F_FABRICS) { /* -- cgit From d2f551b1f72b4c508ab9298419f6feadc3b5d791 Mon Sep 17 00:00:00 2001 From: Vitaliy Shevtsov Date: Mon, 16 Sep 2024 22:41:37 +0500 Subject: nvmet-auth: assign dh_key to NULL after kfree_sensitive ctrl->dh_key might be used across multiple calls to nvmet_setup_dhgroup() for the same controller. So it's better to nullify it after release on error path in order to avoid double free later in nvmet_destroy_auth(). Found by Linux Verification Center (linuxtesting.org) with Svace. Fixes: 7a277c37d352 ("nvmet-auth: Diffie-Hellman key exchange support") Cc: stable@vger.kernel.org Signed-off-by: Vitaliy Shevtsov Reviewed-by: Christoph Hellwig Reviewed-by: Hannes Reinecke Signed-off-by: Keith Busch --- drivers/nvme/target/auth.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nvme/target/auth.c b/drivers/nvme/target/auth.c index 7897d02c681d..b0fd211ec57e 100644 --- a/drivers/nvme/target/auth.c +++ b/drivers/nvme/target/auth.c @@ -115,6 +115,7 @@ int nvmet_setup_dhgroup(struct nvmet_ctrl *ctrl, u8 dhgroup_id) pr_debug("%s: ctrl %d failed to generate private key, err %d\n", __func__, ctrl->cntlid, ret); kfree_sensitive(ctrl->dh_key); + ctrl->dh_key = NULL; return ret; } ctrl->dh_keysize = crypto_kpp_maxsize(ctrl->dh_tfm); -- cgit From 5eed4fb274cd6579f2fb4190b11c4c86c553cd06 Mon Sep 17 00:00:00 2001 From: Keith Busch Date: Mon, 28 Oct 2024 13:45:46 -0700 Subject: nvme: re-fix error-handling for io_uring nvme-passthrough This was previously fixed with commit 1147dd0503564fa0e0348 ("nvme: fix error-handling for io_uring nvme-passthrough"), but the change was mistakenly undone in a later commit. 
Fixes: d6aacee9255e7f ("nvme: use bio_integrity_map_user") Cc: stable@vger.kernel.org Reported-by: Jens Axboe Reviewed-by: Christoph Hellwig Reviewed-by: Anuj Gupta Reviewed-by: Kanchan Joshi Signed-off-by: Keith Busch --- drivers/nvme/host/ioctl.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index b9b79ccfabf8..a96976b22fa7 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -421,10 +421,13 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req, struct io_uring_cmd *ioucmd = req->end_io_data; struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd); - if (nvme_req(req)->flags & NVME_REQ_CANCELLED) + if (nvme_req(req)->flags & NVME_REQ_CANCELLED) { pdu->status = -EINTR; - else + } else { pdu->status = nvme_req(req)->status; + if (!pdu->status) + pdu->status = blk_status_to_errno(err); + } pdu->result = le64_to_cpu(nvme_req(req)->result.u64); /* -- cgit From 5d01b56f0518d80211812420a8907ca0b6c6e4e3 Mon Sep 17 00:00:00 2001 From: Boris Brezillon Date: Wed, 30 Oct 2024 16:02:31 +0100 Subject: drm/panthor: Fix firmware initialization on systems with a page size > 4k The system and GPU MMU page size might differ, which becomes a problem for FW sections that need to be mapped at explicit addresses since our PAGE_SIZE alignment might cover a VA range that's expected to be used for another section. Make sure we never map more than we need. Changes in v3: - Add R-bs Changes in v2: - Plan for per-VM page sizes so the MCU VM and user VM can have different pages sizes Fixes: 2718d91816ee ("drm/panthor: Add the FW logical block") Signed-off-by: Boris Brezillon Reviewed-by: Steven Price Reviewed-by: Liviu Dudau Link: https://patchwork.freedesktop.org/patch/msgid/20241030150231.768949-1-boris.brezillon@collabora.com --- drivers/gpu/drm/panthor/panthor_fw.c | 4 ++-- drivers/gpu/drm/panthor/panthor_gem.c | 11 ++++++++--- drivers/gpu/drm/panthor/panthor_mmu.c | 16 +++++++++++++--- drivers/gpu/drm/panthor/panthor_mmu.h | 1 + 4 files changed, 24 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c index ef232c0c2049..4e2d3a02ea06 100644 --- a/drivers/gpu/drm/panthor/panthor_fw.c +++ b/drivers/gpu/drm/panthor/panthor_fw.c @@ -487,6 +487,7 @@ static int panthor_fw_load_section_entry(struct panthor_device *ptdev, struct panthor_fw_binary_iter *iter, u32 ehdr) { + ssize_t vm_pgsz = panthor_vm_page_size(ptdev->fw->vm); struct panthor_fw_binary_section_entry_hdr hdr; struct panthor_fw_section *section; u32 section_size; @@ -515,8 +516,7 @@ static int panthor_fw_load_section_entry(struct panthor_device *ptdev, return -EINVAL; } - if ((hdr.va.start & ~PAGE_MASK) != 0 || - (hdr.va.end & ~PAGE_MASK) != 0) { + if (!IS_ALIGNED(hdr.va.start, vm_pgsz) || !IS_ALIGNED(hdr.va.end, vm_pgsz)) { drm_err(&ptdev->base, "Firmware corrupted, virtual addresses not page aligned: 0x%x-0x%x\n", hdr.va.start, hdr.va.end); return -EINVAL; diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c index 38f560864879..be97d56bc011 100644 --- a/drivers/gpu/drm/panthor/panthor_gem.c +++ b/drivers/gpu/drm/panthor/panthor_gem.c @@ -44,8 +44,7 @@ void panthor_kernel_bo_destroy(struct panthor_kernel_bo *bo) to_panthor_bo(bo->obj)->exclusive_vm_root_gem != panthor_vm_root_gem(vm))) goto out_free_bo; - ret = panthor_vm_unmap_range(vm, bo->va_node.start, - panthor_kernel_bo_size(bo)); + ret = panthor_vm_unmap_range(vm, 
bo->va_node.start, bo->va_node.size); if (ret) goto out_free_bo; @@ -95,10 +94,16 @@ panthor_kernel_bo_create(struct panthor_device *ptdev, struct panthor_vm *vm, } bo = to_panthor_bo(&obj->base); - size = obj->base.size; kbo->obj = &obj->base; bo->flags = bo_flags; + /* The system and GPU MMU page size might differ, which becomes a + * problem for FW sections that need to be mapped at explicit address + * since our PAGE_SIZE alignment might cover a VA range that's + * expected to be used for another section. + * Make sure we never map more than we need. + */ + size = ALIGN(size, panthor_vm_page_size(vm)); ret = panthor_vm_alloc_va(vm, gpu_va, size, &kbo->va_node); if (ret) goto err_put_obj; diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c index 3cd2bce59edc..5d5e25b1be95 100644 --- a/drivers/gpu/drm/panthor/panthor_mmu.c +++ b/drivers/gpu/drm/panthor/panthor_mmu.c @@ -826,6 +826,14 @@ void panthor_vm_idle(struct panthor_vm *vm) mutex_unlock(&ptdev->mmu->as.slots_lock); } +u32 panthor_vm_page_size(struct panthor_vm *vm) +{ + const struct io_pgtable *pgt = io_pgtable_ops_to_pgtable(vm->pgtbl_ops); + u32 pg_shift = ffs(pgt->cfg.pgsize_bitmap) - 1; + + return 1u << pg_shift; +} + static void panthor_vm_stop(struct panthor_vm *vm) { drm_sched_stop(&vm->sched, NULL); @@ -1025,12 +1033,13 @@ int panthor_vm_alloc_va(struct panthor_vm *vm, u64 va, u64 size, struct drm_mm_node *va_node) { + ssize_t vm_pgsz = panthor_vm_page_size(vm); int ret; - if (!size || (size & ~PAGE_MASK)) + if (!size || !IS_ALIGNED(size, vm_pgsz)) return -EINVAL; - if (va != PANTHOR_VM_KERNEL_AUTO_VA && (va & ~PAGE_MASK)) + if (va != PANTHOR_VM_KERNEL_AUTO_VA && !IS_ALIGNED(va, vm_pgsz)) return -EINVAL; mutex_lock(&vm->mm_lock); @@ -2366,11 +2375,12 @@ panthor_vm_bind_prepare_op_ctx(struct drm_file *file, const struct drm_panthor_vm_bind_op *op, struct panthor_vm_op_ctx *op_ctx) { + ssize_t vm_pgsz = panthor_vm_page_size(vm); struct drm_gem_object *gem; int ret; /* Aligned on page size. */ - if ((op->va | op->size) & ~PAGE_MASK) + if (!IS_ALIGNED(op->va | op->size, vm_pgsz)) return -EINVAL; switch (op->flags & DRM_PANTHOR_VM_BIND_OP_TYPE_MASK) { diff --git a/drivers/gpu/drm/panthor/panthor_mmu.h b/drivers/gpu/drm/panthor/panthor_mmu.h index 6788771071e3..8d21e83d8aba 100644 --- a/drivers/gpu/drm/panthor/panthor_mmu.h +++ b/drivers/gpu/drm/panthor/panthor_mmu.h @@ -30,6 +30,7 @@ panthor_vm_get_bo_for_va(struct panthor_vm *vm, u64 va, u64 *bo_offset); int panthor_vm_active(struct panthor_vm *vm); void panthor_vm_idle(struct panthor_vm *vm); +u32 panthor_vm_page_size(struct panthor_vm *vm); int panthor_vm_as(struct panthor_vm *vm); int panthor_vm_flush_all(struct panthor_vm *vm); -- cgit From 412a2a8fdd4eb89b263623c7a59b77dbfcf8f215 Mon Sep 17 00:00:00 2001 From: Boris Brezillon Date: Tue, 29 Oct 2024 16:29:10 +0100 Subject: drm/panthor: Fail job creation when the group is dead Userspace can use GROUP_SUBMIT errors as a trigger to check the group state and recreate the group if it became unusable. Make sure we report an error when the group became unusable. 
Changes in v3: - None Changes in v2: - Add R-bs Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") Signed-off-by: Boris Brezillon Reviewed-by: Steven Price Reviewed-by: Liviu Dudau Link: https://patchwork.freedesktop.org/patch/msgid/20241029152912.270346-2-boris.brezillon@collabora.com --- drivers/gpu/drm/panthor/panthor_sched.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c index aee362abb710..05119d9bb279 100644 --- a/drivers/gpu/drm/panthor/panthor_sched.c +++ b/drivers/gpu/drm/panthor/panthor_sched.c @@ -3409,6 +3409,11 @@ panthor_job_create(struct panthor_file *pfile, goto err_put_job; } + if (!group_can_run(job->group)) { + ret = -EINVAL; + goto err_put_job; + } + if (job->queue_idx >= job->group->queue_count || !job->group->queues[job->queue_idx]) { ret = -EINVAL; -- cgit From 4700fd3e050da8302e60ebd4850d008250fa7204 Mon Sep 17 00:00:00 2001 From: Boris Brezillon Date: Tue, 29 Oct 2024 16:29:11 +0100 Subject: drm/panthor: Report group as timedout when we fail to properly suspend If we don't do that, the group is considered usable by userspace, but all further GROUP_SUBMIT will fail with -EINVAL. Changes in v3: - Add R-bs Changes in v2: - New patch Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") Signed-off-by: Boris Brezillon Reviewed-by: Steven Price Reviewed-by: Liviu Dudau Link: https://patchwork.freedesktop.org/patch/msgid/20241029152912.270346-3-boris.brezillon@collabora.com --- drivers/gpu/drm/panthor/panthor_sched.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c index 05119d9bb279..9929e22f4d8d 100644 --- a/drivers/gpu/drm/panthor/panthor_sched.c +++ b/drivers/gpu/drm/panthor/panthor_sched.c @@ -589,10 +589,11 @@ struct panthor_group { * @timedout: True when a timeout occurred on any of the queues owned by * this group. * - * Timeouts can be reported by drm_sched or by the FW. In any case, any - * timeout situation is unrecoverable, and the group becomes useless. - * We simply wait for all references to be dropped so we can release the - * group object. + * Timeouts can be reported by drm_sched or by the FW. If a reset is required, + * and the group can't be suspended, this also leads to a timeout. In any case, + * any timeout situation is unrecoverable, and the group becomes useless. We + * simply wait for all references to be dropped so we can release the group + * object. */ bool timedout; @@ -2640,6 +2641,12 @@ void panthor_sched_suspend(struct panthor_device *ptdev) csgs_upd_ctx_init(&upd_ctx); while (slot_mask) { u32 csg_id = ffs(slot_mask) - 1; + struct panthor_csg_slot *csg_slot = &sched->csg_slots[csg_id]; + + /* We consider group suspension failures as fatal and flag the + * group as unusable by setting timedout=true. + */ + csg_slot->group->timedout = true; csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, csg_id, CSG_STATE_TERMINATE, -- cgit From 8286f8b622990194207df9ab852e0f87c60d35e9 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 29 Oct 2024 15:27:19 -0400 Subject: NFSD: Never decrement pending_async_copies on error The error flow in nfsd4_copy() calls cleanup_async_copy(), which already decrements nn->pending_async_copies. 
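As a minimal sketch of the resulting pattern (hypothetical names, not the nfsd code itself): the cleanup helper owns the decrement, so the admission check must not drop the counter before jumping to the error path.

#include <linux/atomic.h>
#include <linux/errno.h>

static atomic_t pending_copies;

/* stands in for cleanup_async_copy(): the only place the count drops */
static void cleanup_copy(void)
{
        atomic_dec(&pending_copies);
}

static int start_copy(int limit)
{
        /* admission check: note there is no atomic_dec() on failure here */
        if (atomic_inc_return(&pending_copies) > limit)
                goto out_err;
        return 0;

out_err:
        cleanup_copy();         /* drops the count exactly once */
        return -EBUSY;
}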
Reported-by: Olga Kornievskaia Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations") Signed-off-by: Chuck Lever --- fs/nfsd/nfs4proc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index 5fd1ce3fc8fb..d32f2dfd148f 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -1845,10 +1845,8 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, refcount_set(&async_copy->refcount, 1); /* Arbitrary cap on number of pending async copy operations */ if (atomic_inc_return(&nn->pending_async_copies) > - (int)rqstp->rq_pool->sp_nrthreads) { - atomic_dec(&nn->pending_async_copies); + (int)rqstp->rq_pool->sp_nrthreads) goto out_err; - } async_copy->cp_src = kmalloc(sizeof(*async_copy->cp_src), GFP_KERNEL); if (!async_copy->cp_src) goto out_err; -- cgit From 1e67d8641813f1876a42eeb4f532487b8a7fb0a8 Mon Sep 17 00:00:00 2001 From: Sungwoo Kim Date: Tue, 29 Oct 2024 19:44:41 +0000 Subject: Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs Fix __hci_cmd_sync_sk() to return not NULL for unknown opcodes. __hci_cmd_sync_sk() returns NULL if a command returns a status event. However, it also returns NULL where an opcode doesn't exist in the hci_cc table because hci_cmd_complete_evt() assumes status = skb->data[0] for unknown opcodes. This leads to null-ptr-deref in cmd_sync for HCI_OP_READ_LOCAL_CODECS as there is no hci_cc for HCI_OP_READ_LOCAL_CODECS, which always assumes status = skb->data[0]. KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 1 PID: 2000 Comm: kworker/u9:5 Not tainted 6.9.0-ga6bcb805883c-dirty #10 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Workqueue: hci7 hci_power_on RIP: 0010:hci_read_supported_codecs+0xb9/0x870 net/bluetooth/hci_codec.c:138 Code: 08 48 89 ef e8 b8 c1 8f fd 48 8b 75 00 e9 96 00 00 00 49 89 c6 48 ba 00 00 00 00 00 fc ff df 4c 8d 60 70 4c 89 e3 48 c1 eb 03 <0f> b6 04 13 84 c0 0f 85 82 06 00 00 41 83 3c 24 02 77 0a e8 bf 78 RSP: 0018:ffff888120bafac8 EFLAGS: 00010212 RAX: 0000000000000000 RBX: 000000000000000e RCX: ffff8881173f0040 RDX: dffffc0000000000 RSI: ffffffffa58496c0 RDI: ffff88810b9ad1e4 RBP: ffff88810b9ac000 R08: ffffffffa77882a7 R09: 1ffffffff4ef1054 R10: dffffc0000000000 R11: fffffbfff4ef1055 R12: 0000000000000070 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88810b9ac000 FS: 0000000000000000(0000) GS:ffff8881f6c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6ddaa3439e CR3: 0000000139764003 CR4: 0000000000770ef0 PKRU: 55555554 Call Trace: hci_read_local_codecs_sync net/bluetooth/hci_sync.c:4546 [inline] hci_init_stage_sync net/bluetooth/hci_sync.c:3441 [inline] hci_init4_sync net/bluetooth/hci_sync.c:4706 [inline] hci_init_sync net/bluetooth/hci_sync.c:4742 [inline] hci_dev_init_sync net/bluetooth/hci_sync.c:4912 [inline] hci_dev_open_sync+0x19a9/0x2d30 net/bluetooth/hci_sync.c:4994 hci_dev_do_open net/bluetooth/hci_core.c:483 [inline] hci_power_on+0x11e/0x560 net/bluetooth/hci_core.c:1015 process_one_work kernel/workqueue.c:3267 [inline] process_scheduled_works+0x8ef/0x14f0 kernel/workqueue.c:3348 worker_thread+0x91f/0xe50 kernel/workqueue.c:3429 kthread+0x2cb/0x360 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Fixes: abfeea476c68 ("Bluetooth: hci_sync: Convert MGMT_OP_START_DISCOVERY") Signed-off-by: Sungwoo Kim Signed-off-by: Luiz 
Augusto von Dentz --- net/bluetooth/hci_sync.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c index ae7a5817883a..c0203a2b5107 100644 --- a/net/bluetooth/hci_sync.c +++ b/net/bluetooth/hci_sync.c @@ -206,6 +206,12 @@ struct sk_buff *__hci_cmd_sync_sk(struct hci_dev *hdev, u16 opcode, u32 plen, return ERR_PTR(err); } + /* If command return a status event skb will be set to NULL as there are + * no parameters. + */ + if (!skb) + return ERR_PTR(-ENODATA); + return skb; } EXPORT_SYMBOL(__hci_cmd_sync_sk); @@ -255,6 +261,11 @@ int __hci_cmd_sync_status_sk(struct hci_dev *hdev, u16 opcode, u32 plen, u8 status; skb = __hci_cmd_sync_sk(hdev, opcode, plen, param, event, timeout, sk); + + /* If command return a status event, skb will be set to -ENODATA */ + if (skb == ERR_PTR(-ENODATA)) + return 0; + if (IS_ERR(skb)) { if (!event) bt_dev_err(hdev, "Opcode 0x%4.4x failed: %ld", opcode, @@ -262,13 +273,6 @@ int __hci_cmd_sync_status_sk(struct hci_dev *hdev, u16 opcode, u32 plen, return PTR_ERR(skb); } - /* If command return a status event skb will be set to NULL as there are - * no parameters, in case of failure IS_ERR(skb) would have be set to - * the actual error would be found with PTR_ERR(skb). - */ - if (!skb) - return 0; - status = skb->data[0]; kfree_skb(skb); -- cgit From 101ccfbabf4738041273ce64e2b116cf440dea13 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 30 Oct 2024 18:05:12 +0800 Subject: bpf: Free dynamically allocated bits in bpf_iter_bits_destroy() bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the bits are dynamically allocated. However, the check is incorrect and may cause a kmemleak as shown below: unreferenced object 0xffff88812628c8c0 (size 32): comm "swapper/0", pid 1, jiffies 4294727320 hex dump (first 32 bytes): b0 c1 55 f5 81 88 ff ff f0 f0 f0 f0 f0 f0 f0 f0 ..U........... f0 f0 f0 f0 f0 f0 f0 f0 00 00 00 00 00 00 00 00 .............. backtrace (crc 781e32cc): [<00000000c452b4ab>] kmemleak_alloc+0x4b/0x80 [<0000000004e09f80>] __kmalloc_node_noprof+0x480/0x5c0 [<00000000597124d6>] __alloc.isra.0+0x89/0xb0 [<000000004ebfffcd>] alloc_bulk+0x2af/0x720 [<00000000d9c10145>] prefill_mem_cache+0x7f/0xb0 [<00000000ff9738ff>] bpf_mem_alloc_init+0x3e2/0x610 [<000000008b616eac>] bpf_global_ma_init+0x19/0x30 [<00000000fc473efc>] do_one_initcall+0xd3/0x3c0 [<00000000ec81498c>] kernel_init_freeable+0x66a/0x940 [<00000000b119f72f>] kernel_init+0x20/0x160 [<00000000f11ac9a7>] ret_from_fork+0x3c/0x70 [<0000000004671da4>] ret_from_fork_asm+0x1a/0x30 That is because nr_bits will be set as zero in bpf_iter_bits_next() after all bits have been iterated. Fix the issue by setting kit->bit to kit->nr_bits instead of setting kit->nr_bits to zero when the iteration completes in bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to check whether the iteration is complete and still use "nr_bits > 64" to indicate whether bits are dynamically allocated. The "!nr_bits" check is necessary because bpf_iter_bits_new() may fail before setting kit->nr_bits, and this condition will stop the iteration early instead of accessing the zeroed or freed kit->bits. Considering the initial value of kit->bits is -1 and the type of kit->nr_bits is unsigned int, change the type of kit->nr_bits to int. The potential overflow problem will be handled in the following patch. 
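A condensed view of the iterator state after this change (simplified, hypothetical types, not the exact kernel code): completion is signalled by letting bit run past nr_bits, so nr_bits keeps describing how the storage was allocated and destroy() can free it reliably.

#include <linux/bitops.h>
#include <linux/slab.h>

struct bits_iter {
        union {
                unsigned long *bits;            /* kmalloc'ed when nr_bits > 64 */
                unsigned long bits_copy;        /* inline storage otherwise */
        };
        int nr_bits;
        int bit;                                /* starts at -1 */
};

static int *bits_iter_next(struct bits_iter *it)
{
        const unsigned long *bits;

        if (!it->nr_bits || it->bit >= it->nr_bits)
                return NULL;

        bits = it->nr_bits > 64 ? it->bits : &it->bits_copy;
        it->bit = find_next_bit(bits, it->nr_bits, it->bit + 1);
        if (it->bit >= it->nr_bits)
                return NULL;                    /* nr_bits is left untouched */

        return &it->bit;
}

static void bits_iter_destroy(struct bits_iter *it)
{
        if (it->nr_bits > 64)                   /* still a reliable allocation hint */
                kfree(it->bits);
}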
Fixes: 4665415975b0 ("bpf: Add bits iterator") Acked-by: Yafang Shao Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20241030100516.3633640-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index ca3f0a2e5ed5..d913a8f1fbd9 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2856,7 +2856,7 @@ struct bpf_iter_bits_kern { unsigned long *bits; unsigned long bits_copy; }; - u32 nr_bits; + int nr_bits; int bit; } __aligned(8); @@ -2930,17 +2930,16 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w __bpf_kfunc int *bpf_iter_bits_next(struct bpf_iter_bits *it) { struct bpf_iter_bits_kern *kit = (void *)it; - u32 nr_bits = kit->nr_bits; + int bit = kit->bit, nr_bits = kit->nr_bits; const unsigned long *bits; - int bit; - if (nr_bits == 0) + if (!nr_bits || bit >= nr_bits) return NULL; bits = nr_bits == 64 ? &kit->bits_copy : kit->bits; - bit = find_next_bit(bits, nr_bits, kit->bit + 1); + bit = find_next_bit(bits, nr_bits, bit + 1); if (bit >= nr_bits) { - kit->nr_bits = 0; + kit->bit = bit; return NULL; } -- cgit From 62a898b07b83f6f407003d8a70f0827a5af08a59 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 30 Oct 2024 18:05:13 +0800 Subject: bpf: Add bpf_mem_alloc_check_size() helper Introduce bpf_mem_alloc_check_size() to check whether the allocation size exceeds the limitation for the kmalloc-equivalent allocator. The upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than non-percpu allocation, so a percpu argument is added to the helper. The helper will be used in the following patch to check whether the size parameter passed to bpf_mem_alloc() is too big. Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- include/linux/bpf_mem_alloc.h | 3 +++ kernel/bpf/memalloc.c | 14 +++++++++++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index aaf004d94322..e45162ef59bb 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -33,6 +33,9 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma, struct obj_cgroup *objcg int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); +/* Check the allocation size for kmalloc equivalent allocator */ +int bpf_mem_alloc_check_size(bool percpu, size_t size); + /* kmalloc/kfree equivalent: */ void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size); void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr); diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index b3858a76e0b3..146f5b57cfb1 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -35,6 +35,8 @@ */ #define LLIST_NODE_SZ sizeof(struct llist_node) +#define BPF_MEM_ALLOC_SIZE_MAX 4096 + /* similar to kmalloc, but sizeof == 8 bucket is gone */ static u8 size_index[24] __ro_after_init = { 3, /* 8 */ @@ -65,7 +67,7 @@ static u8 size_index[24] __ro_after_init = { static int bpf_mem_cache_idx(size_t size) { - if (!size || size > 4096) + if (!size || size > BPF_MEM_ALLOC_SIZE_MAX) return -1; if (size <= 192) @@ -1005,3 +1007,13 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags) return !ret ? 
NULL : ret + LLIST_NODE_SZ; } + +int bpf_mem_alloc_check_size(bool percpu, size_t size) +{ + /* The size of percpu allocation doesn't have LLIST_NODE_SZ overhead */ + if ((percpu && size > BPF_MEM_ALLOC_SIZE_MAX) || + (!percpu && size > BPF_MEM_ALLOC_SIZE_MAX - LLIST_NODE_SZ)) + return -E2BIG; + + return 0; +} -- cgit From 393397fbdcad7396639d7077c33f86169184ba99 Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 30 Oct 2024 18:05:14 +0800 Subject: bpf: Check the validity of nr_words in bpf_iter_bits_new() Check the validity of nr_words in bpf_iter_bits_new(). Without this check, when multiplication overflow occurs for nr_bits (e.g., when nr_words = 0x0400-0001, nr_bits becomes 64), stack corruption may occur due to bpf_probe_read_kernel_common(..., nr_bytes = 0x2000-0008). Fix it by limiting the maximum value of nr_words to 511. The value is derived from the current implementation of BPF memory allocator. To ensure compatibility if the BPF memory allocator's size limitation changes in the future, use the helper bpf_mem_alloc_check_size() to check whether nr_bytes is too larger. And return -E2BIG instead of -ENOMEM for oversized nr_bytes. Fixes: 4665415975b0 ("bpf: Add bits iterator") Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20241030100516.3633640-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index d913a8f1fbd9..018985ebc5ce 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2851,6 +2851,8 @@ struct bpf_iter_bits { __u64 __opaque[2]; } __aligned(8); +#define BITS_ITER_NR_WORDS_MAX 511 + struct bpf_iter_bits_kern { union { unsigned long *bits; @@ -2865,7 +2867,8 @@ struct bpf_iter_bits_kern { * @it: The new bpf_iter_bits to be created * @unsafe_ptr__ign: A pointer pointing to a memory area to be iterated over * @nr_words: The size of the specified memory area, measured in 8-byte units. - * Due to the limitation of memalloc, it can't be greater than 512. + * The maximum value of @nr_words is @BITS_ITER_NR_WORDS_MAX. This limit may be + * further reduced by the BPF memory allocator implementation. * * This function initializes a new bpf_iter_bits structure for iterating over * a memory area which is specified by the @unsafe_ptr__ign and @nr_words. It @@ -2892,6 +2895,8 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w if (!unsafe_ptr__ign || !nr_words) return -EINVAL; + if (nr_words > BITS_ITER_NR_WORDS_MAX) + return -E2BIG; /* Optimization for u64 mask */ if (nr_bits == 64) { @@ -2903,6 +2908,9 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w return 0; } + if (bpf_mem_alloc_check_size(false, nr_bytes)) + return -E2BIG; + /* Fallback to memalloc */ kit->bits = bpf_mem_alloc(&bpf_global_ma, nr_bytes); if (!kit->bits) -- cgit From e1339383675063ae4760d81ffe13a79981841b8d Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 30 Oct 2024 18:05:15 +0800 Subject: bpf: Use __u64 to save the bits in bits iterator On 32-bit hosts (e.g., arm32), when a bpf program passes a u64 to bpf_iter_bits_new(), bpf_iter_bits_new() will use bits_copy to store the content of the u64. However, bits_copy is only 4 bytes, leading to stack corruption. The straightforward solution would be to replace u64 with unsigned long in bpf_iter_bits_new(). 
However, this introduces confusion and problems for 32-bit hosts because the size of ulong in bpf program is 8 bytes, but it is treated as 4-bytes after passed to bpf_iter_bits_new(). Fix it by changing the type of both bits and bit_count from unsigned long to u64. However, the change is not enough. The main reason is that bpf_iter_bits_next() uses find_next_bit() to find the next bit and the pointer passed to find_next_bit() is an unsigned long pointer instead of a u64 pointer. For 32-bit little-endian host, it is fine but it is not the case for 32-bit big-endian host. Because under 32-bit big-endian host, the first iterated unsigned long will be the bits 32-63 of the u64 instead of the expected bits 0-31. Therefore, in addition to changing the type, swap the two unsigned longs within the u64 for 32-bit big-endian host. Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20241030100516.3633640-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- kernel/bpf/helpers.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 018985ebc5ce..3d45ebe8afb4 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2855,13 +2855,36 @@ struct bpf_iter_bits { struct bpf_iter_bits_kern { union { - unsigned long *bits; - unsigned long bits_copy; + __u64 *bits; + __u64 bits_copy; }; int nr_bits; int bit; } __aligned(8); +/* On 64-bit hosts, unsigned long and u64 have the same size, so passing + * a u64 pointer and an unsigned long pointer to find_next_bit() will + * return the same result, as both point to the same 8-byte area. + * + * For 32-bit little-endian hosts, using a u64 pointer or unsigned long + * pointer also makes no difference. This is because the first iterated + * unsigned long is composed of bits 0-31 of the u64 and the second unsigned + * long is composed of bits 32-63 of the u64. + * + * However, for 32-bit big-endian hosts, this is not the case. The first + * iterated unsigned long will be bits 32-63 of the u64, so swap these two + * ulong values within the u64. + */ +static void swap_ulong_in_u64(u64 *bits, unsigned int nr) +{ +#if (BITS_PER_LONG == 32) && defined(__BIG_ENDIAN) + unsigned int i; + + for (i = 0; i < nr; i++) + bits[i] = (bits[i] >> 32) | ((u64)(u32)bits[i] << 32); +#endif +} + /** * bpf_iter_bits_new() - Initialize a new bits iterator for a given memory area * @it: The new bpf_iter_bits to be created @@ -2904,6 +2927,8 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w if (err) return -EFAULT; + swap_ulong_in_u64(&kit->bits_copy, nr_words); + kit->nr_bits = nr_bits; return 0; } @@ -2922,6 +2947,8 @@ bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, u32 nr_w return err; } + swap_ulong_in_u64(kit->bits, nr_words); + kit->nr_bits = nr_bits; return 0; } @@ -2939,7 +2966,7 @@ __bpf_kfunc int *bpf_iter_bits_next(struct bpf_iter_bits *it) { struct bpf_iter_bits_kern *kit = (void *)it; int bit = kit->bit, nr_bits = kit->nr_bits; - const unsigned long *bits; + const void *bits; if (!nr_bits || bit >= nr_bits) return NULL; -- cgit From ebafc1e535db19505aec3b94a4a641fe735a2eac Mon Sep 17 00:00:00 2001 From: Hou Tao Date: Wed, 30 Oct 2024 18:05:16 +0800 Subject: selftests/bpf: Add three test cases for bits_iter Add more test cases for bits iterator: (1) huge word test Verify the multiplication overflow of nr_bits in bits_iter. 
Without the overflow check, when nr_words is 67108865, nr_bits becomes 64, causing bpf_probe_read_kernel_common() to corrupt the stack. (2) max word test Verify correct handling of maximum nr_words value (511). (3) bad word test Verify early termination of bits iteration when bits iterator initialization fails. Also rename bits_nomem to bits_too_big to better reflect its purpose. Signed-off-by: Hou Tao Link: https://lore.kernel.org/r/20241030100516.3633640-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov --- .../selftests/bpf/progs/verifier_bits_iter.c | 61 ++++++++++++++++++++-- 1 file changed, 58 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c index f4da4d508ddb..156cc278e2fc 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bits_iter.c +++ b/tools/testing/selftests/bpf/progs/verifier_bits_iter.c @@ -15,6 +15,8 @@ int bpf_iter_bits_new(struct bpf_iter_bits *it, const u64 *unsafe_ptr__ign, int *bpf_iter_bits_next(struct bpf_iter_bits *it) __ksym __weak; void bpf_iter_bits_destroy(struct bpf_iter_bits *it) __ksym __weak; +u64 bits_array[511] = {}; + SEC("iter.s/cgroup") __description("bits iter without destroy") __failure __msg("Unreleased reference") @@ -110,16 +112,16 @@ int bit_index(void) } SEC("syscall") -__description("bits nomem") +__description("bits too big") __success __retval(0) -int bits_nomem(void) +int bits_too_big(void) { u64 data[4]; int nr = 0; int *bit; __builtin_memset(&data, 0xff, sizeof(data)); - bpf_for_each(bits, bit, &data[0], 513) /* Be greater than 512 */ + bpf_for_each(bits, bit, &data[0], 512) /* Be greater than 511 */ nr++; return nr; } @@ -151,3 +153,56 @@ int zero_words(void) nr++; return nr; } + +SEC("syscall") +__description("huge words") +__success __retval(0) +int huge_words(void) +{ + u64 data[8] = {0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1}; + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, &data[0], 67108865) + nr++; + return nr; +} + +SEC("syscall") +__description("max words") +__success __retval(4) +int max_words(void) +{ + volatile int nr = 0; + int *bit; + + bits_array[0] = (1ULL << 63) | 1U; + bits_array[510] = (1ULL << 33) | (1ULL << 32); + + bpf_for_each(bits, bit, bits_array, 511) { + if (nr == 0 && *bit != 0) + break; + if (nr == 2 && *bit != 32672) + break; + nr++; + } + return nr; +} + +SEC("syscall") +__description("bad words") +__success __retval(0) +int bad_words(void) +{ + void *bad_addr = (void *)(3UL << 30); + int nr = 0; + int *bit; + + bpf_for_each(bits, bit, bad_addr, 1) + nr++; + + bpf_for_each(bits, bit, bad_addr, 4) + nr++; + + return nr; +} -- cgit From 63a81588cd2025e75fbaf30b65930b76825c456f Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Wed, 30 Oct 2024 16:11:30 -0400 Subject: rpcrdma: Always release the rpcrdma_device's xa_array Dai pointed out that the xa_init_flags() in rpcrdma_add_one() needs to have a matching xa_destroy() in rpcrdma_remove_one() to release underlying memory that the xarray might have accrued during operation. 
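The pairing being restored, as a minimal sketch (hypothetical structure and function names rather than the rpcrdma ones): an xa_init_flags() on a dynamically allocated object needs a matching xa_destroy() before the object is freed, so that any memory the xarray accrued during operation is released.

#include <linux/slab.h>
#include <linux/xarray.h>

struct rd_device {                      /* stand-in for the rpcrdma device */
        struct xarray rd_xa;
};

static struct rd_device *rd_add_one(void)
{
        struct rd_device *rd = kzalloc(sizeof(*rd), GFP_KERNEL);

        if (!rd)
                return NULL;
        xa_init_flags(&rd->rd_xa, XA_FLAGS_ALLOC);
        return rd;
}

static void rd_remove_one(struct rd_device *rd)
{
        xa_destroy(&rd->rd_xa);         /* matches the xa_init_flags() above */
        kfree(rd);
}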
Reported-by: Dai Ngo Fixes: 7e86845a0346 ("rpcrdma: Implement generic device removal") Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/ib_client.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/sunrpc/xprtrdma/ib_client.c b/net/sunrpc/xprtrdma/ib_client.c index 8507cd4d8921..28c68b5f6823 100644 --- a/net/sunrpc/xprtrdma/ib_client.c +++ b/net/sunrpc/xprtrdma/ib_client.c @@ -153,6 +153,7 @@ static void rpcrdma_remove_one(struct ib_device *device, } trace_rpcrdma_client_remove_one_done(device); + xa_destroy(&rd->rd_xa); kfree(rd); } -- cgit From 0fc810ae3ae110f9e2fcccce80fc8c8d62f97907 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 29 Oct 2024 16:03:31 -1000 Subject: x86/uaccess: Avoid barrier_nospec() in 64-bit copy_from_user() The barrier_nospec() in 64-bit copy_from_user() is slow. Instead use pointer masking to force the user pointer to all 1's for an invalid address. The kernel test robot reports a 2.6% improvement in the per_thread_ops benchmark [1]. This is a variation on a patch originally by Josh Poimboeuf [2]. Link: https://lore.kernel.org/202410281344.d02c72a2-oliver.sang@intel.com [1] Link: https://lore.kernel.org/5b887fe4c580214900e21f6c61095adf9a142735.1730166635.git.jpoimboe@kernel.org [2] Tested-and-reviewed-by: Josh Poimboeuf Cc: Kirill A. Shutemov Signed-off-by: Linus Torvalds --- include/linux/uaccess.h | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 39c7cf82b0c2..43844510d5d0 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -38,6 +38,7 @@ #else #define can_do_masked_user_access() 0 #define masked_user_access_begin(src) NULL + #define mask_user_address(src) (src) #endif /* @@ -159,19 +160,27 @@ _inline_copy_from_user(void *to, const void __user *from, unsigned long n) { unsigned long res = n; might_fault(); - if (!should_fail_usercopy() && likely(access_ok(from, n))) { + if (should_fail_usercopy()) + goto fail; + if (can_do_masked_user_access()) + from = mask_user_address(from); + else { + if (!access_ok(from, n)) + goto fail; /* * Ensure that bad access_ok() speculation will not * lead to nasty side effects *after* the copy is * finished: */ barrier_nospec(); - instrument_copy_from_user_before(to, from, n); - res = raw_copy_from_user(to, from, n); - instrument_copy_from_user_after(to, from, n, res); } - if (unlikely(res)) - memset(to + (n - res), 0, res); + instrument_copy_from_user_before(to, from, n); + res = raw_copy_from_user(to, from, n); + instrument_copy_from_user_after(to, from, n, res); + if (likely(!res)) + return 0; +fail: + memset(to + (n - res), 0, res); return res; } extern __must_check unsigned long -- cgit From 69d5e722be949a1e2409c3f2865ba6020c279db6 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 30 Oct 2024 11:49:34 +0100 Subject: sched/ext: Fix scx vs sched_delayed Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") forgot about scx :/ Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") Reported-by: Tejun Heo Signed-off-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo Link: https://lkml.kernel.org/r/20241030104934.GK14555@noisy.programming.kicks-ass.net --- kernel/sched/ext.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 40bdfe84e4f0..721a75489ff5 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4489,11 +4489,16 @@ static void scx_ops_disable_workfn(struct kthread_work 
*work) scx_task_iter_start(&sti); while ((p = scx_task_iter_next_locked(&sti))) { const struct sched_class *old_class = p->sched_class; + const struct sched_class *new_class = + __setscheduler_class(p->policy, p->prio); struct sched_enq_and_set_ctx ctx; + if (old_class != new_class && p->se.sched_delayed) + dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); - p->sched_class = __setscheduler_class(p->policy, p->prio); + p->sched_class = new_class; check_class_changing(task_rq(p), p, old_class); sched_enq_and_set_task(&ctx); @@ -5199,12 +5204,17 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) scx_task_iter_start(&sti); while ((p = scx_task_iter_next_locked(&sti))) { const struct sched_class *old_class = p->sched_class; + const struct sched_class *new_class = + __setscheduler_class(p->policy, p->prio); struct sched_enq_and_set_ctx ctx; + if (old_class != new_class && p->se.sched_delayed) + dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx); p->scx.slice = SCX_SLICE_DFL; - p->sched_class = __setscheduler_class(p->policy, p->prio); + p->sched_class = new_class; check_class_changing(task_rq(p), p, old_class); sched_enq_and_set_task(&ctx); -- cgit From 2313ab74c3004089ecac5f0f91f7274829f3825b Mon Sep 17 00:00:00 2001 From: Alice Ryhl Date: Wed, 30 Oct 2024 10:31:34 +0000 Subject: cfi: tweak llvm version for HAVE_CFI_ICALL_NORMALIZE_INTEGERS The llvm fix [1] did not make it for 19.0.0, but ended up getting backported to llvm 19.1.3 [2]. Thus, fix the version requirement to correctly specify which versions have the bug. Link: https://github.com/llvm/llvm-project/pull/104826 [1] Link: https://github.com/llvm/llvm-project/pull/113938 [2] Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202410281414.c351044e-oliver.sang@intel.com Fixes: 8b8ca9c25fe6 ("cfi: fix conditions for HAVE_CFI_ICALL_NORMALIZE_INTEGERS") Signed-off-by: Alice Ryhl Reviewed-by: Sami Tolvanen Link: https://lore.kernel.org/r/20241030-cfi-icall-1913-v1-1-ab8a26e13733@google.com Signed-off-by: Miguel Ojeda --- arch/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 00163e4a237c..bd9f095d69fa 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -855,14 +855,14 @@ config HAVE_CFI_ICALL_NORMALIZE_INTEGERS_CLANG def_bool y depends on $(cc-option,-fsanitize=kcfi -fsanitize-cfi-icall-experimental-normalize-integers) # With GCOV/KASAN we need this fix: https://github.com/llvm/llvm-project/pull/104826 - depends on CLANG_VERSION >= 190000 || (!GCOV_KERNEL && !KASAN_GENERIC && !KASAN_SW_TAGS) + depends on CLANG_VERSION >= 190103 || (!GCOV_KERNEL && !KASAN_GENERIC && !KASAN_SW_TAGS) config HAVE_CFI_ICALL_NORMALIZE_INTEGERS_RUSTC def_bool y depends on HAVE_CFI_ICALL_NORMALIZE_INTEGERS_CLANG depends on RUSTC_VERSION >= 107900 # With GCOV/KASAN we need this fix: https://github.com/rust-lang/rust/pull/129373 - depends on (RUSTC_LLVM_VERSION >= 190000 && RUSTC_VERSION >= 108200) || \ + depends on (RUSTC_LLVM_VERSION >= 190103 && RUSTC_VERSION >= 108200) || \ (!GCOV_KERNEL && !KASAN_GENERIC && !KASAN_SW_TAGS) config CFI_PERMISSIVE -- cgit From 04c20a9356f283da623903e81e7c6d5df7e4dc3c Mon Sep 17 00:00:00 2001 From: Benoît Monin Date: Thu, 24 Oct 2024 16:01:54 +0200 Subject: net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension MIME-Version: 1.0 Content-Type: text/plain; 
charset=UTF-8 Content-Transfer-Encoding: 8bit As documented in skbuff.h, devices with NETIF_F_IPV6_CSUM capability can only checksum TCP and UDP over IPv6 if the IP header does not contain extension headers. This is enforced for UDP packets emitted from user-space to an IPv6 address as they go through ip6_make_skb(), which calls __ip6_append_data() where a check is done on the header size before setting CHECKSUM_PARTIAL. But the introduction of UDP encapsulation with fou6 added a code-path where it is possible to get an skb with a partial UDP checksum and an IPv6 header with an extension header: * fou6 adds a UDP header with a partial checksum if the inner packet does not contain a valid checksum. * ip6_tunnel adds an IPv6 header with a destination option extension header if encap_limit is non-zero (the default value is 4). The thread linked below describes in more detail how to reproduce the problem with a GRE-in-UDP tunnel. Add a check on the network header size in skb_csum_hwoffload_help() to make sure no IPv6 packet with an extension header is handed to a network device with NETIF_F_IPV6_CSUM capability.
Link: https://lore.kernel.org/netdev/26548921.1r3eYUQgxm@benoit.monin/T/#u Fixes: aa3463d65e7b ("fou: Add encap ops for IPv6 tunnels") Signed-off-by: Benoît Monin Reviewed-by: Willem de Bruijn Link: https://patch.msgid.link/5fbeecfc311ea182aa1d1c771725ab8b4cac515e.1729778144.git.benoit.monin@gmx.fr Signed-off-by: Jakub Kicinski --- net/core/dev.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c index ea5fbcd133ae..8453e14d301b 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3639,6 +3639,9 @@ int skb_csum_hwoffload_help(struct sk_buff *skb, return 0; if (features & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) { + if (vlan_get_protocol(skb) == htons(ETH_P_IPV6) && + skb_network_header_len(skb) != sizeof(struct ipv6hdr)) + goto sw_checksum; switch (skb->csum_offset) { case offsetof(struct tcphdr, check): case offsetof(struct udphdr, check): @@ -3646,6 +3649,7 @@ int skb_csum_hwoffload_help(struct sk_buff *skb, } } +sw_checksum: return skb_checksum_help(skb); } EXPORT_SYMBOL(skb_csum_hwoffload_help); -- cgit
From 0a66e5582b5102c4d7b866b977ff7c850c1174ce Mon Sep 17 00:00:00 2001 From: Amit Cohen Date: Fri, 25 Oct 2024 16:26:25 +0200 Subject: mlxsw: spectrum_ptp: Add missing verification before pushing Tx header A Tx header should be pushed for each packet transmitted via Spectrum ASICs. The cited commit moved the call to skb_cow_head() from mlxsw_sp_port_xmit() to the functions which handle the Tx header. In the case where mlxsw_sp->ptp_ops->txhdr_construct() is used to handle the Tx header, and txhdr_construct() is mlxsw_sp_ptp_txhdr_construct(), there is no call to skb_cow_head() before pushing the Tx header size to the SKB. This flow is relevant for Spectrum-1 and Spectrum-4, for PTP packets. Add the missing call to skb_cow_head() to make sure that there is both enough room to push the Tx header and that the SKB header is not cloned and can be modified. An additional set will be sent to net-next to centralize the handling of the Tx header by pushing it to every packet just before transmission.
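The underlying pattern, as a small generic sketch (a hypothetical helper, not the mlxsw code): before pushing a driver-specific header, make sure the skb has enough headroom and an unshared header, and drop the packet if that cannot be guaranteed.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical helper illustrating the skb_cow_head() + skb_push() pairing */
static int push_tx_header(struct sk_buff *skb, unsigned int hdr_len)
{
        if (skb_cow_head(skb, hdr_len)) {
                dev_kfree_skb_any(skb);
                return -ENOMEM;
        }

        /* reserve and zero the header; a real driver then fills in its fields */
        memset(skb_push(skb, hdr_len), 0, hdr_len);
        return 0;
}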
Cc: Richard Cochran Fixes: 24157bc69f45 ("mlxsw: Send PTP packets as data packets to overcome a limitation") Signed-off-by: Amit Cohen Signed-off-by: Petr Machata Link: https://patch.msgid.link/5145780b07ebbb5d3b3570f311254a3a2d554a44.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c index 5b174cb95eb8..d94081c7658e 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c @@ -16,6 +16,7 @@ #include "spectrum.h" #include "spectrum_ptp.h" #include "core.h" +#include "txheader.h" #define MLXSW_SP1_PTP_CLOCK_CYCLES_SHIFT 29 #define MLXSW_SP1_PTP_CLOCK_FREQ_KHZ 156257 /* 6.4nSec */ @@ -1684,6 +1685,12 @@ int mlxsw_sp_ptp_txhdr_construct(struct mlxsw_core *mlxsw_core, struct sk_buff *skb, const struct mlxsw_tx_info *tx_info) { + if (skb_cow_head(skb, MLXSW_TXHDR_LEN)) { + this_cpu_inc(mlxsw_sp_port->pcpu_stats->tx_dropped); + dev_kfree_skb_any(skb); + return -ENOMEM; + } + mlxsw_sp_txhdr_construct(skb, tx_info); return 0; } -- cgit From 15f73e601a9c67aa83bde92b2d940a6532d8614d Mon Sep 17 00:00:00 2001 From: Amit Cohen Date: Fri, 25 Oct 2024 16:26:26 +0200 Subject: mlxsw: pci: Sync Rx buffers for CPU When Rx packet is received, drivers should sync the pages for CPU, to ensure the CPU reads the data written by the device and not stale data from its cache. Add the missing sync call in Rx path, sync the actual length of data for each fragment. Cc: Jiri Pirko Fixes: b5b60bb491b2 ("mlxsw: pci: Use page pool for Rx buffers allocation") Signed-off-by: Amit Cohen Reviewed-by: Ido Schimmel Signed-off-by: Petr Machata Link: https://patch.msgid.link/461486fac91755ca4e04c2068c102250026dcd0b.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlxsw/pci.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c index 060e5b939211..2320a5f323b4 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/pci.c +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c @@ -389,15 +389,27 @@ static void mlxsw_pci_wqe_frag_unmap(struct mlxsw_pci *mlxsw_pci, char *wqe, dma_unmap_single(&pdev->dev, mapaddr, frag_len, direction); } -static struct sk_buff *mlxsw_pci_rdq_build_skb(struct page *pages[], +static struct sk_buff *mlxsw_pci_rdq_build_skb(struct mlxsw_pci_queue *q, + struct page *pages[], u16 byte_count) { + struct mlxsw_pci_queue *cq = q->u.rdq.cq; unsigned int linear_data_size; + struct page_pool *page_pool; struct sk_buff *skb; int page_index = 0; bool linear_only; void *data; + linear_only = byte_count + MLXSW_PCI_RX_BUF_SW_OVERHEAD <= PAGE_SIZE; + linear_data_size = linear_only ? byte_count : + PAGE_SIZE - + MLXSW_PCI_RX_BUF_SW_OVERHEAD; + + page_pool = cq->u.cq.page_pool; + page_pool_dma_sync_for_cpu(page_pool, pages[page_index], + MLXSW_PCI_SKB_HEADROOM, linear_data_size); + data = page_address(pages[page_index]); net_prefetch(data); @@ -405,11 +417,6 @@ static struct sk_buff *mlxsw_pci_rdq_build_skb(struct page *pages[], if (unlikely(!skb)) return ERR_PTR(-ENOMEM); - linear_only = byte_count + MLXSW_PCI_RX_BUF_SW_OVERHEAD <= PAGE_SIZE; - linear_data_size = linear_only ? 
byte_count : - PAGE_SIZE - - MLXSW_PCI_RX_BUF_SW_OVERHEAD; - skb_reserve(skb, MLXSW_PCI_SKB_HEADROOM); skb_put(skb, linear_data_size); @@ -425,6 +432,7 @@ static struct sk_buff *mlxsw_pci_rdq_build_skb(struct page *pages[], page = pages[page_index]; frag_size = min(byte_count, PAGE_SIZE); + page_pool_dma_sync_for_cpu(page_pool, page, 0, frag_size); skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, 0, frag_size, PAGE_SIZE); byte_count -= frag_size; @@ -760,7 +768,7 @@ static void mlxsw_pci_cqe_rdq_handle(struct mlxsw_pci *mlxsw_pci, if (err) goto out; - skb = mlxsw_pci_rdq_build_skb(pages, byte_count); + skb = mlxsw_pci_rdq_build_skb(q, pages, byte_count); if (IS_ERR(skb)) { dev_err_ratelimited(&pdev->dev, "Failed to build skb for RDQ\n"); mlxsw_pci_rdq_pages_recycle(q, pages, num_sg_entries); -- cgit From d0fbdc3ae9ecc614ddffde55dccbcacef353da0b Mon Sep 17 00:00:00 2001 From: Amit Cohen Date: Fri, 25 Oct 2024 16:26:27 +0200 Subject: mlxsw: pci: Sync Rx buffers for device Non-coherent architectures, like ARM, may require invalidating caches before the device can use the DMA mapped memory, which means that before posting pages to device, drivers should sync the memory for device. Sync for device can be configured as page pool responsibility. Set the relevant flag and define max_len for sync. Cc: Jiri Pirko Fixes: b5b60bb491b2 ("mlxsw: pci: Use page pool for Rx buffers allocation") Signed-off-by: Amit Cohen Reviewed-by: Ido Schimmel Signed-off-by: Petr Machata Link: https://patch.msgid.link/92e01f05c4f506a4f0a9b39c10175dcc01994910.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mellanox/mlxsw/pci.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c index 2320a5f323b4..d6f37456fb31 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/pci.c +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c @@ -996,12 +996,13 @@ static int mlxsw_pci_cq_page_pool_init(struct mlxsw_pci_queue *q, if (cq_type != MLXSW_PCI_CQ_RDQ) return 0; - pp_params.flags = PP_FLAG_DMA_MAP; + pp_params.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; pp_params.pool_size = MLXSW_PCI_WQE_COUNT * mlxsw_pci->num_sg_entries; pp_params.nid = dev_to_node(&mlxsw_pci->pdev->dev); pp_params.dev = &mlxsw_pci->pdev->dev; pp_params.napi = &q->u.cq.napi; pp_params.dma_dir = DMA_FROM_DEVICE; + pp_params.max_len = PAGE_SIZE; page_pool = page_pool_create(&pp_params); if (IS_ERR(page_pool)) -- cgit From 12ae97c531fcd3bfd774d4dfeaeac23eafe24280 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Fri, 25 Oct 2024 16:26:28 +0200 Subject: mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address The device stores IPv6 addresses that are used for encapsulation in linear memory that is managed by the driver. Changing the remote address of an ip6gre net device never worked properly, but since cited commit the following reproducer [1] would result in a warning [2] and a memory leak [3]. The problem is that the new remote address is never added by the driver to its hash table (and therefore the device) and the old address is never removed from it. Fix by programming the new address when the configuration of the ip6gre net device changes and removing the old one. If the address did not change, then the above would result in increasing the reference count of the address and then decreasing it. 
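A minimal sketch of the make-before-break ordering the fix follows (hypothetical names and types, trimmed to the essentials): take a reference on the new address, apply the change, then drop the reference on the old address; on failure, drop the new reference and keep the old state.

#include <linux/in6.h>
#include <linux/types.h>

struct tunnel {                         /* stand-in for the ipip entry */
        struct in6_addr daddr;
        u32 daddr_index;
};

/* hypothetical helpers: get/put a reference on a programmed address */
int addr_index_get(const struct in6_addr *addr, u32 *index);
void addr_put(const struct in6_addr *addr);
int apply_remote(struct tunnel *t, const struct in6_addr *daddr, u32 index);

static int change_remote_addr(struct tunnel *t, const struct in6_addr *new_daddr)
{
        struct in6_addr old_daddr = t->daddr;
        u32 new_index;
        int err;

        err = addr_index_get(new_daddr, &new_index);    /* takes a reference */
        if (err)
                return err;

        err = apply_remote(t, new_daddr, new_index);
        if (err)
                goto err_apply;

        addr_put(&old_daddr);                           /* release the old address */
        return 0;

err_apply:
        addr_put(new_daddr);
        return err;
}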
[1] # ip link add name bla up type ip6gre local 2001:db8:1::1 remote 2001:db8:2::1 tos inherit ttl inherit # ip link set dev bla type ip6gre remote 2001:db8:3::1 # ip link del dev bla # devlink dev reload pci/0000:01:00.0 [2] WARNING: CPU: 0 PID: 1682 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3002 mlxsw_sp_ipv6_addr_put+0x140/0x1d0 Modules linked in: CPU: 0 UID: 0 PID: 1682 Comm: ip Not tainted 6.12.0-rc3-custom-g86b5b55bc835 #151 Hardware name: Nvidia SN5600/VMOD0013, BIOS 5.13 05/31/2023 RIP: 0010:mlxsw_sp_ipv6_addr_put+0x140/0x1d0 [...] Call Trace: mlxsw_sp_router_netdevice_event+0x55f/0x1240 notifier_call_chain+0x5a/0xd0 call_netdevice_notifiers_info+0x39/0x90 unregister_netdevice_many_notify+0x63e/0x9d0 rtnl_dellink+0x16b/0x3a0 rtnetlink_rcv_msg+0x142/0x3f0 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x242/0x390 netlink_sendmsg+0x1de/0x420 ____sys_sendmsg+0x2bd/0x320 ___sys_sendmsg+0x9a/0xe0 __sys_sendmsg+0x7a/0xd0 do_syscall_64+0x9e/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f [3] unreferenced object 0xffff898081f597a0 (size 32): comm "ip", pid 1626, jiffies 4294719324 hex dump (first 32 bytes): 20 01 0d b8 00 02 00 00 00 00 00 00 00 00 00 01 ............... 21 49 61 83 80 89 ff ff 00 00 00 00 01 00 00 00 !Ia............. backtrace (crc fd9be911): [<00000000df89c55d>] __kmalloc_cache_noprof+0x1da/0x260 [<00000000ff2a1ddb>] mlxsw_sp_ipv6_addr_kvdl_index_get+0x281/0x340 [<000000009ddd445d>] mlxsw_sp_router_netdevice_event+0x47b/0x1240 [<00000000743e7757>] notifier_call_chain+0x5a/0xd0 [<000000007c7b9e13>] call_netdevice_notifiers_info+0x39/0x90 [<000000002509645d>] register_netdevice+0x5f7/0x7a0 [<00000000c2e7d2a9>] ip6gre_newlink_common.isra.0+0x65/0x130 [<0000000087cd6d8d>] ip6gre_newlink+0x72/0x120 [<000000004df7c7cc>] rtnl_newlink+0x471/0xa20 [<0000000057ed632a>] rtnetlink_rcv_msg+0x142/0x3f0 [<0000000032e0d5b5>] netlink_rcv_skb+0x50/0x100 [<00000000908bca63>] netlink_unicast+0x242/0x390 [<00000000cdbe1c87>] netlink_sendmsg+0x1de/0x420 [<0000000011db153e>] ____sys_sendmsg+0x2bd/0x320 [<000000003b6d53eb>] ___sys_sendmsg+0x9a/0xe0 [<00000000cae27c62>] __sys_sendmsg+0x7a/0xd0 Fixes: cf42911523e0 ("mlxsw: spectrum_ipip: Use common hash table for IPv6 address mapping") Reported-by: Maksym Yaremchuk Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata Link: https://patch.msgid.link/e91012edc5a6cb9df37b78fd377f669381facfcb.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- .../net/ethernet/mellanox/mlxsw/spectrum_ipip.c | 26 ++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c index d761a1235994..7ea798a4949e 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c @@ -481,11 +481,33 @@ mlxsw_sp_ipip_ol_netdev_change_gre6(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_ipip_entry *ipip_entry, struct netlink_ext_ack *extack) { + u32 new_kvdl_index, old_kvdl_index = ipip_entry->dip_kvdl_index; + struct in6_addr old_addr6 = ipip_entry->parms.daddr.addr6; struct mlxsw_sp_ipip_parms new_parms; + int err; new_parms = mlxsw_sp_ipip_netdev_parms_init_gre6(ipip_entry->ol_dev); - return mlxsw_sp_ipip_ol_netdev_change_gre(mlxsw_sp, ipip_entry, - &new_parms, extack); + + err = mlxsw_sp_ipv6_addr_kvdl_index_get(mlxsw_sp, + &new_parms.daddr.addr6, + &new_kvdl_index); + if (err) + return err; + ipip_entry->dip_kvdl_index = new_kvdl_index; + + err = 
mlxsw_sp_ipip_ol_netdev_change_gre(mlxsw_sp, ipip_entry, + &new_parms, extack); + if (err) + goto err_change_gre; + + mlxsw_sp_ipv6_addr_put(mlxsw_sp, &old_addr6); + + return 0; + +err_change_gre: + ipip_entry->dip_kvdl_index = old_kvdl_index; + mlxsw_sp_ipv6_addr_put(mlxsw_sp, &new_parms.daddr.addr6); + return err; } static int -- cgit From d7bd61fa0222db1cdc01d66bec2477c9fdfa6d4f Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Fri, 25 Oct 2024 16:26:29 +0200 Subject: selftests: forwarding: Add IPv6 GRE remote change tests Test that after changing the remote address of an ip6gre net device traffic is forwarded as expected. Test with both flat and hierarchical topologies and with and without an input / output keys. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata Link: https://patch.msgid.link/02b05246d2cdada0cf2fccffc0faa8a424d0f51b.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski --- .../selftests/net/forwarding/ip6gre_flat.sh | 14 ++++ .../selftests/net/forwarding/ip6gre_flat_key.sh | 14 ++++ .../selftests/net/forwarding/ip6gre_flat_keys.sh | 14 ++++ .../selftests/net/forwarding/ip6gre_hier.sh | 14 ++++ .../selftests/net/forwarding/ip6gre_hier_key.sh | 14 ++++ .../selftests/net/forwarding/ip6gre_hier_keys.sh | 14 ++++ .../testing/selftests/net/forwarding/ip6gre_lib.sh | 80 ++++++++++++++++++++++ 7 files changed, 164 insertions(+) diff --git a/tools/testing/selftests/net/forwarding/ip6gre_flat.sh b/tools/testing/selftests/net/forwarding/ip6gre_flat.sh index 96c97064f2d3..becc7c3fc809 100755 --- a/tools/testing/selftests/net/forwarding/ip6gre_flat.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_flat.sh @@ -8,6 +8,7 @@ ALL_TESTS=" gre_flat gre_mtu_change + gre_flat_remote_change " NUM_NETIFS=6 @@ -44,6 +45,19 @@ gre_mtu_change() test_mtu_change } +gre_flat_remote_change() +{ + flat_remote_change + + test_traffic_ip4ip6 "GRE flat IPv4-in-IPv6 (new remote)" + test_traffic_ip6ip6 "GRE flat IPv6-in-IPv6 (new remote)" + + flat_remote_restore + + test_traffic_ip4ip6 "GRE flat IPv4-in-IPv6 (old remote)" + test_traffic_ip6ip6 "GRE flat IPv6-in-IPv6 (old remote)" +} + cleanup() { pre_cleanup diff --git a/tools/testing/selftests/net/forwarding/ip6gre_flat_key.sh b/tools/testing/selftests/net/forwarding/ip6gre_flat_key.sh index ff9fb0db9bd1..e5335116a2fd 100755 --- a/tools/testing/selftests/net/forwarding/ip6gre_flat_key.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_flat_key.sh @@ -8,6 +8,7 @@ ALL_TESTS=" gre_flat gre_mtu_change + gre_flat_remote_change " NUM_NETIFS=6 @@ -44,6 +45,19 @@ gre_mtu_change() test_mtu_change } +gre_flat_remote_change() +{ + flat_remote_change + + test_traffic_ip4ip6 "GRE flat IPv4-in-IPv6 with key (new remote)" + test_traffic_ip6ip6 "GRE flat IPv6-in-IPv6 with key (new remote)" + + flat_remote_restore + + test_traffic_ip4ip6 "GRE flat IPv4-in-IPv6 with key (old remote)" + test_traffic_ip6ip6 "GRE flat IPv6-in-IPv6 with key (old remote)" +} + cleanup() { pre_cleanup diff --git a/tools/testing/selftests/net/forwarding/ip6gre_flat_keys.sh b/tools/testing/selftests/net/forwarding/ip6gre_flat_keys.sh index 12c138785242..7e0cbfdefab0 100755 --- a/tools/testing/selftests/net/forwarding/ip6gre_flat_keys.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_flat_keys.sh @@ -8,6 +8,7 @@ ALL_TESTS=" gre_flat gre_mtu_change + gre_flat_remote_change " NUM_NETIFS=6 @@ -44,6 +45,19 @@ gre_mtu_change() test_mtu_change gre } +gre_flat_remote_change() +{ + flat_remote_change + + test_traffic_ip4ip6 "GRE flat IPv4-in-IPv6 with 
ikey/okey (new remote)" + test_traffic_ip6ip6 "GRE flat IPv6-in-IPv6 with ikey/okey (new remote)" + + flat_remote_restore + + test_traffic_ip4ip6 "GRE flat IPv4-in-IPv6 with ikey/okey (old remote)" + test_traffic_ip6ip6 "GRE flat IPv6-in-IPv6 with ikey/okey (old remote)" +} + cleanup() { pre_cleanup diff --git a/tools/testing/selftests/net/forwarding/ip6gre_hier.sh b/tools/testing/selftests/net/forwarding/ip6gre_hier.sh index 83b55c30a5c3..e0844495f3d1 100755 --- a/tools/testing/selftests/net/forwarding/ip6gre_hier.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_hier.sh @@ -8,6 +8,7 @@ ALL_TESTS=" gre_hier gre_mtu_change + gre_hier_remote_change " NUM_NETIFS=6 @@ -44,6 +45,19 @@ gre_mtu_change() test_mtu_change gre } +gre_hier_remote_change() +{ + hier_remote_change + + test_traffic_ip4ip6 "GRE hierarchical IPv4-in-IPv6 (new remote)" + test_traffic_ip6ip6 "GRE hierarchical IPv6-in-IPv6 (new remote)" + + hier_remote_restore + + test_traffic_ip4ip6 "GRE hierarchical IPv4-in-IPv6 (old remote)" + test_traffic_ip6ip6 "GRE hierarchical IPv6-in-IPv6 (old remote)" +} + cleanup() { pre_cleanup diff --git a/tools/testing/selftests/net/forwarding/ip6gre_hier_key.sh b/tools/testing/selftests/net/forwarding/ip6gre_hier_key.sh index 256607916d92..741bc9c928eb 100755 --- a/tools/testing/selftests/net/forwarding/ip6gre_hier_key.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_hier_key.sh @@ -8,6 +8,7 @@ ALL_TESTS=" gre_hier gre_mtu_change + gre_hier_remote_change " NUM_NETIFS=6 @@ -44,6 +45,19 @@ gre_mtu_change() test_mtu_change gre } +gre_hier_remote_change() +{ + hier_remote_change + + test_traffic_ip4ip6 "GRE hierarchical IPv4-in-IPv6 with key (new remote)" + test_traffic_ip6ip6 "GRE hierarchical IPv6-in-IPv6 with key (new remote)" + + hier_remote_restore + + test_traffic_ip4ip6 "GRE hierarchical IPv4-in-IPv6 with key (old remote)" + test_traffic_ip6ip6 "GRE hierarchical IPv6-in-IPv6 with key (old remote)" +} + cleanup() { pre_cleanup diff --git a/tools/testing/selftests/net/forwarding/ip6gre_hier_keys.sh b/tools/testing/selftests/net/forwarding/ip6gre_hier_keys.sh index ad1bcd6334a8..ad9eab4b1367 100755 --- a/tools/testing/selftests/net/forwarding/ip6gre_hier_keys.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_hier_keys.sh @@ -8,6 +8,7 @@ ALL_TESTS=" gre_hier gre_mtu_change + gre_hier_remote_change " NUM_NETIFS=6 @@ -44,6 +45,19 @@ gre_mtu_change() test_mtu_change gre } +gre_hier_remote_change() +{ + hier_remote_change + + test_traffic_ip4ip6 "GRE hierarchical IPv4-in-IPv6 with ikey/okey (new remote)" + test_traffic_ip6ip6 "GRE hierarchical IPv6-in-IPv6 with ikey/okey (new remote)" + + hier_remote_restore + + test_traffic_ip4ip6 "GRE hierarchical IPv4-in-IPv6 with ikey/okey (old remote)" + test_traffic_ip6ip6 "GRE hierarchical IPv6-in-IPv6 with ikey/okey (old remote)" +} + cleanup() { pre_cleanup diff --git a/tools/testing/selftests/net/forwarding/ip6gre_lib.sh b/tools/testing/selftests/net/forwarding/ip6gre_lib.sh index 24f4ab328bd2..2d91281dc5b7 100644 --- a/tools/testing/selftests/net/forwarding/ip6gre_lib.sh +++ b/tools/testing/selftests/net/forwarding/ip6gre_lib.sh @@ -436,3 +436,83 @@ test_mtu_change() check_err $? 
log_test "ping GRE IPv6, packet size 1800 after MTU change" } + +topo_flat_remote_change() +{ + local old1=$1; shift + local new1=$1; shift + local old2=$1; shift + local new2=$1; shift + + ip link set dev g1a type ip6gre local $new1 remote $new2 + __addr_add_del g1a add "$new1/128" + __addr_add_del g1a del "$old1/128" + ip -6 route add $new2/128 via 2001:db8:10::2 + ip -6 route del $old2/128 + + ip link set dev g2a type ip6gre local $new2 remote $new1 + __addr_add_del g2a add "$new2/128" + __addr_add_del g2a del "$old2/128" + ip -6 route add vrf v$ol2 $new1/128 via 2001:db8:10::1 + ip -6 route del vrf v$ol2 $old1/128 +} + +flat_remote_change() +{ + local old1=2001:db8:3::1 + local new1=2001:db8:3::10 + local old2=2001:db8:3::2 + local new2=2001:db8:3::20 + + topo_flat_remote_change $old1 $new1 $old2 $new2 +} + +flat_remote_restore() +{ + local old1=2001:db8:3::10 + local new1=2001:db8:3::1 + local old2=2001:db8:3::20 + local new2=2001:db8:3::2 + + topo_flat_remote_change $old1 $new1 $old2 $new2 +} + +topo_hier_remote_change() +{ + local old1=$1; shift + local new1=$1; shift + local old2=$1; shift + local new2=$1; shift + + __addr_add_del dummy1 del "$old1/64" + __addr_add_del dummy1 add "$new1/64" + ip link set dev g1a type ip6gre local $new1 remote $new2 + ip -6 route add vrf v$ul1 $new2/128 via 2001:db8:10::2 + ip -6 route del vrf v$ul1 $old2/128 + + __addr_add_del dummy2 del "$old2/64" + __addr_add_del dummy2 add "$new2/64" + ip link set dev g2a type ip6gre local $new2 remote $new1 + ip -6 route add vrf v$ul2 $new1/128 via 2001:db8:10::1 + ip -6 route del vrf v$ul2 $old1/128 +} + +hier_remote_change() +{ + local old1=2001:db8:3::1 + local new1=2001:db8:3::10 + local old2=2001:db8:3::2 + local new2=2001:db8:3::20 + + topo_hier_remote_change $old1 $new1 $old2 $new2 +} + +hier_remote_restore() +{ + local old1=2001:db8:3::10 + local new1=2001:db8:3::1 + local old2=2001:db8:3::20 + local new2=2001:db8:3::2 + + topo_hier_remote_change $old1 $new1 $old2 $new2 +} -- cgit From 637f41476384c76d3cd7dcf5947caf2c8b8d7a9b Mon Sep 17 00:00:00 2001 From: Daniel Golle Date: Sat, 26 Oct 2024 14:52:25 +0100 Subject: net: ethernet: mtk_wed: fix path of MT7988 WO firmware linux-firmware commit 808cba84 ("mtk_wed: add firmware for mt7988 Wireless Ethernet Dispatcher") added mt7988_wo_{0,1}.bin in the 'mediatek/mt7988' directory while driver current expects the files in the 'mediatek' directory. Change path in the driver header now that the firmware has been added. 
Fixes: e2f64db13aa1 ("net: ethernet: mtk_wed: introduce WED support for MT7988") Signed-off-by: Daniel Golle Reviewed-by: Andrew Lunn Reviewed-by: AngeloGioacchino Del Regno Link: https://patch.msgid.link/Zxz0GWTR5X5LdWPe@pidgin.makrotopia.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/mediatek/mtk_wed_wo.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.h b/drivers/net/ethernet/mediatek/mtk_wed_wo.h index 87a67fa3868d..c01b1e8428f6 100644 --- a/drivers/net/ethernet/mediatek/mtk_wed_wo.h +++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.h @@ -91,8 +91,8 @@ enum mtk_wed_dummy_cr_idx { #define MT7981_FIRMWARE_WO "mediatek/mt7981_wo.bin" #define MT7986_FIRMWARE_WO0 "mediatek/mt7986_wo_0.bin" #define MT7986_FIRMWARE_WO1 "mediatek/mt7986_wo_1.bin" -#define MT7988_FIRMWARE_WO0 "mediatek/mt7988_wo_0.bin" -#define MT7988_FIRMWARE_WO1 "mediatek/mt7988_wo_1.bin" +#define MT7988_FIRMWARE_WO0 "mediatek/mt7988/mt7988_wo_0.bin" +#define MT7988_FIRMWARE_WO1 "mediatek/mt7988/mt7988_wo_1.bin" #define MTK_WO_MCU_CFG_LS_BASE 0 #define MTK_WO_MCU_CFG_LS_HW_VER_ADDR (MTK_WO_MCU_CFG_LS_BASE + 0x000) -- cgit From aa6f8b2593b56a02043684182a89853f919dff3e Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Fri, 18 Oct 2024 15:34:11 -0700 Subject: mm/gup: stop leaking pinned pages in low memory conditions If a driver tries to call any of the pin_user_pages*(FOLL_LONGTERM) family of functions, and requests "too many" pages, then the call will erroneously leave pages pinned. This is visible in user space as an actual memory leak. Repro is trivial: just make enough pin_user_pages(FOLL_LONGTERM) calls to exhaust memory. The root cause of the problem is this sequence, within __gup_longterm_locked(): __get_user_pages_locked() rc = check_and_migrate_movable_pages() ...which gets retried in a loop. The loop error handling is incomplete, clearly due to a somewhat unusual and complicated tri-state error API. But anyway, if -ENOMEM, or in fact, any unexpected error is returned from check_and_migrate_movable_pages(), then __gup_longterm_locked() happily returns the error, while leaving the pages pinned. In the failed case, which is an app that requests (via a device driver) 30720000000 bytes to be pinned, and then exits, I see this: $ grep foll /proc/vmstat nr_foll_pin_acquired 7502048 nr_foll_pin_released 2048 And after applying this patch, it returns to balanced pins: $ grep foll /proc/vmstat nr_foll_pin_acquired 7502048 nr_foll_pin_released 7502048 Note that the child routine, check_and_migrate_movable_folios(), avoids this problem, by unpinning any folios in the **folios argument, before returning an error. Fix this by making check_and_migrate_movable_pages() behave in exactly the same way as check_and_migrate_movable_folios(): unpin all pages in **pages, before returning an error. Also, documentation was an aggravating factor, so: 1) Consolidate the documentation for these two routines, now that they have identical external behavior. 2) Rewrite the consolidated documentation: a) Clearly list the three return code cases, and what happens in each case. b) Mention that one of the cases unpins the pages or folios, before returning an error code. 
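Taken together, the tri-state contract seen from inside that retry loop looks roughly like this (a simplified sketch only; names and loop shape are close to, but not identical to, the real __gup_longterm_locked()):

	do {
		nr_pinned = __get_user_pages_locked(...);   /* pins pages[] with FOLL_PIN */
		if (nr_pinned <= 0)
			return nr_pinned;

		rc = check_and_migrate_movable_pages(nr_pinned, pages);
		/*
		 * rc == 0:       all pages may remain pinned long term, done
		 * rc == -EAGAIN: pages were migrated and unpinned, pin again
		 * other error:   pages have already been unpinned by the
		 *                callee (this fix), just propagate the error
		 */
	} while (rc == -EAGAIN);

	return rc ? rc : nr_pinned;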
Link: https://lkml.kernel.org/r/20241018223411.310331-1-jhubbard@nvidia.com Fixes: 24a95998e9ba ("mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes") Signed-off-by: John Hubbard Reviewed-by: Alistair Popple Suggested-by: David Hildenbrand Cc: Shigeru Yoshida Cc: Jason Gunthorpe Cc: Minchan Kim Cc: Pasha Tatashin Signed-off-by: Andrew Morton --- mm/gup.c | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index a82890b46a36..4637dab7b54f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2394,20 +2394,25 @@ err: } /* - * Check whether all folios are *allowed* to be pinned indefinitely (longterm). + * Check whether all folios are *allowed* to be pinned indefinitely (long term). * Rather confusingly, all folios in the range are required to be pinned via * FOLL_PIN, before calling this routine. * - * If any folios in the range are not allowed to be pinned, then this routine - * will migrate those folios away, unpin all the folios in the range and return - * -EAGAIN. The caller should re-pin the entire range with FOLL_PIN and then - * call this routine again. + * Return values: * - * If an error other than -EAGAIN occurs, this indicates a migration failure. - * The caller should give up, and propagate the error back up the call stack. - * - * If everything is OK and all folios in the range are allowed to be pinned, + * 0: if everything is OK and all folios in the range are allowed to be pinned, * then this routine leaves all folios pinned and returns zero for success. + * + * -EAGAIN: if any folios in the range are not allowed to be pinned, then this + * routine will migrate those folios away, unpin all the folios in the range. If + * migration of the entire set of folios succeeds, then -EAGAIN is returned. The + * caller should re-pin the entire range with FOLL_PIN and then call this + * routine again. + * + * -ENOMEM, or any other -errno: if an error *other* than -EAGAIN occurs, this + * indicates a migration failure. The caller should give up, and propagate the + * error back up the call stack. The caller does not need to unpin any folios in + * that case, because this routine will do the unpinning. */ static long check_and_migrate_movable_folios(unsigned long nr_folios, struct folio **folios) @@ -2425,10 +2430,8 @@ static long check_and_migrate_movable_folios(unsigned long nr_folios, } /* - * This routine just converts all the pages in the @pages array to folios and - * calls check_and_migrate_movable_folios() to do the heavy lifting. - * - * Please see the check_and_migrate_movable_folios() documentation for details. + * Return values and behavior are the same as those for + * check_and_migrate_movable_folios(). */ static long check_and_migrate_movable_pages(unsigned long nr_pages, struct page **pages) @@ -2437,8 +2440,10 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, long i, ret; folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL); - if (!folios) + if (!folios) { + unpin_user_pages(pages, nr_pages); return -ENOMEM; + } for (i = 0; i < nr_pages; i++) folios[i] = page_folio(pages[i]); -- cgit From f4657e16e767105194f97586fe3c03d3f64c4d37 Mon Sep 17 00:00:00 2001 From: Hao Ge Date: Sun, 20 Oct 2024 15:08:19 +0800 Subject: mm/codetag: fix null pointer check logic for ref and tag When we compile and load lib/slub_kunit.c,it will cause a panic. 
The root cause is that __kmalloc_cache_noprof was directly called instead of kmem_cache_alloc,which resulted in no alloc_tag being allocated.This caused current->alloc_tag to be null,leading to a null pointer dereference in alloc_tag_ref_set. Despite the fact that my colleague Pei Xiao will later fix the code in slub_kunit.c,we still need fix null pointer check logic for ref and tag to avoid panic caused by a null pointer dereference. Here is the log for the panic: [ 74.779373][ T2158] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [ 74.780130][ T2158] Mem abort info: [ 74.780406][ T2158] ESR = 0x0000000096000004 [ 74.780756][ T2158] EC = 0x25: DABT (current EL), IL = 32 bits [ 74.781225][ T2158] SET = 0, FnV = 0 [ 74.781529][ T2158] EA = 0, S1PTW = 0 [ 74.781836][ T2158] FSC = 0x04: level 0 translation fault [ 74.782288][ T2158] Data abort info: [ 74.782577][ T2158] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 74.783068][ T2158] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 74.783533][ T2158] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 74.784010][ T2158] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105f34000 [ 74.784586][ T2158] [0000000000000020] pgd=0000000000000000, p4d=0000000000000000 [ 74.785293][ T2158] Internal error: Oops: 0000000096000004 [#1] SMP [ 74.785805][ T2158] Modules linked in: slub_kunit kunit ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle 4 [ 74.790661][ T2158] CPU: 0 UID: 0 PID: 2158 Comm: kunit_try_catch Kdump: loaded Tainted: G W N 6.12.0-rc3+ #2 [ 74.791535][ T2158] Tainted: [W]=WARN, [N]=TEST [ 74.791889][ T2158] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 74.792479][ T2158] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 74.793101][ T2158] pc : alloc_tagging_slab_alloc_hook+0x120/0x270 [ 74.793607][ T2158] lr : alloc_tagging_slab_alloc_hook+0x120/0x270 [ 74.794095][ T2158] sp : ffff800084d33cd0 [ 74.794418][ T2158] x29: ffff800084d33cd0 x28: 0000000000000000 x27: 0000000000000000 [ 74.795095][ T2158] x26: 0000000000000000 x25: 0000000000000012 x24: ffff80007b30e314 [ 74.795822][ T2158] x23: ffff000390ff6f10 x22: 0000000000000000 x21: 0000000000000088 [ 74.796555][ T2158] x20: ffff000390285840 x19: fffffd7fc3ef7830 x18: ffffffffffffffff [ 74.797283][ T2158] x17: ffff8000800e63b4 x16: ffff80007b33afc4 x15: ffff800081654c00 [ 74.798011][ T2158] x14: 0000000000000000 x13: 205d383531325420 x12: 5b5d383734363537 [ 74.798744][ T2158] x11: ffff800084d337e0 x10: 000000000000005d x9 : 00000000ffffffd0 [ 74.799476][ T2158] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008219d188 x6 : c0000000ffff7fff [ 74.800206][ T2158] x5 : ffff0003fdbc9208 x4 : ffff800081edd188 x3 : 0000000000000001 [ 74.800932][ T2158] x2 : 0beaa6dee1ac5a00 x1 : 0beaa6dee1ac5a00 x0 : ffff80037c2cb000 [ 74.801656][ T2158] Call trace: [ 74.801954][ T2158] alloc_tagging_slab_alloc_hook+0x120/0x270 [ 74.802494][ T2158] __kmalloc_cache_noprof+0x148/0x33c [ 74.802976][ T2158] test_kmalloc_redzone_access+0x4c/0x104 [slub_kunit] [ 74.803607][ T2158] kunit_try_run_case+0x70/0x17c [kunit] [ 74.804124][ T2158] kunit_generic_run_threadfn_adapter+0x2c/0x4c [kunit] [ 74.804768][ T2158] kthread+0x10c/0x118 [ 74.805141][ T2158] ret_from_fork+0x10/0x20 [ 74.805540][ T2158] Code: b9400a80 11000400 b9000a80 97ffd858 (f94012d3) [ 74.806176][ T2158] SMP: stopping secondary CPUs [ 74.808130][ T2158] Starting crashdump kernel... 
Link: https://lkml.kernel.org/r/20241020070819.307944-1-hao.ge@linux.dev Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: Hao Ge Acked-by: Suren Baghdasaryan Suggested-by: Suren Baghdasaryan Acked-by: Yu Zhao Cc: Kent Overstreet Signed-off-by: Andrew Morton --- include/linux/alloc_tag.h | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h index 1f0a9ff23a2c..941deffc590d 100644 --- a/include/linux/alloc_tag.h +++ b/include/linux/alloc_tag.h @@ -135,18 +135,21 @@ static inline void alloc_tag_sub_check(union codetag_ref *ref) {} #endif /* Caller should verify both ref and tag to be valid */ -static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) +static inline bool __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) { alloc_tag_add_check(ref, tag); if (!ref || !tag) - return; + return false; ref->ct = &tag->ct; + return true; } -static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) +static inline bool alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag) { - __alloc_tag_ref_set(ref, tag); + if (unlikely(!__alloc_tag_ref_set(ref, tag))) + return false; + /* * We need in increment the call counter every time we have a new * allocation or when we split a large allocation into smaller ones. @@ -154,12 +157,13 @@ static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *t * counter because when we free each part the counter will be decremented. */ this_cpu_inc(tag->counters->calls); + return true; } static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes) { - alloc_tag_ref_set(ref, tag); - this_cpu_add(tag->counters->bytes, bytes); + if (likely(alloc_tag_ref_set(ref, tag))) + this_cpu_add(tag->counters->bytes, bytes); } static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) -- cgit From e0fc203748377835bbb4fb4c45174592214a3211 Mon Sep 17 00:00:00 2001 From: Zi Yan Date: Wed, 23 Oct 2024 13:12:36 -0400 Subject: mm: avoid VM_BUG_ON when try to map an anon large folio to zero page. An anonymous large folio can be split into non order-0 folios, try_to_map_unused_to_zeropage() should not VM_BUG_ON compound pages but just return false. This fixes the crash when splitting anonymous large folios to non order-0 folios. 
Link: https://lkml.kernel.org/r/20241023171236.1122535-1-ziy@nvidia.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Zi Yan Acked-by: David Hildenbrand Acked-by: Usama Arif Cc: Barry Song Cc: Domenico Cerasuolo Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Mike Rapoport (Microsoft) Cc: Nico Pache Cc: Rik van Riel Cc: Roman Gushchin Cc: Ryan Roberts Cc: Shakeel Butt Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/migrate.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index df91248755e4..7e520562d421 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -206,7 +206,8 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw, pte_t newpte; void *addr; - VM_BUG_ON_PAGE(PageCompound(page), page); + if (PageCompound(page)) + return false; VM_BUG_ON_PAGE(!PageAnon(page), page); VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); -- cgit From b54e1bfecc4b2775c184d2edb319232b853a686d Mon Sep 17 00:00:00 2001 From: Barry Song Date: Thu, 24 Oct 2024 10:02:01 +1300 Subject: mm: fix PSWPIN counter for large folios swap-in Similar to PSWPOUT, we should count the number of base pages instead of large folios. Link: https://lkml.kernel.org/r/20241023210201.2798-1-21cnbao@gmail.com Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices") Signed-off-by: Barry Song Acked-by: David Hildenbrand Reviewed-by: Baolin Wang Cc: Chris Li Cc: Yosry Ahmed Cc: "Huang, Ying" Cc: Kairui Song Cc: Ryan Roberts Cc: Kanchana P Sridhar Cc: Usama Arif Signed-off-by: Andrew Morton --- mm/page_io.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index 78bc88acee79..69536a2b3c13 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -570,7 +570,7 @@ static void swap_read_folio_bdev_sync(struct folio *folio, * attempt to access it in the page fault retry time check. */ get_task_struct(current); - count_vm_event(PSWPIN); + count_vm_events(PSWPIN, folio_nr_pages(folio)); submit_bio_wait(&bio); __end_swap_bio_read(&bio); put_task_struct(current); @@ -585,7 +585,7 @@ static void swap_read_folio_bdev_async(struct folio *folio, bio->bi_iter.bi_sector = swap_folio_sector(folio); bio->bi_end_io = end_swap_bio_read; bio_add_folio_nofail(bio, folio, folio_size(folio), 0); - count_vm_event(PSWPIN); + count_vm_events(PSWPIN, folio_nr_pages(folio)); submit_bio(bio); } -- cgit From 5168a68eb78fa1c67a8b2d31d0642c7fd866cc12 Mon Sep 17 00:00:00 2001 From: Kairui Song Date: Wed, 23 Oct 2024 01:55:12 +0800 Subject: mm, swap: avoid over reclaim of full clusters When running low on usable slots, cluster allocator will try to reclaim the full clusters aggressively to reclaim HAS_CACHE slots. This guarantees that as long as there are any usable slots, HAS_CACHE or not, the swap device will be usable and workload won't go OOM early. Before the cluster allocator, swap allocator fails easily if device is filled up with reclaimable HAS_CACHE slots. 
Which can be easily reproduced with following simple program: #include #include #include #include #define SIZE 8192UL * 1024UL * 1024UL int main(int argc, char **argv) { long tmp; char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset(p, 0, SIZE); madvise(p, SIZE, MADV_PAGEOUT); for (unsigned long i = 0; i < SIZE; ++i) tmp += p[i]; getchar(); /* Pause */ return 0; } Setup an 8G non ramdisk swap, the first run of the program will swapout 8G ram successfully. But run same program again after the first run paused, the second run can't swapout all 8G memory as now half of the swap device is pinned by HAS_CACHE. There was a random scan in the old allocator that may reclaim part of the HAS_CACHE by luck, but it's unreliable. The new allocator's added reclaim of full clusters when device is low on usable slots. But when multiple CPUs are seeing the device is low on usable slots at the same time, they ran into a thundering herd problem. This is an observable problem on large machine with mass parallel workload, as full cluster reclaim is slower on large swap device and higher number of CPUs will also make things worse. Testing using a 128G ZRAM on a 48c96t system. When the swap device is very close to full (eg. 124G / 128G), running build linux kernel with make -j96 in a 1G memory cgroup will hung (not a softlockup though) spinning in full cluster reclaim for about ~5min before go OOM. To solve this, split the full reclaim into two parts: - Instead of do a synchronous aggressively reclaim when device is low, do only one aggressively reclaim when device is strictly full with a kworker. This still ensures in worst case the device won't be unusable because of HAS_CACHE slots. - To avoid allocation (especially higher order) suffer from HAS_CACHE filling up clusters and kworker not responsive enough, do one synchronous scan every time the free list is drained, and only scan one cluster. This is kind of similar to the random reclaim before, keeps the full clusters rotated and has a minimal latency. This should provide a fair reclaim strategy suitable for most workloads. Link: https://lkml.kernel.org/r/20241022175512.10398-1-ryncsn@gmail.com Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim") Signed-off-by: Kairui Song Cc: Barry Song Cc: Chris Li Cc: "Huang, Ying" Cc: Hugh Dickins Cc: Kalesh Singh Cc: Ryan Roberts Cc: Yosry Ahmed Signed-off-by: Andrew Morton --- include/linux/swap.h | 1 + mm/swapfile.c | 49 ++++++++++++++++++++++++++++++------------------- 2 files changed, 31 insertions(+), 19 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index ca533b478c21..f3e0ac20c2e8 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -335,6 +335,7 @@ struct swap_info_struct { * list. 
*/ struct work_struct discard_work; /* discard worker */ + struct work_struct reclaim_work; /* reclaim worker */ struct list_head discard_clusters; /* discard clusters list */ struct plist_node avail_lists[]; /* * entries in swap_avail_heads, one diff --git a/mm/swapfile.c b/mm/swapfile.c index b0915f3fab31..46bd4b1a3c07 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -731,15 +731,16 @@ done: return offset; } -static void swap_reclaim_full_clusters(struct swap_info_struct *si) +/* Return true if reclaimed a whole cluster */ +static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) { long to_scan = 1; unsigned long offset, end; struct swap_cluster_info *ci; unsigned char *map = si->swap_map; - int nr_reclaim, total_reclaimed = 0; + int nr_reclaim; - if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER) + if (force) to_scan = si->inuse_pages / SWAPFILE_CLUSTER; while (!list_empty(&si->full_clusters)) { @@ -749,28 +750,36 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si) end = min(si->max, offset + SWAPFILE_CLUSTER); to_scan--; + spin_unlock(&si->lock); while (offset < end) { if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) { - spin_unlock(&si->lock); nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); - spin_lock(&si->lock); - if (nr_reclaim > 0) { - offset += nr_reclaim; - total_reclaimed += nr_reclaim; - continue; - } else if (nr_reclaim < 0) { - offset += -nr_reclaim; + if (nr_reclaim) { + offset += abs(nr_reclaim); continue; } } offset++; } - if (to_scan <= 0 || total_reclaimed) + spin_lock(&si->lock); + + if (to_scan <= 0) break; } } +static void swap_reclaim_work(struct work_struct *work) +{ + struct swap_info_struct *si; + + si = container_of(work, struct swap_info_struct, reclaim_work); + + spin_lock(&si->lock); + swap_reclaim_full_clusters(si, true); + spin_unlock(&si->lock); +} + /* * Try to get swap entries with specified order from current cpu's swap entry * pool (a cluster). 
This might involve allocating a new cluster for current CPU @@ -800,6 +809,10 @@ new_cluster: goto done; } + /* Try reclaim from full clusters if free clusters list is drained */ + if (vm_swap_full()) + swap_reclaim_full_clusters(si, false); + if (order < PMD_ORDER) { unsigned int frags = 0; @@ -881,13 +894,6 @@ new_cluster: } done: - /* Try reclaim from full clusters if device is nearfull */ - if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) { - swap_reclaim_full_clusters(si); - if (!found && !order && si->pages != si->inuse_pages) - goto new_cluster; - } - cluster->next[order] = offset; return found; } @@ -922,6 +928,9 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, si->lowest_bit = si->max; si->highest_bit = 0; del_from_avail_list(si); + + if (vm_swap_full()) + schedule_work(&si->reclaim_work); } } @@ -2816,6 +2825,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) wait_for_completion(&p->comp); flush_work(&p->discard_work); + flush_work(&p->reclaim_work); destroy_swap_extents(p); if (p->flags & SWP_CONTINUED) @@ -3376,6 +3386,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) return PTR_ERR(si); INIT_WORK(&si->discard_work, swap_discard_work); + INIT_WORK(&si->reclaim_work, swap_reclaim_work); name = getname(specialfile); if (IS_ERR(name)) { -- cgit From ece5897e5a10fcd56a317e32f2dc7219f366a5a8 Mon Sep 17 00:00:00 2001 From: Wladislav Wiebe Date: Tue, 22 Oct 2024 19:21:13 +0200 Subject: tools/mm: -Werror fixes in page-types/slabinfo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit e6d2c436ff693 ("tools/mm: allow users to provide additional cflags/ldflags") passes now CFLAGS to Makefile. With this, build systems with default -Werror enabled found: slabinfo.c:1300:25: error: ignoring return value of 'chdir' declared with attribute 'warn_unused_result' [-Werror=unused-result]                          chdir("..");                          ^~~~~~~~~~~ page-types.c:397:35: error: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type 'uint64_t' {aka 'long long unsigned int'} [-Werror=format=]                          printf("%lu\t", mapcnt0);                                  ~~^     ~~~~~~~ .. Fix page-types by using PRIu64 for uint64_t prints and check in slabinfo for return code on chdir(".."). Link: https://lkml.kernel.org/r/c1ceb507-94bc-461c-934d-c19b77edd825@gmail.com Fixes: e6d2c436ff69 ("tools/mm: allow users to provide additional cflags/ldflags") Signed-off-by: Wladislav Wiebe Cc: Vlastimil Babka Cc: Herton R. 
Krzesinski Cc: Signed-off-by: Andrew Morton --- tools/mm/page-types.c | 9 +++++---- tools/mm/slabinfo.c | 4 +++- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/tools/mm/page-types.c b/tools/mm/page-types.c index fa050d5a48cd..6eb17cc1a06c 100644 --- a/tools/mm/page-types.c +++ b/tools/mm/page-types.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -391,9 +392,9 @@ static void show_page_range(unsigned long voffset, unsigned long offset, if (opt_file) printf("%lx\t", voff); if (opt_list_cgroup) - printf("@%llu\t", (unsigned long long)cgroup0); + printf("@%" PRIu64 "\t", cgroup0); if (opt_list_mapcnt) - printf("%lu\t", mapcnt0); + printf("%" PRIu64 "\t", mapcnt0); printf("%lx\t%lx\t%s\n", index, count, page_flag_name(flags0)); } @@ -419,9 +420,9 @@ static void show_page(unsigned long voffset, unsigned long offset, if (opt_file) printf("%lx\t", voffset); if (opt_list_cgroup) - printf("@%llu\t", (unsigned long long)cgroup); + printf("@%" PRIu64 "\t", cgroup) if (opt_list_mapcnt) - printf("%lu\t", mapcnt); + printf("%" PRIu64 "\t", mapcnt); printf("%lx\t%s\n", offset, page_flag_name(flags)); } diff --git a/tools/mm/slabinfo.c b/tools/mm/slabinfo.c index cfaeaea71042..04e9e6ba86ea 100644 --- a/tools/mm/slabinfo.c +++ b/tools/mm/slabinfo.c @@ -1297,7 +1297,9 @@ static void read_slab_dir(void) slab->cpu_partial_free = get_obj("cpu_partial_free"); slab->alloc_node_mismatch = get_obj("alloc_node_mismatch"); slab->deactivate_bypass = get_obj("deactivate_bypass"); - chdir(".."); + if (chdir("..")) + fatal("Unable to chdir from slab ../%s\n", + slab->name); if (slab->name[0] == ':') alias_targets++; slab++; -- cgit From 330d8df81f3673d6fb74550bbc9bb159d81b35f7 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Tue, 22 Oct 2024 18:07:06 +0200 Subject: kasan: remove vmalloc_percpu test Commit 1a2473f0cbc0 ("kasan: improve vmalloc tests") added the vmalloc_percpu KASAN test with the assumption that __alloc_percpu always uses vmalloc internally, which is tagged by KASAN. However, __alloc_percpu might allocate memory from the first per-CPU chunk, which is not allocated via vmalloc(). As a result, the test might fail. Remove the test until proper KASAN annotation for the per-CPU allocated are added; tracked in https://bugzilla.kernel.org/show_bug.cgi?id=215019. Link: https://lkml.kernel.org/r/20241022160706.38943-1-andrey.konovalov@linux.dev Fixes: 1a2473f0cbc0 ("kasan: improve vmalloc tests") Signed-off-by: Andrey Konovalov Reported-by: Samuel Holland Link: https://lore.kernel.org/all/4a245fff-cc46-44d1-a5f9-fd2f1c3764ae@sifive.com/ Reported-by: Sabyrzhan Tasbolatov Link: https://lore.kernel.org/all/CACzwLxiWzNqPBp4C1VkaXZ2wDwvY3yZeetCi1TLGFipKW77drA@mail.gmail.com/ Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Marco Elver Cc: Sabyrzhan Tasbolatov Cc: Signed-off-by: Andrew Morton --- mm/kasan/kasan_test_c.c | 27 --------------------------- 1 file changed, 27 deletions(-) diff --git a/mm/kasan/kasan_test_c.c b/mm/kasan/kasan_test_c.c index a181e4780d9d..d8fb281e439d 100644 --- a/mm/kasan/kasan_test_c.c +++ b/mm/kasan/kasan_test_c.c @@ -1810,32 +1810,6 @@ static void vm_map_ram_tags(struct kunit *test) free_pages((unsigned long)p_ptr, 1); } -static void vmalloc_percpu(struct kunit *test) -{ - char __percpu *ptr; - int cpu; - - /* - * This test is specifically crafted for the software tag-based mode, - * the only tag-based mode that poisons percpu mappings. 
- */ - KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_SW_TAGS); - - ptr = __alloc_percpu(PAGE_SIZE, PAGE_SIZE); - - for_each_possible_cpu(cpu) { - char *c_ptr = per_cpu_ptr(ptr, cpu); - - KUNIT_EXPECT_GE(test, (u8)get_tag(c_ptr), (u8)KASAN_TAG_MIN); - KUNIT_EXPECT_LT(test, (u8)get_tag(c_ptr), (u8)KASAN_TAG_KERNEL); - - /* Make sure that in-bounds accesses don't crash the kernel. */ - *c_ptr = 0; - } - - free_percpu(ptr); -} - /* * Check that the assigned pointer tag falls within the [KASAN_TAG_MIN, * KASAN_TAG_KERNEL) range (note: excluding the match-all tag) for tag-based @@ -2023,7 +1997,6 @@ static struct kunit_case kasan_kunit_test_cases[] = { KUNIT_CASE(vmalloc_oob), KUNIT_CASE(vmap_tags), KUNIT_CASE(vm_map_ram_tags), - KUNIT_CASE(vmalloc_percpu), KUNIT_CASE(match_all_not_assigned), KUNIT_CASE(match_all_ptr_tag), KUNIT_CASE(match_all_mem_tag), -- cgit From d31638ff6c5437ca2968d6c22fb16524fd485013 Mon Sep 17 00:00:00 2001 From: Phillip Lougher Date: Mon, 21 Oct 2024 00:22:00 +0100 Subject: Squashfs: fix variable overflow in squashfs_readpage_block Syzbot reports a slab out of bounds access in squashfs_readpage_block(). This is caused by an attempt to read page index 0x2000000000. This value (start_index) is stored in an integer loop variable which overflows producing a value of 0. This causes a loop which iterates over pages start_index -> end_index to iterate over 0 -> end_index, which ultimately causes an out of bounds page array access. Fix by changing variable to a loff_t, and rename to index to make it clearer it is a page index, and not a loop count. Link: https://lkml.kernel.org/r/20241020232200.837231-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher Reported-by: "Lai, Yi" Closes: https://lore.kernel.org/all/ZwzcnCAosIPqQ9Ie@ly-workstation/ Signed-off-by: Andrew Morton --- fs/squashfs/file_direct.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/squashfs/file_direct.c b/fs/squashfs/file_direct.c index 22251743fadf..d19d4db74af8 100644 --- a/fs/squashfs/file_direct.c +++ b/fs/squashfs/file_direct.c @@ -30,7 +30,8 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize, int mask = (1 << (msblk->block_log - PAGE_SHIFT)) - 1; loff_t start_index = folio->index & ~mask; loff_t end_index = start_index | mask; - int i, n, pages, bytes, res = -ENOMEM; + loff_t index; + int i, pages, bytes, res = -ENOMEM; struct page **page, *last_page; struct squashfs_page_actor *actor; void *pageaddr; @@ -45,9 +46,9 @@ int squashfs_readpage_block(struct page *target_page, u64 block, int bsize, return res; /* Try to grab all the pages covered by the Squashfs block */ - for (i = 0, n = start_index; n <= end_index; n++) { - page[i] = (n == folio->index) ? target_page : - grab_cache_page_nowait(target_page->mapping, n); + for (i = 0, index = start_index; index <= end_index; index++) { + page[i] = (index == folio->index) ? target_page : + grab_cache_page_nowait(target_page->mapping, index); if (page[i] == NULL) continue; -- cgit From b3a033e3ecd3471248d474ef263aadc0059e516a Mon Sep 17 00:00:00 2001 From: Ryusuke Konishi Date: Sun, 20 Oct 2024 13:51:28 +0900 Subject: nilfs2: fix potential deadlock with newly created symlinks Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers memory reclamation involving the filesystem layer, which can result in circular lock dependencies among the reader/writer semaphore nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the fs_reclaim pseudo lock. 
This is because after commit 21fc61c73c39 ("don't put symlink bodies in pagecache into highmem"), the gfp flags of the page cache for symbolic links are overwritten to GFP_KERNEL via inode_nohighmem(). This is not a problem for symlinks read from the backing device, because the __GFP_FS flag is dropped after inode_nohighmem() is called. However, when a new symlink is created with nilfs_symlink(), the gfp flags remain overwritten to GFP_KERNEL. Then, memory allocation called from page_symlink() etc. triggers memory reclamation including the FS layer, which may call nilfs_evict_inode() or nilfs_dirty_inode(). And these can cause a deadlock if they are called while nilfs->ns_segctor_sem is held: Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags of newly created symlinks in the same way that nilfs_new_inode() and __nilfs_read_inode() do, as a workaround until we adopt nofs allocation scope consistently or improve the locking constraints. Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com Fixes: 21fc61c73c39 ("don't put symlink bodies in pagecache into highmem") Signed-off-by: Ryusuke Konishi Reported-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256 Tested-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com Cc: Signed-off-by: Andrew Morton --- fs/nilfs2/namei.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 4905063790c5..9b108052d9f7 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -157,6 +157,9 @@ static int nilfs_symlink(struct mnt_idmap *idmap, struct inode *dir, /* slow symlink */ inode->i_op = &nilfs_symlink_inode_operations; inode_nohighmem(inode); + mapping_set_gfp_mask(inode->i_mapping, + mapping_gfp_constraint(inode->i_mapping, + ~__GFP_FS)); inode->i_mapping->a_ops = &nilfs_aops; err = page_symlink(inode, symname, l); if (err) -- cgit From 9d08ec41a0645283d79a2e642205d488feaceacf Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Sat, 19 Oct 2024 22:22:12 -0600 Subject: mm: allow set/clear page_type again Some page flags (page->flags) were converted to page types (page->page_types). A recent example is PG_hugetlb. From the exclusive writer's perspective, e.g., a thread doing __folio_set_hugetlb(), there is a difference between the page flag and type APIs: the former allows the same non-atomic operation to be repeated whereas the latter does not. For example, calling __folio_set_hugetlb() twice triggers VM_BUG_ON_FOLIO(), since the second call expects the type (PG_hugetlb) not to be set previously. Using add_hugetlb_folio() as an example, it calls __folio_set_hugetlb() in the following error-handling path. And when that happens, it triggers the aforementioned VM_BUG_ON_FOLIO(). if (folio_test_hugetlb(folio)) { rc = hugetlb_vmemmap_restore_folio(h, folio); if (rc) { spin_lock_irq(&hugetlb_lock); add_hugetlb_folio(h, folio, false); ... It is possible to make hugeTLB comply with the new requirements from the page type API. However, a straightforward fix would be to just allow the same page type to be set or cleared again inside the API, to avoid any changes to its callers. 
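The practical effect on the hugetlb error path quoted above is that a repeated setter call becomes a no-op instead of a BUG (a two-line sketch of the new behaviour):

	__folio_set_hugetlb(folio);   /* sets the PGTY_hugetlb page type */
	__folio_set_hugetlb(folio);   /* previously VM_BUG_ON_FOLIO(); now returns early */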
Link: https://lkml.kernel.org/r/20241020042212.296781-1-yuzhao@google.com Fixes: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType") Signed-off-by: Yu Zhao Cc: Matthew Wilcox (Oracle) Cc: Muchun Song Cc: Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 1b3a76710487..cc839e4365c1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -975,12 +975,16 @@ static __always_inline bool folio_test_##fname(const struct folio *folio) \ } \ static __always_inline void __folio_set_##fname(struct folio *folio) \ { \ + if (folio_test_##fname(folio)) \ + return; \ VM_BUG_ON_FOLIO(data_race(folio->page.page_type) != UINT_MAX, \ folio); \ folio->page.page_type = (unsigned int)PGTY_##lname << 24; \ } \ static __always_inline void __folio_clear_##fname(struct folio *folio) \ { \ + if (folio->page.page_type == UINT_MAX) \ + return; \ VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio); \ folio->page.page_type = UINT_MAX; \ } @@ -993,11 +997,15 @@ static __always_inline int Page##uname(const struct page *page) \ } \ static __always_inline void __SetPage##uname(struct page *page) \ { \ + if (Page##uname(page)) \ + return; \ VM_BUG_ON_PAGE(data_race(page->page_type) != UINT_MAX, page); \ page->page_type = (unsigned int)PGTY_##lname << 24; \ } \ static __always_inline void __ClearPage##uname(struct page *page) \ { \ + if (page->page_type == UINT_MAX) \ + return; \ VM_BUG_ON_PAGE(!Page##uname(page), page); \ page->page_type = UINT_MAX; \ } -- cgit From caa714f86699bcfb01aa2d698db12d91af7d0d81 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Wed, 30 Oct 2024 10:35:02 +0800 Subject: drm/tests: helpers: Add helper for drm_display_mode_from_cea_vic() As Maxime suggested, add a new helper drm_kunit_display_mode_from_cea_vic(), it can replace the direct call of drm_display_mode_from_cea_vic(), and it will help solving the `mode` memory leaks. Acked-by: Maxime Ripard Suggested-by: Maxime Ripard Signed-off-by: Jinjie Ruan Link: https://patchwork.freedesktop.org/patch/msgid/20241030023504.530425-2-ruanjinjie@huawei.com Signed-off-by: Maxime Ripard --- drivers/gpu/drm/tests/drm_kunit_helpers.c | 42 +++++++++++++++++++++++++++++++ include/drm/drm_kunit_helpers.h | 4 +++ 2 files changed, 46 insertions(+) diff --git a/drivers/gpu/drm/tests/drm_kunit_helpers.c b/drivers/gpu/drm/tests/drm_kunit_helpers.c index aa62719dab0e..04a6b8cc62ac 100644 --- a/drivers/gpu/drm/tests/drm_kunit_helpers.c +++ b/drivers/gpu/drm/tests/drm_kunit_helpers.c @@ -3,6 +3,7 @@ #include #include #include +#include #include #include #include @@ -311,6 +312,47 @@ drm_kunit_helper_create_crtc(struct kunit *test, } EXPORT_SYMBOL_GPL(drm_kunit_helper_create_crtc); +static void kunit_action_drm_mode_destroy(void *ptr) +{ + struct drm_display_mode *mode = ptr; + + drm_mode_destroy(NULL, mode); +} + +/** + * drm_kunit_display_mode_from_cea_vic() - return a mode for CEA VIC + for a KUnit test + * @test: The test context object + * @dev: DRM device + * @video_code: CEA VIC of the mode + * + * Creates a new mode matching the specified CEA VIC for a KUnit test. + * + * Resources will be cleaned up automatically. 
+ * + * Returns: A new drm_display_mode on success or NULL on failure + */ +struct drm_display_mode * +drm_kunit_display_mode_from_cea_vic(struct kunit *test, struct drm_device *dev, + u8 video_code) +{ + struct drm_display_mode *mode; + int ret; + + mode = drm_display_mode_from_cea_vic(dev, video_code); + if (!mode) + return NULL; + + ret = kunit_add_action_or_reset(test, + kunit_action_drm_mode_destroy, + mode); + if (ret) + return NULL; + + return mode; +} +EXPORT_SYMBOL_GPL(drm_kunit_display_mode_from_cea_vic); + MODULE_AUTHOR("Maxime Ripard "); MODULE_DESCRIPTION("KUnit test suite helper functions"); MODULE_LICENSE("GPL"); diff --git a/include/drm/drm_kunit_helpers.h b/include/drm/drm_kunit_helpers.h index e7cc17ee4934..afdd46ef04f7 100644 --- a/include/drm/drm_kunit_helpers.h +++ b/include/drm/drm_kunit_helpers.h @@ -120,4 +120,8 @@ drm_kunit_helper_create_crtc(struct kunit *test, const struct drm_crtc_funcs *funcs, const struct drm_crtc_helper_funcs *helper_funcs); +struct drm_display_mode * +drm_kunit_display_mode_from_cea_vic(struct kunit *test, struct drm_device *dev, + u8 video_code); + #endif // DRM_KUNIT_HELPERS_H_ -- cgit From 926163342a2e7595d950e84c17c693b1272bd491 Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Wed, 30 Oct 2024 10:35:03 +0800 Subject: drm/connector: hdmi: Fix memory leak in drm_display_mode_from_cea_vic() modprobe drm_connector_test and then rmmod drm_connector_test, the following memory leak occurs. The `mode` allocated in drm_mode_duplicate() called by drm_display_mode_from_cea_vic() is not freed, which cause the memory leak: unreferenced object 0xffffff80cb0ee400 (size 128): comm "kunit_try_catch", pid 1948, jiffies 4294950339 hex dump (first 32 bytes): 14 44 02 00 80 07 d8 07 04 08 98 08 00 00 38 04 .D............8. 3c 04 41 04 65 04 00 00 05 00 00 00 00 00 00 00 <.A.e........... backtrace (crc 90e9585c): [<00000000ec42e3d7>] kmemleak_alloc+0x34/0x40 [<00000000d0ef055a>] __kmalloc_cache_noprof+0x26c/0x2f4 [<00000000c2062161>] drm_mode_duplicate+0x44/0x19c [<00000000f96c74aa>] drm_display_mode_from_cea_vic+0x88/0x98 [<00000000d8f2c8b4>] 0xffffffdc982a4868 [<000000005d164dbc>] kunit_try_run_case+0x13c/0x3ac [<000000006fb23398>] kunit_generic_run_threadfn_adapter+0x80/0xec [<000000006ea56ca0>] kthread+0x2e8/0x374 [<000000000676063f>] ret_from_fork+0x10/0x20 ...... Free `mode` by using drm_kunit_display_mode_from_cea_vic() to fix it. 
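For reference, a test written against the new helper would look roughly like the sketch below; the fixture accessor is hypothetical, and only the helper call, the VIC value and the automatic cleanup semantics are taken from these patches:

static void drm_test_mode_from_vic_example(struct kunit *test)
{
	struct drm_device *drm = get_test_drm_device(test); /* hypothetical fixture accessor */
	struct drm_display_mode *mode;

	/* CEA VIC 16 is 1920x1080@60; cleanup is registered via kunit_add_action */
	mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16);
	KUNIT_ASSERT_NOT_NULL(test, mode);

	/* use mode...; no explicit drm_mode_destroy() is needed */
}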
Cc: stable@vger.kernel.org Fixes: abb6f74973e2 ("drm/tests: Add HDMI TDMS character rate tests") Acked-by: Maxime Ripard Signed-off-by: Jinjie Ruan Link: https://patchwork.freedesktop.org/patch/msgid/20241030023504.530425-3-ruanjinjie@huawei.com Signed-off-by: Maxime Ripard --- drivers/gpu/drm/tests/drm_connector_test.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/tests/drm_connector_test.c b/drivers/gpu/drm/tests/drm_connector_test.c index 15e36a8db685..6bba97d0be88 100644 --- a/drivers/gpu/drm/tests/drm_connector_test.c +++ b/drivers/gpu/drm/tests/drm_connector_test.c @@ -996,7 +996,7 @@ static void drm_test_drm_hdmi_compute_mode_clock_rgb(struct kunit *test) unsigned long long rate; struct drm_device *drm = &priv->drm; - mode = drm_display_mode_from_cea_vic(drm, 16); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1017,7 +1017,7 @@ static void drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc(struct kunit *test) unsigned long long rate; struct drm_device *drm = &priv->drm; - mode = drm_display_mode_from_cea_vic(drm, 16); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1038,7 +1038,7 @@ static void drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1(struct kunit *t unsigned long long rate; struct drm_device *drm = &priv->drm; - mode = drm_display_mode_from_cea_vic(drm, 1); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 1); KUNIT_ASSERT_NOT_NULL(test, mode); rate = drm_hdmi_compute_mode_clock(mode, 10, HDMI_COLORSPACE_RGB); @@ -1056,7 +1056,7 @@ static void drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc(struct kunit *test) unsigned long long rate; struct drm_device *drm = &priv->drm; - mode = drm_display_mode_from_cea_vic(drm, 16); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1077,7 +1077,7 @@ static void drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1(struct kunit *t unsigned long long rate; struct drm_device *drm = &priv->drm; - mode = drm_display_mode_from_cea_vic(drm, 1); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 1); KUNIT_ASSERT_NOT_NULL(test, mode); rate = drm_hdmi_compute_mode_clock(mode, 12, HDMI_COLORSPACE_RGB); @@ -1095,7 +1095,7 @@ static void drm_test_drm_hdmi_compute_mode_clock_rgb_double(struct kunit *test) unsigned long long rate; struct drm_device *drm = &priv->drm; - mode = drm_display_mode_from_cea_vic(drm, 6); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 6); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_TRUE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1118,7 +1118,7 @@ static void drm_test_connector_hdmi_compute_mode_clock_yuv420_valid(struct kunit unsigned long long rate; unsigned int vic = *(unsigned int *)test->param_value; - mode = drm_display_mode_from_cea_vic(drm, vic); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, vic); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1155,7 +1155,7 @@ static void drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc(struct kuni drm_hdmi_compute_mode_clock_yuv420_vic_valid_tests[0]; unsigned long long rate; - mode = drm_display_mode_from_cea_vic(drm, vic); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, vic); 
KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1180,7 +1180,7 @@ static void drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc(struct kuni drm_hdmi_compute_mode_clock_yuv420_vic_valid_tests[0]; unsigned long long rate; - mode = drm_display_mode_from_cea_vic(drm, vic); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, vic); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1203,7 +1203,7 @@ static void drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc(struct kunit struct drm_device *drm = &priv->drm; unsigned long long rate; - mode = drm_display_mode_from_cea_vic(drm, 16); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1225,7 +1225,7 @@ static void drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc(struct kuni struct drm_device *drm = &priv->drm; unsigned long long rate; - mode = drm_display_mode_from_cea_vic(drm, 16); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); @@ -1247,7 +1247,7 @@ static void drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc(struct kuni struct drm_device *drm = &priv->drm; unsigned long long rate; - mode = drm_display_mode_from_cea_vic(drm, 16); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 16); KUNIT_ASSERT_NOT_NULL(test, mode); KUNIT_ASSERT_FALSE(test, mode->flags & DRM_MODE_FLAG_DBLCLK); -- cgit From add4163aca0d4a86e9fe4aa513865e4237db8aef Mon Sep 17 00:00:00 2001 From: Jinjie Ruan Date: Wed, 30 Oct 2024 10:35:04 +0800 Subject: drm/tests: hdmi: Fix memory leaks in drm_display_mode_from_cea_vic() modprobe drm_hdmi_state_helper_test and then rmmod it, the following memory leak occurs. The `mode` allocated in drm_mode_duplicate() called by drm_display_mode_from_cea_vic() is not freed, which cause the memory leak: unreferenced object 0xffffff80ccd18100 (size 128): comm "kunit_try_catch", pid 1851, jiffies 4295059695 hex dump (first 32 bytes): 57 62 00 00 80 02 90 02 f0 02 20 03 00 00 e0 01 Wb........ ..... ea 01 ec 01 0d 02 00 00 0a 00 00 00 00 00 00 00 ................ backtrace (crc c2f1aa95): [<000000000f10b11b>] kmemleak_alloc+0x34/0x40 [<000000001cd4cf73>] __kmalloc_cache_noprof+0x26c/0x2f4 [<00000000f1f3cffa>] drm_mode_duplicate+0x44/0x19c [<000000008cbeef13>] drm_display_mode_from_cea_vic+0x88/0x98 [<0000000019daaacf>] 0xffffffedc11ae69c [<000000000aad0f85>] kunit_try_run_case+0x13c/0x3ac [<00000000a9210bac>] kunit_generic_run_threadfn_adapter+0x80/0xec [<000000000a0b2e9e>] kthread+0x2e8/0x374 [<00000000bd668858>] ret_from_fork+0x10/0x20 ...... Free `mode` by using drm_kunit_display_mode_from_cea_vic() to fix it. 
Cc: stable@vger.kernel.org Fixes: 4af70f19e559 ("drm/tests: Add RGB Quantization tests") Acked-by: Maxime Ripard Signed-off-by: Jinjie Ruan Link: https://patchwork.freedesktop.org/patch/msgid/20241030023504.530425-4-ruanjinjie@huawei.com Signed-off-by: Maxime Ripard --- drivers/gpu/drm/tests/drm_hdmi_state_helper_test.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/tests/drm_hdmi_state_helper_test.c b/drivers/gpu/drm/tests/drm_hdmi_state_helper_test.c index 34ee95d41f29..294773342e71 100644 --- a/drivers/gpu/drm/tests/drm_hdmi_state_helper_test.c +++ b/drivers/gpu/drm/tests/drm_hdmi_state_helper_test.c @@ -441,7 +441,7 @@ static void drm_test_check_broadcast_rgb_auto_cea_mode_vic_1(struct kunit *test) ctx = drm_kunit_helper_acquire_ctx_alloc(test); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx); - mode = drm_display_mode_from_cea_vic(drm, 1); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 1); KUNIT_ASSERT_NOT_NULL(test, mode); drm = &priv->drm; @@ -555,7 +555,7 @@ static void drm_test_check_broadcast_rgb_full_cea_mode_vic_1(struct kunit *test) ctx = drm_kunit_helper_acquire_ctx_alloc(test); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx); - mode = drm_display_mode_from_cea_vic(drm, 1); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 1); KUNIT_ASSERT_NOT_NULL(test, mode); drm = &priv->drm; @@ -671,7 +671,7 @@ static void drm_test_check_broadcast_rgb_limited_cea_mode_vic_1(struct kunit *te ctx = drm_kunit_helper_acquire_ctx_alloc(test); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx); - mode = drm_display_mode_from_cea_vic(drm, 1); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 1); KUNIT_ASSERT_NOT_NULL(test, mode); drm = &priv->drm; @@ -1263,7 +1263,7 @@ static void drm_test_check_output_bpc_format_vic_1(struct kunit *test) ctx = drm_kunit_helper_acquire_ctx_alloc(test); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ctx); - mode = drm_display_mode_from_cea_vic(drm, 1); + mode = drm_kunit_display_mode_from_cea_vic(test, drm, 1); KUNIT_ASSERT_NOT_NULL(test, mode); /* -- cgit From d5953d680f7e96208c29ce4139a0e38de87a57fe Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Wed, 30 Oct 2024 23:13:48 +0100 Subject: netfilter: nft_payload: sanitize offset and length before calling skb_checksum() If access to offset + length is larger than the skbuff length, then skb_checksum() triggers BUG_ON(). skb_checksum() internally subtracts the length parameter while iterating over skbuff, BUG_ON(len) at the end of it checks that the expected length to be included in the checksum calculation is fully consumed. 
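Put differently, any caller that lets userspace choose the offset and length must bound-check them against skb->len before handing them to skb_checksum(). A minimal, generic illustration of the guard (function and variable names are generic, not the exact nft_payload code):

#include <linux/errno.h>
#include <linux/skbuff.h>

static int example_checksum_range(const struct sk_buff *skb,
                                  unsigned int offset, unsigned int len,
                                  __wsum *sum)
{
        /* skb_checksum() hits BUG_ON() if [offset, offset + len) is not
         * fully contained in the skb, so reject an out-of-range request
         * up front instead of letting it crash.
         */
        if (offset > skb->len || len > skb->len - offset)
                return -ERANGE;

        *sum = skb_checksum(skb, offset, len, 0);
        return 0;
}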
Fixes: 7ec3f7b47b8d ("netfilter: nft_payload: add packet mangling support") Reported-by: Slavin Liu Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nft_payload.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c index 330609a76fb2..7dfc5343dae4 100644 --- a/net/netfilter/nft_payload.c +++ b/net/netfilter/nft_payload.c @@ -904,6 +904,9 @@ static void nft_payload_set_eval(const struct nft_expr *expr, ((priv->base != NFT_PAYLOAD_TRANSPORT_HEADER && priv->base != NFT_PAYLOAD_INNER_HEADER) || skb->ip_summed != CHECKSUM_PARTIAL)) { + if (offset + priv->len > skb->len) + goto err; + fsum = skb_checksum(skb, offset, priv->len, 0); tsum = csum_partial(src, priv->len, 0); -- cgit From e6ab19443b36a45ebfb392775cb17d6a78dd07ea Mon Sep 17 00:00:00 2001 From: Peiyang Wang Date: Fri, 25 Oct 2024 17:29:30 +0800 Subject: net: hns3: default enable tx bounce buffer when smmu enabled The SMMU engine on HIP09 chip has a hardware issue. SMMU pagetable prefetch features may prefetch and use a invalid PTE even the PTE is valid at that time. This will cause the device trigger fake pagefaults. The solution is to avoid prefetching by adding a SYNC command when smmu mapping a iova. But the performance of nic has a sharp drop. Then we do this workaround, always enable tx bounce buffer, avoid mapping/unmapping on TX path. This issue only affects HNS3, so we always enable tx bounce buffer when smmu enabled to improve performance. Fixes: 295ba232a8c3 ("net: hns3: add device version to replace pci revision") Signed-off-by: Peiyang Wang Signed-off-by: Jian Shen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 31 ++++++++++++++++++++ drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 ++ drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 33 ++++++++++++++++++++++ 3 files changed, 66 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 4cbc4d069a1f..ac88e301f221 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -1032,6 +1033,8 @@ static bool hns3_can_use_tx_sgl(struct hns3_enet_ring *ring, static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring) { u32 alloc_size = ring->tqp->handle->kinfo.tx_spare_buf_size; + struct net_device *netdev = ring_to_netdev(ring); + struct hns3_nic_priv *priv = netdev_priv(netdev); struct hns3_tx_spare *tx_spare; struct page *page; dma_addr_t dma; @@ -1073,6 +1076,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring) tx_spare->buf = page_address(page); tx_spare->len = PAGE_SIZE << order; ring->tx_spare = tx_spare; + ring->tx_copybreak = priv->tx_copybreak; return; dma_mapping_error: @@ -4868,6 +4872,30 @@ static void hns3_nic_dealloc_vector_data(struct hns3_nic_priv *priv) devm_kfree(&pdev->dev, priv->tqp_vector); } +static void hns3_update_tx_spare_buf_config(struct hns3_nic_priv *priv) +{ +#define HNS3_MIN_SPARE_BUF_SIZE (2 * 1024 * 1024) +#define HNS3_MAX_PACKET_SIZE (64 * 1024) + + struct iommu_domain *domain = iommu_get_domain_for_dev(priv->dev); + struct hnae3_ae_dev *ae_dev = hns3_get_ae_dev(priv->ae_handle); + struct hnae3_handle *handle = priv->ae_handle; + + if (ae_dev->dev_version < HNAE3_DEVICE_VERSION_V3) + return; + + if (!(domain && iommu_is_dma_domain(domain))) + return; + + priv->min_tx_copybreak = 
HNS3_MAX_PACKET_SIZE; + priv->min_tx_spare_buf_size = HNS3_MIN_SPARE_BUF_SIZE; + + if (priv->tx_copybreak < priv->min_tx_copybreak) + priv->tx_copybreak = priv->min_tx_copybreak; + if (handle->kinfo.tx_spare_buf_size < priv->min_tx_spare_buf_size) + handle->kinfo.tx_spare_buf_size = priv->min_tx_spare_buf_size; +} + static void hns3_ring_get_cfg(struct hnae3_queue *q, struct hns3_nic_priv *priv, unsigned int ring_type) { @@ -5101,6 +5129,7 @@ int hns3_init_all_ring(struct hns3_nic_priv *priv) int i, j; int ret; + hns3_update_tx_spare_buf_config(priv); for (i = 0; i < ring_num; i++) { ret = hns3_alloc_ring_memory(&priv->ring[i]); if (ret) { @@ -5305,6 +5334,8 @@ static int hns3_client_init(struct hnae3_handle *handle) priv->ae_handle = handle; priv->tx_timeout_count = 0; priv->max_non_tso_bd_num = ae_dev->dev_specs.max_non_tso_bd_num; + priv->min_tx_copybreak = 0; + priv->min_tx_spare_buf_size = 0; set_bit(HNS3_NIC_STATE_DOWN, &priv->state); handle->msg_enable = netif_msg_init(debug, DEFAULT_MSG_LEVEL); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index d36c4ed16d8d..caf7a4df8585 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -596,6 +596,8 @@ struct hns3_nic_priv { struct hns3_enet_coalesce rx_coal; u32 tx_copybreak; u32 rx_copybreak; + u32 min_tx_copybreak; + u32 min_tx_spare_buf_size; }; union l3_hdr_info { diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index b1e988347347..97eaeec1952b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1933,6 +1933,31 @@ static int hns3_set_tx_spare_buf_size(struct net_device *netdev, return ret; } +static int hns3_check_tx_copybreak(struct net_device *netdev, u32 copybreak) +{ + struct hns3_nic_priv *priv = netdev_priv(netdev); + + if (copybreak < priv->min_tx_copybreak) { + netdev_err(netdev, "tx copybreak %u should be no less than %u!\n", + copybreak, priv->min_tx_copybreak); + return -EINVAL; + } + return 0; +} + +static int hns3_check_tx_spare_buf_size(struct net_device *netdev, u32 buf_size) +{ + struct hns3_nic_priv *priv = netdev_priv(netdev); + + if (buf_size < priv->min_tx_spare_buf_size) { + netdev_err(netdev, + "tx spare buf size %u should be no less than %u!\n", + buf_size, priv->min_tx_spare_buf_size); + return -EINVAL; + } + return 0; +} + static int hns3_set_tunable(struct net_device *netdev, const struct ethtool_tunable *tuna, const void *data) @@ -1949,6 +1974,10 @@ static int hns3_set_tunable(struct net_device *netdev, switch (tuna->id) { case ETHTOOL_TX_COPYBREAK: + ret = hns3_check_tx_copybreak(netdev, *(u32 *)data); + if (ret) + return ret; + priv->tx_copybreak = *(u32 *)data; for (i = 0; i < h->kinfo.num_tqps; i++) @@ -1963,6 +1992,10 @@ static int hns3_set_tunable(struct net_device *netdev, break; case ETHTOOL_TX_COPYBREAK_BUF_SIZE: + ret = hns3_check_tx_spare_buf_size(netdev, *(u32 *)data); + if (ret) + return ret; + old_tx_spare_buf_size = h->kinfo.tx_spare_buf_size; new_tx_spare_buf_size = *(u32 *)data; netdev_info(netdev, "request to set tx spare buf size from %u to %u\n", -- cgit From f2c14899caba76da93ff3fff46b4d5a8f43ce07e Mon Sep 17 00:00:00 2001 From: Jian Shen Date: Fri, 25 Oct 2024 17:29:31 +0800 Subject: net: hns3: add sync command to sync io-pgtable To avoid errors in pgtable prefectch, add a sync command to sync io-pagtable. 
This is a supplement for the previous patch. We want all the tx packet can be handled with tx bounce buffer path. But it depends on the remain space of the spare buffer, checked by the hns3_can_use_tx_bounce(). In most cases, maybe 99.99%, it returns true. But once it return false by no available space, the packet will be handled with the former path, which will map/unmap the skb buffer. Then the driver will face the smmu prefetch risk again. So add a sync command in this case to avoid smmu prefectch, just protects corner scenes. Fixes: 295ba232a8c3 ("net: hns3: add device version to replace pci revision") Signed-off-by: Jian Shen Signed-off-by: Peiyang Wang Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 27 +++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index ac88e301f221..8760b4e9ade6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -381,6 +381,24 @@ static const struct hns3_rx_ptype hns3_rx_ptype_tbl[] = { #define HNS3_INVALID_PTYPE \ ARRAY_SIZE(hns3_rx_ptype_tbl) +static void hns3_dma_map_sync(struct device *dev, unsigned long iova) +{ + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct iommu_iotlb_gather iotlb_gather; + size_t granule; + + if (!domain || !iommu_is_dma_domain(domain)) + return; + + granule = 1 << __ffs(domain->pgsize_bitmap); + iova = ALIGN_DOWN(iova, granule); + iotlb_gather.start = iova; + iotlb_gather.end = iova + granule - 1; + iotlb_gather.pgsize = granule; + + iommu_iotlb_sync(domain, &iotlb_gather); +} + static irqreturn_t hns3_irq_handle(int irq, void *vector) { struct hns3_enet_tqp_vector *tqp_vector = vector; @@ -1728,7 +1746,9 @@ static int hns3_map_and_fill_desc(struct hns3_enet_ring *ring, void *priv, unsigned int type) { struct hns3_desc_cb *desc_cb = &ring->desc_cb[ring->next_to_use]; + struct hnae3_handle *handle = ring->tqp->handle; struct device *dev = ring_to_dev(ring); + struct hnae3_ae_dev *ae_dev; unsigned int size; dma_addr_t dma; @@ -1760,6 +1780,13 @@ static int hns3_map_and_fill_desc(struct hns3_enet_ring *ring, void *priv, return -ENOMEM; } + /* Add a SYNC command to sync io-pgtale to avoid errors in pgtable + * prefetch + */ + ae_dev = hns3_get_ae_dev(handle); + if (ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V3) + hns3_dma_map_sync(dev, dma); + desc_cb->priv = priv; desc_cb->length = size; desc_cb->dma = dma; -- cgit From 3e0f7cc887b77603182dceca4d3a6e84f6a40d0a Mon Sep 17 00:00:00 2001 From: Hao Lan Date: Fri, 25 Oct 2024 17:29:32 +0800 Subject: net: hns3: fixed reset failure issues caused by the incorrect reset type When a reset type that is not supported by the driver is input, a reset pending flag bit of the HNAE3_NONE_RESET type is generated in reset_pending. The driver does not have a mechanism to clear this type of error. As a result, the driver considers that the reset is not complete. This patch provides a mechanism to clear the HNAE3_NONE_RESET flag and the parameter of hnae3_ae_ops.set_default_reset_request is verified. The error message: hns3 0000:39:01.0: cmd failed -16 hns3 0000:39:01.0: hclge device re-init failed, VF is disabled! 
hns3 0000:39:01.0: failed to reset VF stack hns3 0000:39:01.0: failed to reset VF(4) hns3 0000:39:01.0: prepare reset(2) wait done hns3 0000:39:01.0 eth4: already uninitialized Use the crash tool to view struct hclgevf_dev: struct hclgevf_dev { ... default_reset_request = 0x20, reset_level = HNAE3_NONE_RESET, reset_pending = 0x100, reset_type = HNAE3_NONE_RESET, ... }; Fixes: 720bd5837e37 ("net: hns3: add set_default_reset_request in the hnae3_ae_ops") Signed-off-by: Hao Lan Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 33 ++++++++++++++++--- .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 38 ++++++++++++++++++---- 2 files changed, 61 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index bd86efd92a5a..35c618c794be 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3584,6 +3584,17 @@ static int hclge_set_vf_link_state(struct hnae3_handle *handle, int vf, return ret; } +static void hclge_set_reset_pending(struct hclge_dev *hdev, + enum hnae3_reset_type reset_type) +{ + /* When an incorrect reset type is executed, the get_reset_level + * function generates the HNAE3_NONE_RESET flag. As a result, this + * type do not need to pending. + */ + if (reset_type != HNAE3_NONE_RESET) + set_bit(reset_type, &hdev->reset_pending); +} + static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval) { u32 cmdq_src_reg, msix_src_reg, hw_err_src_reg; @@ -3604,7 +3615,7 @@ static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval) */ if (BIT(HCLGE_VECTOR0_IMPRESET_INT_B) & msix_src_reg) { dev_info(&hdev->pdev->dev, "IMP reset interrupt\n"); - set_bit(HNAE3_IMP_RESET, &hdev->reset_pending); + hclge_set_reset_pending(hdev, HNAE3_IMP_RESET); set_bit(HCLGE_COMM_STATE_CMD_DISABLE, &hdev->hw.hw.comm_state); *clearval = BIT(HCLGE_VECTOR0_IMPRESET_INT_B); hdev->rst_stats.imp_rst_cnt++; @@ -3614,7 +3625,7 @@ static u32 hclge_check_event_cause(struct hclge_dev *hdev, u32 *clearval) if (BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B) & msix_src_reg) { dev_info(&hdev->pdev->dev, "global reset interrupt\n"); set_bit(HCLGE_COMM_STATE_CMD_DISABLE, &hdev->hw.hw.comm_state); - set_bit(HNAE3_GLOBAL_RESET, &hdev->reset_pending); + hclge_set_reset_pending(hdev, HNAE3_GLOBAL_RESET); *clearval = BIT(HCLGE_VECTOR0_GLOBALRESET_INT_B); hdev->rst_stats.global_rst_cnt++; return HCLGE_VECTOR0_EVENT_RST; @@ -4062,7 +4073,7 @@ static void hclge_do_reset(struct hclge_dev *hdev) case HNAE3_FUNC_RESET: dev_info(&pdev->dev, "PF reset requested\n"); /* schedule again to check later */ - set_bit(HNAE3_FUNC_RESET, &hdev->reset_pending); + hclge_set_reset_pending(hdev, HNAE3_FUNC_RESET); hclge_reset_task_schedule(hdev); break; default: @@ -4096,6 +4107,8 @@ static enum hnae3_reset_type hclge_get_reset_level(struct hnae3_ae_dev *ae_dev, clear_bit(HNAE3_FLR_RESET, addr); } + clear_bit(HNAE3_NONE_RESET, addr); + if (hdev->reset_type != HNAE3_NONE_RESET && rst_level < hdev->reset_type) return HNAE3_NONE_RESET; @@ -4237,7 +4250,7 @@ static bool hclge_reset_err_handle(struct hclge_dev *hdev) return false; } else if (hdev->rst_stats.reset_fail_cnt < MAX_RESET_FAIL_CNT) { hdev->rst_stats.reset_fail_cnt++; - set_bit(hdev->reset_type, &hdev->reset_pending); + hclge_set_reset_pending(hdev, hdev->reset_type); dev_info(&hdev->pdev->dev, "re-schedule reset task(%u)\n", 
hdev->rst_stats.reset_fail_cnt); @@ -4480,8 +4493,20 @@ static void hclge_reset_event(struct pci_dev *pdev, struct hnae3_handle *handle) static void hclge_set_def_reset_request(struct hnae3_ae_dev *ae_dev, enum hnae3_reset_type rst_type) { +#define HCLGE_SUPPORT_RESET_TYPE \ + (BIT(HNAE3_FLR_RESET) | BIT(HNAE3_FUNC_RESET) | \ + BIT(HNAE3_GLOBAL_RESET) | BIT(HNAE3_IMP_RESET)) + struct hclge_dev *hdev = ae_dev->priv; + if (!(BIT(rst_type) & HCLGE_SUPPORT_RESET_TYPE)) { + /* To prevent reset triggered by hclge_reset_event */ + set_bit(HNAE3_NONE_RESET, &hdev->default_reset_request); + dev_warn(&hdev->pdev->dev, "unsupported reset type %d\n", + rst_type); + return; + } + set_bit(rst_type, &hdev->default_reset_request); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 094a7c7b5592..ab54e6155e93 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -1395,6 +1395,17 @@ static int hclgevf_notify_roce_client(struct hclgevf_dev *hdev, return ret; } +static void hclgevf_set_reset_pending(struct hclgevf_dev *hdev, + enum hnae3_reset_type reset_type) +{ + /* When an incorrect reset type is executed, the get_reset_level + * function generates the HNAE3_NONE_RESET flag. As a result, this + * type do not need to pending. + */ + if (reset_type != HNAE3_NONE_RESET) + set_bit(reset_type, &hdev->reset_pending); +} + static int hclgevf_reset_wait(struct hclgevf_dev *hdev) { #define HCLGEVF_RESET_WAIT_US 20000 @@ -1544,7 +1555,7 @@ static void hclgevf_reset_err_handle(struct hclgevf_dev *hdev) hdev->rst_stats.rst_fail_cnt); if (hdev->rst_stats.rst_fail_cnt < HCLGEVF_RESET_MAX_FAIL_CNT) - set_bit(hdev->reset_type, &hdev->reset_pending); + hclgevf_set_reset_pending(hdev, hdev->reset_type); if (hclgevf_is_reset_pending(hdev)) { set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state); @@ -1664,6 +1675,8 @@ static enum hnae3_reset_type hclgevf_get_reset_level(unsigned long *addr) clear_bit(HNAE3_FLR_RESET, addr); } + clear_bit(HNAE3_NONE_RESET, addr); + return rst_level; } @@ -1673,14 +1686,15 @@ static void hclgevf_reset_event(struct pci_dev *pdev, struct hnae3_ae_dev *ae_dev = pci_get_drvdata(pdev); struct hclgevf_dev *hdev = ae_dev->priv; - dev_info(&hdev->pdev->dev, "received reset request from VF enet\n"); - if (hdev->default_reset_request) hdev->reset_level = hclgevf_get_reset_level(&hdev->default_reset_request); else hdev->reset_level = HNAE3_VF_FUNC_RESET; + dev_info(&hdev->pdev->dev, "received reset request from VF enet, reset level is %d\n", + hdev->reset_level); + /* reset of this VF requested */ set_bit(HCLGEVF_RESET_REQUESTED, &hdev->reset_state); hclgevf_reset_task_schedule(hdev); @@ -1691,8 +1705,20 @@ static void hclgevf_reset_event(struct pci_dev *pdev, static void hclgevf_set_def_reset_request(struct hnae3_ae_dev *ae_dev, enum hnae3_reset_type rst_type) { +#define HCLGEVF_SUPPORT_RESET_TYPE \ + (BIT(HNAE3_VF_RESET) | BIT(HNAE3_VF_FUNC_RESET) | \ + BIT(HNAE3_VF_PF_FUNC_RESET) | BIT(HNAE3_VF_FULL_RESET) | \ + BIT(HNAE3_FLR_RESET) | BIT(HNAE3_VF_EXP_RESET)) + struct hclgevf_dev *hdev = ae_dev->priv; + if (!(BIT(rst_type) & HCLGEVF_SUPPORT_RESET_TYPE)) { + /* To prevent reset triggered by hclge_reset_event */ + set_bit(HNAE3_NONE_RESET, &hdev->default_reset_request); + dev_info(&hdev->pdev->dev, "unsupported reset type %d\n", + rst_type); + return; + } set_bit(rst_type, &hdev->default_reset_request); } @@ -1849,14 +1875,14 @@ static 
void hclgevf_reset_service_task(struct hclgevf_dev *hdev) */ if (hdev->reset_attempts > HCLGEVF_MAX_RESET_ATTEMPTS_CNT) { /* prepare for full reset of stack + pcie interface */ - set_bit(HNAE3_VF_FULL_RESET, &hdev->reset_pending); + hclgevf_set_reset_pending(hdev, HNAE3_VF_FULL_RESET); /* "defer" schedule the reset task again */ set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state); } else { hdev->reset_attempts++; - set_bit(hdev->reset_level, &hdev->reset_pending); + hclgevf_set_reset_pending(hdev, hdev->reset_level); set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state); } hclgevf_reset_task_schedule(hdev); @@ -1979,7 +2005,7 @@ static enum hclgevf_evt_cause hclgevf_check_evt_cause(struct hclgevf_dev *hdev, rst_ing_reg = hclgevf_read_dev(&hdev->hw, HCLGEVF_RST_ING); dev_info(&hdev->pdev->dev, "receive reset interrupt 0x%x!\n", rst_ing_reg); - set_bit(HNAE3_VF_RESET, &hdev->reset_pending); + hclgevf_set_reset_pending(hdev, HNAE3_VF_RESET); set_bit(HCLGEVF_RESET_PENDING, &hdev->reset_state); set_bit(HCLGE_COMM_STATE_CMD_DISABLE, &hdev->hw.hw.comm_state); *clearval = ~(1U << HCLGEVF_VECTOR0_RST_INT_B); -- cgit From 662ecfc46690e92cf630f51b5d4bbbcffe102980 Mon Sep 17 00:00:00 2001 From: Hao Lan Date: Fri, 25 Oct 2024 17:29:33 +0800 Subject: net: hns3: fix missing features due to dev->features configuration too early Currently, the netdev->features is configured in hns3_nic_set_features. As a result, __netdev_update_features considers that there is no feature difference, and the procedures of the real features are missing. Fixes: 2a7556bb2b73 ("net: hns3: implement ndo_features_check ops for hns3 driver") Signed-off-by: Hao Lan Signed-off-by: Jian Shen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 8760b4e9ade6..b09f0cca34dc 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2483,7 +2483,6 @@ static int hns3_nic_set_features(struct net_device *netdev, return ret; } - netdev->features = features; return 0; } -- cgit From 2758f18a83ef283d50c0566d3f672621cc658a1a Mon Sep 17 00:00:00 2001 From: Hao Lan Date: Fri, 25 Oct 2024 17:29:34 +0800 Subject: net: hns3: Resolved the issue that the debugfs query result is inconsistent. This patch modifies the implementation of debugfs: When the user process stops unexpectedly, not all data of the file system is read. In this case, the save_buf pointer is not released. When the user process is called next time, save_buf is used to copy the cached data to the user space. As a result, the queried data is inconsistent. To solve this problem, determine whether the function is invoked for the first time based on the value of *ppos. If *ppos is 0, obtain the actual data. 
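The underlying pattern is the usual one for debugfs read handlers that cache a generated dump: rebuild the buffer only when *ppos is 0 (a fresh read sequence) and let subsequent calls drain the cached copy, so a partial read by an interrupted process cannot leave stale data behind. A generic sketch with hypothetical names (not the hns3 code):

#include <linux/debugfs.h>
#include <linux/fs.h>

struct example_dbg {                    /* hypothetical per-file state */
        char *buf;
        size_t len;
};

static int example_dbg_fill(struct example_dbg *dbg);   /* hypothetical producer */

static ssize_t example_dbg_read(struct file *filp, char __user *buffer,
                                size_t count, loff_t *ppos)
{
        struct example_dbg *dbg = filp->private_data;

        if (!*ppos) {
                /* First read of this sequence: (re)generate the dump. */
                int ret = example_dbg_fill(dbg);

                if (ret)
                        return ret;
        }

        /* Later reads simply copy out the cached buffer from *ppos. */
        return simple_read_from_buffer(buffer, count, ppos, dbg->buf, dbg->len);
}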
Fixes: 5e69ea7ee2a6 ("net: hns3: refactor the debugfs process") Signed-off-by: Hao Lan Signed-off-by: Guangwei Zhang Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 807eb3bbb11c..841e5af7b2be 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -1293,8 +1293,10 @@ static ssize_t hns3_dbg_read(struct file *filp, char __user *buffer, /* save the buffer addr until the last read operation */ *save_buf = read_buf; + } - /* get data ready for the first time to read */ + /* get data ready for the first time to read */ + if (!*ppos) { ret = hns3_dbg_read_cmd(dbg_data, hns3_dbg_cmd[index].cmd, read_buf, hns3_dbg_cmd[index].buf_len); if (ret) -- cgit From 5f62009ff10826fefa215da68831f42b0c36b6fb Mon Sep 17 00:00:00 2001 From: Jian Shen Date: Fri, 25 Oct 2024 17:29:35 +0800 Subject: net: hns3: don't auto enable misc vector Currently, there is a time window between misc irq enabled and service task inited. If an interrupte is reported at this time, it will cause warning like below: [ 16.324639] Call trace: [ 16.324641] __queue_delayed_work+0xb8/0xe0 [ 16.324643] mod_delayed_work_on+0x78/0xd0 [ 16.324655] hclge_errhand_task_schedule+0x58/0x90 [hclge] [ 16.324662] hclge_misc_irq_handle+0x168/0x240 [hclge] [ 16.324666] __handle_irq_event_percpu+0x64/0x1e0 [ 16.324667] handle_irq_event+0x80/0x170 [ 16.324670] handle_fasteoi_edge_irq+0x110/0x2bc [ 16.324671] __handle_domain_irq+0x84/0xfc [ 16.324673] gic_handle_irq+0x88/0x2c0 [ 16.324674] el1_irq+0xb8/0x140 [ 16.324677] arch_cpu_idle+0x18/0x40 [ 16.324679] default_idle_call+0x5c/0x1bc [ 16.324682] cpuidle_idle_call+0x18c/0x1c4 [ 16.324684] do_idle+0x174/0x17c [ 16.324685] cpu_startup_entry+0x30/0x6c [ 16.324687] secondary_start_kernel+0x1a4/0x280 [ 16.324688] ---[ end trace 6aa0bff672a964aa ]--- So don't auto enable misc vector when request irq.. 
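The shape of the fix is the standard pattern for interrupts whose handler depends on state that is only initialised later during probe: request the IRQ with auto-enabling suppressed, and enable it explicitly once everything the handler touches exists. A generic sketch with hypothetical names (the driver's actual code may arrange this differently):

#include <linux/interrupt.h>
#include <linux/irq.h>

struct example_dev {
        int irq;
        /* ... timers and delayed works the handler schedules ... */
};

static irqreturn_t example_misc_irq_handle(int irq, void *data);  /* hypothetical */

static int example_misc_irq_init(struct example_dev *edev)
{
        int ret;

        /* Keep the line masked so the handler cannot run before the
         * delayed works and timers it schedules are initialised.
         */
        irq_set_status_flags(edev->irq, IRQ_NOAUTOEN);

        ret = request_irq(edev->irq, example_misc_irq_handle, 0,
                          "example-misc", edev);
        if (ret)
                return ret;

        /* ... initialise service task, reset timer, state bits ... */

        enable_irq(edev->irq);          /* only now may the handler run */
        return 0;
}

On teardown the inverse order applies: disable_irq() or free_irq() first, then tear down the state the handler uses.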
Fixes: 7be1b9f3e99f ("net: hns3: make hclge_service use delayed workqueue") Signed-off-by: Jian Shen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 35c618c794be..728f4777e51f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -3780,7 +3781,7 @@ static int hclge_misc_irq_init(struct hclge_dev *hdev) snprintf(hdev->misc_vector.name, HNAE3_INT_NAME_LEN, "%s-misc-%s", HCLGE_NAME, pci_name(hdev->pdev)); ret = request_irq(hdev->misc_vector.vector_irq, hclge_misc_irq_handle, - 0, hdev->misc_vector.name, hdev); + IRQ_NOAUTOEN, hdev->misc_vector.name, hdev); if (ret) { hclge_free_vector(hdev, 0); dev_err(&hdev->pdev->dev, "request misc irq(%d) fail\n", @@ -11916,9 +11917,6 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev) hclge_init_rxd_adv_layout(hdev); - /* Enable MISC vector(vector0) */ - hclge_enable_vector(&hdev->misc_vector, true); - ret = hclge_init_wol(hdev); if (ret) dev_warn(&pdev->dev, @@ -11931,6 +11929,10 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev) hclge_state_init(hdev); hdev->last_reset_time = jiffies; + /* Enable MISC vector(vector0) */ + enable_irq(hdev->misc_vector.vector_irq); + hclge_enable_vector(&hdev->misc_vector, true); + dev_info(&hdev->pdev->dev, "%s driver initialization finished.\n", HCLGE_DRIVER_NAME); @@ -12336,7 +12338,7 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev *ae_dev) /* Disable MISC vector(vector0) */ hclge_enable_vector(&hdev->misc_vector, false); - synchronize_irq(hdev->misc_vector.vector_irq); + disable_irq(hdev->misc_vector.vector_irq); /* Disable all hw interrupts */ hclge_config_mac_tnl_int(hdev, false); -- cgit From d1c2e2961ab460ac2433ff8ad46000582abc573c Mon Sep 17 00:00:00 2001 From: Jian Shen Date: Fri, 25 Oct 2024 17:29:36 +0800 Subject: net: hns3: initialize reset_timer before hclgevf_misc_irq_init() Currently the misc irq is initialized before reset_timer setup. But it will access the reset_timer in the irq handler. So initialize the reset_timer earlier. 
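This is the same init-ordering rule as in the previous patch: everything an interrupt handler may dereference or schedule has to exist before the interrupt can be delivered. Roughly, with hypothetical names rather than the driver's exact sequence:

        /* 1. Initialise what the handler can reach ... */
        timer_setup(&edev->reset_timer, example_reset_timer_fn, 0);
        INIT_DELAYED_WORK(&edev->service_task, example_service_task);

        /* 2. ... and only then allow the interrupt to be delivered. */
        ret = request_irq(edev->irq, example_misc_irq_handle, 0,
                          "example-misc", edev);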
Fixes: ff200099d271 ("net: hns3: remove unnecessary work in hclgevf_main") Signed-off-by: Jian Shen Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index ab54e6155e93..896f1eb172d3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -2315,6 +2315,7 @@ static void hclgevf_state_init(struct hclgevf_dev *hdev) clear_bit(HCLGEVF_STATE_RST_FAIL, &hdev->state); INIT_DELAYED_WORK(&hdev->service_task, hclgevf_service_task); + timer_setup(&hdev->reset_timer, hclgevf_reset_timer, 0); mutex_init(&hdev->mbx_resp.mbx_mutex); sema_init(&hdev->reset_sem, 1); @@ -3014,7 +3015,6 @@ static int hclgevf_init_hdev(struct hclgevf_dev *hdev) HCLGEVF_DRIVER_NAME); hclgevf_task_schedule(hdev, round_jiffies_relative(HZ)); - timer_setup(&hdev->reset_timer, hclgevf_reset_timer, 0); return 0; -- cgit From 3e22b7de34cbdb991a2c9c5413eeb8a6fb7da2a5 Mon Sep 17 00:00:00 2001 From: Hao Lan Date: Fri, 25 Oct 2024 17:29:37 +0800 Subject: net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue The TQP BAR space is divided into two segments. TQPs 0-1023 and TQPs 1024-1279 are in different BAR space addresses. However, hclge_fetch_pf_reg does not distinguish the tqp space information when reading the tqp space information. When the number of TQPs is greater than 1024, access bar space overwriting occurs. The problem of different segments has been considered during the initialization of tqp.io_base. Therefore, tqp.io_base is directly used when the queue is read in hclge_fetch_pf_reg. 
The error message: Unable to handle kernel paging request at virtual address ffff800037200000 pc : hclge_fetch_pf_reg+0x138/0x250 [hclge] lr : hclge_get_regs+0x84/0x1d0 [hclge] Call trace: hclge_fetch_pf_reg+0x138/0x250 [hclge] hclge_get_regs+0x84/0x1d0 [hclge] hns3_get_regs+0x2c/0x50 [hns3] ethtool_get_regs+0xf4/0x270 dev_ethtool+0x674/0x8a0 dev_ioctl+0x270/0x36c sock_do_ioctl+0x110/0x2a0 sock_ioctl+0x2ac/0x530 __arm64_sys_ioctl+0xa8/0x100 invoke_syscall+0x4c/0x124 el0_svc_common.constprop.0+0x140/0x15c do_el0_svc+0x30/0xd0 el0_svc+0x1c/0x2c el0_sync_handler+0xb0/0xb4 el0_sync+0x168/0x180 Fixes: 939ccd107ffc ("net: hns3: move dump regs function to a separate file") Signed-off-by: Hao Lan Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_regs.c | 9 +++++---- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_regs.c | 9 +++++---- 2 files changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_regs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_regs.c index 43c1c18fa81f..8c057192aae6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_regs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_regs.c @@ -510,9 +510,9 @@ out: static int hclge_fetch_pf_reg(struct hclge_dev *hdev, void *data, struct hnae3_knic_private_info *kinfo) { -#define HCLGE_RING_REG_OFFSET 0x200 #define HCLGE_RING_INT_REG_OFFSET 0x4 + struct hnae3_queue *tqp; int i, j, reg_num; int data_num_sum; u32 *reg = data; @@ -533,10 +533,11 @@ static int hclge_fetch_pf_reg(struct hclge_dev *hdev, void *data, reg_num = ARRAY_SIZE(ring_reg_addr_list); for (j = 0; j < kinfo->num_tqps; j++) { reg += hclge_reg_get_tlv(HCLGE_REG_TAG_RING, reg_num, reg); + tqp = kinfo->tqp[j]; for (i = 0; i < reg_num; i++) - *reg++ = hclge_read_dev(&hdev->hw, - ring_reg_addr_list[i] + - HCLGE_RING_REG_OFFSET * j); + *reg++ = readl_relaxed(tqp->io_base - + HCLGE_TQP_REG_OFFSET + + ring_reg_addr_list[i]); } data_num_sum += (reg_num + HCLGE_REG_TLV_SPACE) * kinfo->num_tqps; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_regs.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_regs.c index 6db415d8b917..7d9d9dbc7560 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_regs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_regs.c @@ -123,10 +123,10 @@ int hclgevf_get_regs_len(struct hnae3_handle *handle) void hclgevf_get_regs(struct hnae3_handle *handle, u32 *version, void *data) { -#define HCLGEVF_RING_REG_OFFSET 0x200 #define HCLGEVF_RING_INT_REG_OFFSET 0x4 struct hclgevf_dev *hdev = hclgevf_ae_get_hdev(handle); + struct hnae3_queue *tqp; int i, j, reg_um; u32 *reg = data; @@ -147,10 +147,11 @@ void hclgevf_get_regs(struct hnae3_handle *handle, u32 *version, reg_um = ARRAY_SIZE(ring_reg_addr_list); for (j = 0; j < hdev->num_tqps; j++) { reg += hclgevf_reg_get_tlv(HCLGEVF_REG_TAG_RING, reg_um, reg); + tqp = &hdev->htqp[j].q; for (i = 0; i < reg_um; i++) - *reg++ = hclgevf_read_dev(&hdev->hw, - ring_reg_addr_list[i] + - HCLGEVF_RING_REG_OFFSET * j); + *reg++ = readl_relaxed(tqp->io_base - + HCLGEVF_TQP_REG_OFFSET + + ring_reg_addr_list[i]); } reg_um = ARRAY_SIZE(tqp_intr_reg_addr_list); -- cgit From 2cf246143519ecc11dab754385ec42d78b6b6a05 Mon Sep 17 00:00:00 2001 From: Jie Wang Date: Fri, 25 Oct 2024 17:29:38 +0800 Subject: net: hns3: fix kernel crash when 1588 is sent on HIP08 devices Currently, HIP08 devices does not register the ptp devices, so the hdev->ptp is NULL. 
But the tx process would still try to set hardware time stamp info with SKBTX_HW_TSTAMP flag and cause a kernel crash. [ 128.087798] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018 ... [ 128.280251] pc : hclge_ptp_set_tx_info+0x2c/0x140 [hclge] [ 128.286600] lr : hclge_ptp_set_tx_info+0x20/0x140 [hclge] [ 128.292938] sp : ffff800059b93140 [ 128.297200] x29: ffff800059b93140 x28: 0000000000003280 [ 128.303455] x27: ffff800020d48280 x26: ffff0cb9dc814080 [ 128.309715] x25: ffff0cb9cde93fa0 x24: 0000000000000001 [ 128.315969] x23: 0000000000000000 x22: 0000000000000194 [ 128.322219] x21: ffff0cd94f986000 x20: 0000000000000000 [ 128.328462] x19: ffff0cb9d2a166c0 x18: 0000000000000000 [ 128.334698] x17: 0000000000000000 x16: ffffcf1fc523ed24 [ 128.340934] x15: 0000ffffd530a518 x14: 0000000000000000 [ 128.347162] x13: ffff0cd6bdb31310 x12: 0000000000000368 [ 128.353388] x11: ffff0cb9cfbc7070 x10: ffff2cf55dd11e02 [ 128.359606] x9 : ffffcf1f85a212b4 x8 : ffff0cd7cf27dab0 [ 128.365831] x7 : 0000000000000a20 x6 : ffff0cd7cf27d000 [ 128.372040] x5 : 0000000000000000 x4 : 000000000000ffff [ 128.378243] x3 : 0000000000000400 x2 : ffffcf1f85a21294 [ 128.384437] x1 : ffff0cb9db520080 x0 : ffff0cb9db500080 [ 128.390626] Call trace: [ 128.393964] hclge_ptp_set_tx_info+0x2c/0x140 [hclge] [ 128.399893] hns3_nic_net_xmit+0x39c/0x4c4 [hns3] [ 128.405468] xmit_one.constprop.0+0xc4/0x200 [ 128.410600] dev_hard_start_xmit+0x54/0xf0 [ 128.415556] sch_direct_xmit+0xe8/0x634 [ 128.420246] __dev_queue_xmit+0x224/0xc70 [ 128.425101] dev_queue_xmit+0x1c/0x40 [ 128.429608] ovs_vport_send+0xac/0x1a0 [openvswitch] [ 128.435409] do_output+0x60/0x17c [openvswitch] [ 128.440770] do_execute_actions+0x898/0x8c4 [openvswitch] [ 128.446993] ovs_execute_actions+0x64/0xf0 [openvswitch] [ 128.453129] ovs_dp_process_packet+0xa0/0x224 [openvswitch] [ 128.459530] ovs_vport_receive+0x7c/0xfc [openvswitch] [ 128.465497] internal_dev_xmit+0x34/0xb0 [openvswitch] [ 128.471460] xmit_one.constprop.0+0xc4/0x200 [ 128.476561] dev_hard_start_xmit+0x54/0xf0 [ 128.481489] __dev_queue_xmit+0x968/0xc70 [ 128.486330] dev_queue_xmit+0x1c/0x40 [ 128.490856] ip_finish_output2+0x250/0x570 [ 128.495810] __ip_finish_output+0x170/0x1e0 [ 128.500832] ip_finish_output+0x3c/0xf0 [ 128.505504] ip_output+0xbc/0x160 [ 128.509654] ip_send_skb+0x58/0xd4 [ 128.513892] udp_send_skb+0x12c/0x354 [ 128.518387] udp_sendmsg+0x7a8/0x9c0 [ 128.522793] inet_sendmsg+0x4c/0x8c [ 128.527116] __sock_sendmsg+0x48/0x80 [ 128.531609] __sys_sendto+0x124/0x164 [ 128.536099] __arm64_sys_sendto+0x30/0x5c [ 128.540935] invoke_syscall+0x50/0x130 [ 128.545508] el0_svc_common.constprop.0+0x10c/0x124 [ 128.551205] do_el0_svc+0x34/0xdc [ 128.555347] el0_svc+0x20/0x30 [ 128.559227] el0_sync_handler+0xb8/0xc0 [ 128.563883] el0_sync+0x160/0x180 Fixes: 0bf5eb788512 ("net: hns3: add support for PTP") Signed-off-by: Jie Wang Signed-off-by: Jijie Shao Signed-off-by: Paolo Abeni --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_ptp.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_ptp.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_ptp.c index 5505caea88e9..bab16c2191b2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_ptp.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_ptp.c @@ -58,6 +58,9 @@ bool hclge_ptp_set_tx_info(struct hnae3_handle *handle, struct sk_buff *skb) struct hclge_dev *hdev = vport->back; struct hclge_ptp *ptp = hdev->ptp; + if (!ptp) + return false; + if 
(!test_bit(HCLGE_PTP_FLAG_TX_EN, &ptp->flags) || test_and_set_bit(HCLGE_STATE_PTP_TX_HANDLING, &hdev->state)) { ptp->tx_skipped++; -- cgit From a14968aea637bbe38a99e6089944e4ad8e6c49e5 Mon Sep 17 00:00:00 2001 From: Suraj Sonawane Date: Sat, 26 Oct 2024 14:36:42 +0530 Subject: gpio: fix uninit-value in swnode_find_gpio Fix an issue detected by the Smatch tool: drivers/gpio/gpiolib-swnode.c:78 swnode_find_gpio() error: uninitialized symbol 'ret'. The issue occurs because the 'ret' variable may be used without initialization if the for_each_gpio_property_name loop does not run. This could lead to returning an undefined value, causing unpredictable behavior. Initialize 'ret' to 0 before the loop to ensure the function returns an error code if no properties are parsed, maintaining proper error handling. Fixes: 9e4c6c1ad ("Merge tag 'io_uring-6.12-20241011' of git://git.kernel.dk/linux") Signed-off-by: Suraj Sonawane Link: https://lore.kernel.org/r/20241026090642.28633-1-surajsonawane0215@gmail.com Signed-off-by: Bartosz Golaszewski --- drivers/gpio/gpiolib-swnode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpio/gpiolib-swnode.c b/drivers/gpio/gpiolib-swnode.c index 2b2dd7e92211..51d2475c05c5 100644 --- a/drivers/gpio/gpiolib-swnode.c +++ b/drivers/gpio/gpiolib-swnode.c @@ -64,7 +64,7 @@ struct gpio_desc *swnode_find_gpio(struct fwnode_handle *fwnode, struct fwnode_reference_args args; struct gpio_desc *desc; char propname[32]; /* 32 is max size of property name */ - int ret; + int ret = 0; swnode = to_software_node(fwnode); if (!swnode) -- cgit From cb08a0265917bc2943bf68c1760058660882e394 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Wed, 23 Oct 2024 03:16:57 +0900 Subject: kbuild: rpm-pkg: disable kernel-devel package when cross-compiling Since commit f1d87664b82a ("kbuild: cross-compile linux-headers package when possible"), 'make binrpm-pkg' may attempt to cross-compile the kernel-devel package, but it fails under certain circumstances. For example, when CONFIG_MODULE_SIG_FORMAT is enabled on openSUSE Tumbleweed, the following command fails: $ make ARCH=arm64 CROSS_COMPILE=aarch64-suse-linux- binrpm-pkg [ snip ] Rebuilding host programs with aarch64-suse-linux-gcc... HOSTCC /home/masahiro/ref/linux/rpmbuild/BUILDROOT/kernel-6.12.0_rc4-1.aarch64/usr/src/kernels/6.12.0-rc4/scripts/kallsyms HOSTCC /home/masahiro/ref/linux/rpmbuild/BUILDROOT/kernel-6.12.0_rc4-1.aarch64/usr/src/kernels/6.12.0-rc4/scripts/sorttable HOSTCC /home/masahiro/ref/linux/rpmbuild/BUILDROOT/kernel-6.12.0_rc4-1.aarch64/usr/src/kernels/6.12.0-rc4/scripts/asn1_compiler HOSTCC /home/masahiro/ref/linux/rpmbuild/BUILDROOT/kernel-6.12.0_rc4-1.aarch64/usr/src/kernels/6.12.0-rc4/scripts/sign-file /home/masahiro/ref/linux/rpmbuild/BUILDROOT/kernel-6.12.0_rc4-1.aarch64/usr/src/kernels/6.12.0-rc4/scripts/sign-file.c:25:10: fatal error: openssl/opensslv.h: No such file or directory 25 | #include | ^~~~~~~~~~~~~~~~~~~~ compilation terminated. I believe this issue is less common on Fedora because the disto's cross- compilier cannot link user-space programs. Hence, CONFIG_CC_CAN_LINK is unset. On Fedora 40, the package information explains this limitation clearly: $ dnf info gcc-aarch64-linux-gnu [ snip ] Description : Cross-build GNU C compiler. : : Only building kernels is currently supported. Support for cross-building : user space programs is not currently provided as that would massively multiply : the number of packages. Anyway, cross-compiling RPM packages is somewhat challenging. 
This commit disables the kernel-devel package when cross-compiling because I did not come up with a better solution. Fixes: f1d87664b82a ("kbuild: cross-compile linux-headers package when possible") Signed-off-by: Masahiro Yamada Reviewed-by: Nathan Chancellor --- scripts/Makefile.package | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/scripts/Makefile.package b/scripts/Makefile.package index 11d53f240a2b..74bcb9e7f7a4 100644 --- a/scripts/Makefile.package +++ b/scripts/Makefile.package @@ -62,6 +62,10 @@ rpm-sources: linux.tar.gz PHONY += rpm-pkg srcrpm-pkg binrpm-pkg +ifneq ($(CC),$(HOSTCC)) +rpm-no-devel = --without=devel +endif + rpm-pkg: private build-type := a srcrpm-pkg: private build-type := s binrpm-pkg: private build-type := b @@ -72,7 +76,8 @@ rpm-pkg srcrpm-pkg binrpm-pkg: rpmbuild/SPECS/kernel.spec --define='_topdir $(abspath rpmbuild)' \ $(if $(filter a b, $(build-type)), \ --target $(UTS_MACHINE)-linux --build-in-place --noprep --define='_smp_mflags %{nil}' \ - $$(rpm -q rpm >/dev/null 2>&1 || echo --nodeps)) \ + $$(rpm -q rpm >/dev/null 2>&1 || echo --nodeps) \ + $(rpm-no-devel)) \ $(RPMOPTS)) # deb-pkg srcdeb-pkg bindeb-pkg -- cgit From e2c318225ac13083cdcb4780cdf5b90edaa8644d Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Wed, 23 Oct 2024 03:16:58 +0900 Subject: kbuild: deb-pkg: add pkg.linux-upstream.nokernelheaders build profile Since commit f1d87664b82a ("kbuild: cross-compile linux-headers package when possible"), 'make bindeb-pkg' may attempt to cross-compile the linux-headers package, but it fails under certain circumstances. For example, when CONFIG_MODULE_SIG_FORMAT is enabled on Debian, the following command fails: $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- bindeb-pkg [ snip ] Rebuilding host programs with aarch64-linux-gnu-gcc... HOSTCC debian/linux-headers-6.12.0-rc4/usr/src/linux-headers-6.12.0-rc4/scripts/kallsyms HOSTCC debian/linux-headers-6.12.0-rc4/usr/src/linux-headers-6.12.0-rc4/scripts/sorttable HOSTCC debian/linux-headers-6.12.0-rc4/usr/src/linux-headers-6.12.0-rc4/scripts/asn1_compiler HOSTCC debian/linux-headers-6.12.0-rc4/usr/src/linux-headers-6.12.0-rc4/scripts/sign-file In file included from /usr/include/openssl/opensslv.h:109, from debian/linux-headers-6.12.0-rc4/usr/src/linux-headers-6.12.0-rc4/scripts/sign-file.c:25: /usr/include/openssl/macros.h:14:10: fatal error: openssl/opensslconf.h: No such file or directory 14 | #include | ^~~~~~~~~~~~~~~~~~~~~~~ compilation terminated. This commit adds a new profile, pkg.linux-upstream.nokernelheaders, to guard the linux-headers package. There are two options to fix the above issue. Option 1: Set the pkg.linux-upstream.nokernelheaders build profile $ DEB_BUILD_PROFILES=pkg.linux-upstream.nokernelheaders \ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- bindeb-pkg This skips the building of the linux-headers package. Option 2: Install the necessary build dependencies If you want to cross-compile the linux-headers package, you need to install additional packages. 
For example, on Debian, the packages necessary for cross-compiling it to arm64 can be installed with the following commands: # dpkg --add-architecture arm64 # apt update # apt install gcc-aarch64-linux-gnu libssl-dev:arm64 Fixes: f1d87664b82a ("kbuild: cross-compile linux-headers package when possible") Reported-by: Ron Economos Closes: https://lore.kernel.org/all/b3d4f49e-7ddb-29ba-0967-689232329b53@w6rz.net/ Signed-off-by: Masahiro Yamada Tested-by: Ron Economos Reviewed-by: Nicolas Schier --- scripts/package/builddeb | 2 +- scripts/package/install-extmod-build | 6 ++---- scripts/package/mkdebian | 9 ++++++++- 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/scripts/package/builddeb b/scripts/package/builddeb index 404587fc71fe..441b0bb66e0d 100755 --- a/scripts/package/builddeb +++ b/scripts/package/builddeb @@ -123,7 +123,7 @@ install_kernel_headers () { pdir=debian/$1 version=${1#linux-headers-} - "${srctree}/scripts/package/install-extmod-build" "${pdir}/usr/src/linux-headers-${version}" + CC="${DEB_HOST_GNU_TYPE}-gcc" "${srctree}/scripts/package/install-extmod-build" "${pdir}/usr/src/linux-headers-${version}" mkdir -p $pdir/lib/modules/$version/ ln -s /usr/src/linux-headers-$version $pdir/lib/modules/$version/build diff --git a/scripts/package/install-extmod-build b/scripts/package/install-extmod-build index d2c9cacecc0c..7ec1f061a519 100755 --- a/scripts/package/install-extmod-build +++ b/scripts/package/install-extmod-build @@ -44,13 +44,11 @@ mkdir -p "${destdir}" fi } | tar -c -f - -T - | tar -xf - -C "${destdir}" -# When ${CC} and ${HOSTCC} differ, we are likely cross-compiling. Rebuild host -# programs using ${CC}. This assumes CC=${CROSS_COMPILE}gcc, which is usually -# the case for package building. It does not cross-compile when CC=clang. +# When ${CC} and ${HOSTCC} differ, rebuild host programs using ${CC}. # # This caters to host programs that participate in Kbuild. objtool and # resolve_btfids are out of scope. -if [ "${CC}" != "${HOSTCC}" ] && is_enabled CONFIG_CC_CAN_LINK; then +if [ "${CC}" != "${HOSTCC}" ]; then echo "Rebuilding host programs with ${CC}..." cat <<-'EOF' > "${destdir}/Kbuild" diff --git a/scripts/package/mkdebian b/scripts/package/mkdebian index 10637d403777..93eb50356ddb 100755 --- a/scripts/package/mkdebian +++ b/scripts/package/mkdebian @@ -179,6 +179,8 @@ fi echo $debarch > debian/arch +host_gnu=$(dpkg-architecture -a "${debarch}" -q DEB_HOST_GNU_TYPE | sed 's/_/-/g') + # Generate a simple changelog template cat < debian/changelog $sourcename ($packageversion) $distribution; urgency=low @@ -196,7 +198,11 @@ Priority: optional Maintainer: $maintainer Rules-Requires-Root: no Build-Depends: debhelper-compat (= 12) -Build-Depends-Arch: bc, bison, cpio, flex, kmod, libelf-dev:native, libssl-dev:native, rsync +Build-Depends-Arch: bc, bison, cpio, flex, + gcc-${host_gnu} , + kmod, libelf-dev:native, + libssl-dev:native, libssl-dev , + rsync Homepage: https://www.kernel.org/ Package: $packagename-$version @@ -224,6 +230,7 @@ cat <> debian/control Package: linux-headers-$version Architecture: $debarch +Build-Profiles: Description: Linux kernel headers for $version on $debarch This package provides kernel header files for $version on $debarch . -- cgit From 2ad7126c5190864e928154ef74e0ae6cbdcea783 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Wed, 23 Oct 2024 03:16:59 +0900 Subject: kbuild: deb-pkg: add pkg.linux-upstream.nokerneldbg build profile The Debian kernel supports the pkg.linux.nokerneldbg build profile. 
The debug package tends to become huge, and you may not want to build it even when CONFIG_DEBUG_INFO is enabled. This commit introduces a similar profile for the upstream kernel. Signed-off-by: Masahiro Yamada Reviewed-by: Nicolas Schier --- scripts/package/mkdebian | 1 + 1 file changed, 1 insertion(+) diff --git a/scripts/package/mkdebian b/scripts/package/mkdebian index 93eb50356ddb..fc3b7fa709fc 100755 --- a/scripts/package/mkdebian +++ b/scripts/package/mkdebian @@ -245,6 +245,7 @@ cat <> debian/control Package: linux-image-$version-dbg Section: debug Architecture: $debarch +Build-Profiles: Description: Linux kernel debugging symbols for $version This package will come in handy if you need to debug the kernel. It provides all the necessary debug symbols for the kernel and its modules. -- cgit From d01661e1f422f071279417c6a21d9d7989844d25 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 27 Oct 2024 02:55:50 +0900 Subject: kconfig: show sub-menu entries even if the prompt is hidden Since commit f79dc03fe68c ("kconfig: refactor choice value calculation"), when EXPERT is disabled, nothing within the "if INPUT" ... "endif" block in drivers/input/Kconfig is displayed. This issue affects all command-line interfaces and GUI frontends. The prompt for INPUT is hidden when EXPERT is disabled. Previously, menu_is_visible() returned true in this case; however, it now returns false, resulting in all sub-menu entries being skipped. Here is a simplified test case illustrating the issue: config A bool "A" if X default y config B bool "B" depends on A When X is disabled, A becomes unconfigurable and is forced to y. B should be displayed, as its dependency is met. This commit restores the necessary code, so menu_is_visible() functions as it did previously. Fixes: f79dc03fe68c ("kconfig: refactor choice value calculation") Reported-by: Edmund Raile Closes: https://lore.kernel.org/all/5fd0dfc7ff171aa74352e638c276069a5f2e888d.camel@proton.me/ Signed-off-by: Masahiro Yamada --- scripts/kconfig/menu.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/scripts/kconfig/menu.c b/scripts/kconfig/menu.c index 4addd33749bb..6587ac86d0d5 100644 --- a/scripts/kconfig/menu.c +++ b/scripts/kconfig/menu.c @@ -533,6 +533,7 @@ bool menu_is_empty(struct menu *menu) bool menu_is_visible(struct menu *menu) { + struct menu *child; struct symbol *sym; tristate visible; @@ -551,7 +552,17 @@ bool menu_is_visible(struct menu *menu) } else visible = menu->prompt->visible.tri = expr_calc_value(menu->prompt->visible.expr); - return visible != no; + if (visible != no) + return true; + + if (!sym || sym_get_tristate_value(menu->sym) == no) + return false; + + for (child = menu->list; child; child = child->next) + if (menu_is_visible(child)) + return true; + + return false; } const char *menu_get_prompt(const struct menu *menu) -- cgit From 90bad749858cf88d80af7c2b23f86db4f7ad61c2 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Oct 2024 19:36:52 +0200 Subject: gpio: sloppy-logic-analyzer: Check for error code from devm_mutex_init() call Even if it's not critical, the avoidance of checking the error code from devm_mutex_init() call today diminishes the point of using devm variant of it. Tomorrow it may even leak something. Add the missed check. 
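For context: when mutex debugging is enabled, devm_mutex_init() registers a devres action so the mutex is destroyed together with the device, and that registration itself can fail (typically with -ENOMEM), which is why ignoring the return value defeats the purpose of the devm variant. Minimal usage sketch, matching the pattern applied below:

        ret = devm_mutex_init(dev, &priv->blob_lock);
        if (ret)
                return ret;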
Fixes: 7828b7bbbf20 ("gpio: add sloppy logic analyzer using polling") Reviewed-by: Wolfram Sang Signed-off-by: Andy Shevchenko Link: https://lore.kernel.org/r/20241030174132.2113286-3-andriy.shevchenko@linux.intel.com Signed-off-by: Bartosz Golaszewski --- drivers/gpio/gpio-sloppy-logic-analyzer.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpio/gpio-sloppy-logic-analyzer.c b/drivers/gpio/gpio-sloppy-logic-analyzer.c index 07e0d7180579..59a8f3a5c4e4 100644 --- a/drivers/gpio/gpio-sloppy-logic-analyzer.c +++ b/drivers/gpio/gpio-sloppy-logic-analyzer.c @@ -234,7 +234,9 @@ static int gpio_la_poll_probe(struct platform_device *pdev) if (!priv) return -ENOMEM; - devm_mutex_init(dev, &priv->blob_lock); + ret = devm_mutex_init(dev, &priv->blob_lock); + if (ret) + return ret; fops_buf_size_set(priv, GPIO_LA_DEFAULT_BUF_SIZE); -- cgit From 993ca0eccec65a2cacc3cefb15d35ffadc6f00fb Mon Sep 17 00:00:00 2001 From: Matthew Brost Date: Wed, 23 Oct 2024 15:12:00 -0700 Subject: drm/xe: Add mmio read before GGTT invalidate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On LNL without a mmio read before a GGTT invalidate the GuC can incorrectly read the GGTT scratch page upon next access leading to jobs not getting scheduled. A mmio read before a GGTT invalidate seems to fix this. Since a GGTT invalidate is not a hot code path, blindly do a mmio read before each GGTT invalidate. Cc: John Harrison Cc: Daniele Ceraolo Spurio Cc: Thomas Hellström Cc: Lucas De Marchi Cc: stable@vger.kernel.org Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reported-by: Paulo Zanoni Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3164 Signed-off-by: Matthew Brost Reviewed-by: Lucas De Marchi Link: https://patchwork.freedesktop.org/patch/msgid/20241023221200.1797832-1-matthew.brost@intel.com Signed-off-by: Lucas De Marchi (cherry picked from commit 5a710196883e0ac019ac6df2a6d79c16ad3c32fa) [ Fix conflict with mmio vs gt argument ] Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_ggtt.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c index 2895f154654c..ff19eca5d358 100644 --- a/drivers/gpu/drm/xe/xe_ggtt.c +++ b/drivers/gpu/drm/xe/xe_ggtt.c @@ -397,6 +397,16 @@ static void ggtt_invalidate_gt_tlb(struct xe_gt *gt) static void xe_ggtt_invalidate(struct xe_ggtt *ggtt) { + struct xe_device *xe = tile_to_xe(ggtt->tile); + + /* + * XXX: Barrier for GGTT pages. Unsure exactly why this required but + * without this LNL is having issues with the GuC reading scratch page + * vs. correct GGTT page. Not particularly a hot code path so blindly + * do a mmio read here which results in GuC reading correct GGTT page. + */ + xe_mmio_read32(xe_root_mmio_gt(xe), VF_CAP_REG); + /* Each GT in a tile has its own TLB to cache GGTT lookups */ ggtt_invalidate_gt_tlb(ggtt->tile->primary_gt); ggtt_invalidate_gt_tlb(ggtt->tile->media_gt); -- cgit From fe05cee4d9533892210e1ee90147175d87e7c053 Mon Sep 17 00:00:00 2001 From: Matthew Brost Date: Fri, 25 Oct 2024 14:43:29 -0700 Subject: drm/xe: Don't short circuit TDR on jobs not started Short circuiting TDR on jobs not started is an optimization which is not required. On LNL we are facing an issue where jobs do not get scheduled by the GuC if it misses a GGTT page update. When this occurs let the TDR fire, toggle the scheduling which may get the job unstuck, and print a warning message. 
If the TDR fires twice on job that hasn't started, timeout the job. v2: - Add warning message (Paulo) - Add fixes tag (Paulo) - Timeout job which hasn't started after TDR firing twice v3: - Include local change v4: - Short circuit check_timeout on job not started - use warn level rather than notice (Paulo) Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Cc: stable@vger.kernel.org Cc: Paulo Zanoni Signed-off-by: Matthew Brost Reviewed-by: Lucas De Marchi Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com Signed-off-by: Lucas De Marchi (cherry picked from commit 35d25a4a0012e690ef0cc4c5440231176db595cc) Signed-off-by: Lucas De Marchi --- drivers/gpu/drm/xe/xe_guc_submit.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c index d333be9c4227..f903b0772722 100644 --- a/drivers/gpu/drm/xe/xe_guc_submit.c +++ b/drivers/gpu/drm/xe/xe_guc_submit.c @@ -916,12 +916,22 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w) static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job) { struct xe_gt *gt = guc_to_gt(exec_queue_to_guc(q)); - u32 ctx_timestamp = xe_lrc_ctx_timestamp(q->lrc[0]); - u32 ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]); + u32 ctx_timestamp, ctx_job_timestamp; u32 timeout_ms = q->sched_props.job_timeout_ms; u32 diff; u64 running_time_ms; + if (!xe_sched_job_started(job)) { + xe_gt_warn(gt, "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, not started", + xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job), + q->guc->id); + + return xe_sched_invalidate_job(job, 2); + } + + ctx_timestamp = xe_lrc_ctx_timestamp(q->lrc[0]); + ctx_job_timestamp = xe_lrc_ctx_job_timestamp(q->lrc[0]); + /* * Counter wraps at ~223s at the usual 19.2MHz, be paranoid catch * possible overflows with a high timeout. @@ -1049,10 +1059,6 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job) exec_queue_killed_or_banned_or_wedged(q) || exec_queue_destroyed(q); - /* Job hasn't started, can't be timed out */ - if (!skip_timeout_check && !xe_sched_job_started(job)) - goto rearm; - /* * XXX: Sampling timeout doesn't work in wedged mode as we have to * modify scheduling state to read timestamp. We could read the -- cgit From 1d60d74e852647255bd8e76f5a22dc42531e4389 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Thu, 31 Oct 2024 08:05:44 -0600 Subject: io_uring/rw: fix missing NOWAIT check for O_DIRECT start write When io_uring starts a write, it'll call kiocb_start_write() to bump the super block rwsem, preventing any freezes from happening while that write is in-flight. The freeze side will grab that rwsem for writing, excluding any new writers from happening and waiting for existing writes to finish. But io_uring unconditionally uses kiocb_start_write(), which will block if someone is currently attempting to freeze the mount point. This causes a deadlock where freeze is waiting for previous writes to complete, but the previous writes cannot complete, as the task that is supposed to complete them is blocked waiting on starting a new write. 
This results in the following stuck trace showing that dependency with the write blocked starting a new write: task:fio state:D stack:0 pid:886 tgid:886 ppid:876 Call trace: __switch_to+0x1d8/0x348 __schedule+0x8e8/0x2248 schedule+0x110/0x3f0 percpu_rwsem_wait+0x1e8/0x3f8 __percpu_down_read+0xe8/0x500 io_write+0xbb8/0xff8 io_issue_sqe+0x10c/0x1020 io_submit_sqes+0x614/0x2110 __arm64_sys_io_uring_enter+0x524/0x1038 invoke_syscall+0x74/0x268 el0_svc_common.constprop.0+0x160/0x238 do_el0_svc+0x44/0x60 el0_svc+0x44/0xb0 el0t_64_sync_handler+0x118/0x128 el0t_64_sync+0x168/0x170 INFO: task fsfreeze:7364 blocked for more than 15 seconds. Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963 with the attempting freezer stuck trying to grab the rwsem: task:fsfreeze state:D stack:0 pid:7364 tgid:7364 ppid:995 Call trace: __switch_to+0x1d8/0x348 __schedule+0x8e8/0x2248 schedule+0x110/0x3f0 percpu_down_write+0x2b0/0x680 freeze_super+0x248/0x8a8 do_vfs_ioctl+0x149c/0x1b18 __arm64_sys_ioctl+0xd0/0x1a0 invoke_syscall+0x74/0x268 el0_svc_common.constprop.0+0x160/0x238 do_el0_svc+0x44/0x60 el0_svc+0x44/0xb0 el0t_64_sync_handler+0x118/0x128 el0t_64_sync+0x168/0x170 Fix this by having the io_uring side honor IOCB_NOWAIT, and only attempt a blocking grab of the super block rwsem if it isn't set. For normal issue where IOCB_NOWAIT would always be set, this returns -EAGAIN which will have io_uring core issue a blocking attempt of the write. That will in turn also get completions run, ensuring forward progress. Since freezing requires CAP_SYS_ADMIN in the first place, this isn't something that can be triggered by a regular user. Cc: stable@vger.kernel.org # 5.10+ Reported-by: Peter Mann Link: https://lore.kernel.org/io-uring/38c94aec-81c9-4f62-b44e-1d87f5597644@sh.cz Signed-off-by: Jens Axboe --- io_uring/rw.c | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/io_uring/rw.c b/io_uring/rw.c index 354c4e175654..155938f10093 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -1014,6 +1014,25 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags) return IOU_OK; } +static bool io_kiocb_start_write(struct io_kiocb *req, struct kiocb *kiocb) +{ + struct inode *inode; + bool ret; + + if (!(req->flags & REQ_F_ISREG)) + return true; + if (!(kiocb->ki_flags & IOCB_NOWAIT)) { + kiocb_start_write(kiocb); + return true; + } + + inode = file_inode(kiocb->ki_filp); + ret = sb_start_write_trylock(inode->i_sb); + if (ret) + __sb_writers_release(inode->i_sb, SB_FREEZE_WRITE); + return ret; +} + int io_write(struct io_kiocb *req, unsigned int issue_flags) { bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; @@ -1051,8 +1070,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(ret)) return ret; - if (req->flags & REQ_F_ISREG) - kiocb_start_write(kiocb); + if (unlikely(!io_kiocb_start_write(req, kiocb))) + return -EAGAIN; kiocb->ki_flags |= IOCB_WRITE; if (likely(req->file->f_op->write_iter)) -- cgit From c40dd8c4732551605712985bc5b7045094c6458d Mon Sep 17 00:00:00 2001 From: Toke Høiland-Jørgensen Date: Wed, 30 Oct 2024 11:48:26 +0100 Subject: bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The test_run code detects whether a page has been modified and re-initialises the xdp_frame structure if it has, using xdp_update_frame_from_buff(). 
However, xdp_update_frame_from_buff() doesn't touch frame->mem, so that wasn't correctly re-initialised, which led to the pages from page_pool not being returned correctly. Syzbot noticed this as a memory leak. Fix this by also copying the frame->mem structure when re-initialising the frame, like we do on initialisation of a new page from page_pool. Fixes: e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: syzbot+d121e098da06af416d23@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen Signed-off-by: Daniel Borkmann Tested-by: syzbot+d121e098da06af416d23@syzkaller.appspotmail.com Reviewed-by: Alexander Lobakin Acked-by: Stanislav Fomichev Link: https://lore.kernel.org/bpf/20241030-test-run-mem-fix-v1-1-41e88e8cae43@redhat.com --- net/bpf/test_run.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 6d7a442ceb89..501ec4249fed 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -246,6 +246,7 @@ static void reset_ctx(struct xdp_page_head *head) head->ctx.data_meta = head->orig_ctx.data_meta; head->ctx.data_end = head->orig_ctx.data_end; xdp_update_frame_from_buff(&head->ctx, head->frame); + head->frame->mem = head->orig_ctx.rxq->mem; } static int xdp_recv_frames(struct xdp_frame **frames, int nframes, -- cgit From a0f0625390858321525c2a8d04e174a546bd19b3 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Mon, 28 Oct 2024 16:23:00 +0000 Subject: btrfs: fix extent map merging not happening for adjacent extents If we have 3 or more adjacent extents in a file, that is, consecutive file extent items pointing to adjacent extents, within a contiguous file range and compatible flags, we end up not merging all the extents into a single extent map. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -d -c "pwrite -b 64K 0 64K" \ -c "pwrite -b 64K 64K 64K" \ -c "pwrite -b 64K 128K 64K" \ -c "pwrite -b 64K 192K 64K" \ /mnt/sdc/foo After all the ordered extents complete we unpin the extent maps and try to merge them, but instead of getting a single extent map we get two because: 1) When the first ordered extent completes (file range [0, 64K)) we unpin its extent map and attempt to merge it with the extent map for the range [64K, 128K), but we can't because that extent map is still pinned; 2) When the second ordered extent completes (file range [64K, 128K)), we unpin its extent map and merge it with the previous extent map, for file range [0, 64K), but we can't merge with the next extent map, for the file range [128K, 192K), because this one is still pinned. The merged extent map for the file range [0, 128K) gets the flag EXTENT_MAP_MERGED set; 3) When the third ordered extent completes (file range [128K, 192K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [0, 128K), but we can't because that extent map has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false due to different flags) while the extent map for the range [128K, 192K) doesn't have that flag set. We also can't merge it with the next extent map, for file range [192K, 256K), because that one is still pinned. At this moment we have 3 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 192K). 
One for file range [192K, 256K) which is still pinned; 4) When the fourth and final extent completes (file range [192K, 256K)), we unpin its extent map and attempt to merge it with the previous extent map, for file range [128K, 192K), which succeeds since none of these extent maps have the EXTENT_MAP_MERGED flag set. So we end up with 2 extent maps: One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set. One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set. Since after merging extent maps we don't attempt to merge again, that is, merge the resulting extent map with the one that is now preceding it (and the one following it), we end up with those two extent maps, when we could have had a single extent map to represent the whole file. Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag. While this doesn't present any functional issue, it prevents the merging of extent maps which allows to save memory, and can make defrag not merging extents too (that will be addressed in the next patch). Fixes: 199257a78bb0 ("btrfs: defrag: don't use merged extent map for their generation check") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/extent_map.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 668c617444a5..1d93e1202c33 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -230,7 +230,12 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma if (extent_map_end(prev) != next->start) return false; - if (prev->flags != next->flags) + /* + * The merged flag is not an on-disk flag, it just indicates we had the + * extent maps of 2 (or more) adjacent extents merged, so factor it out. + */ + if ((prev->flags & ~EXTENT_FLAG_MERGED) != + (next->flags & ~EXTENT_FLAG_MERGED)) return false; if (next->disk_bytenr < EXTENT_MAP_LAST_BYTE - 1) -- cgit From 77b0d113eec49a7390ff1a08ca1923e89f5f86c6 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 29 Oct 2024 15:18:45 +0000 Subject: btrfs: fix defrag not merging contiguous extents due to merged extent maps When running defrag (manual defrag) against a file that has extents that are contiguous and we already have the respective extent maps loaded and merged, we end up not defragging the range covered by those contiguous extents. This happens when we have an extent map that was the result of merging multiple extent maps for contiguous extents and the length of the merged extent map is greater than or equals to the defrag threshold length. The script below reproduces this scenario: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create a 256K file with 4 extents of 64K each. xfs_io -f -c "falloc 0 64K" \ -c "pwrite 0 64K" \ -c "falloc 64K 64K" \ -c "pwrite 64K 64K" \ -c "falloc 128K 64K" \ -c "pwrite 128K 64K" \ -c "falloc 192K 64K" \ -c "pwrite 192K 64K" \ $MNT/foo umount $MNT echo -n "Initial number of file extent items: " btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l mount $DEV $MNT # Read the whole file in order to load and merge extent maps. cat $MNT/foo > /dev/null btrfs filesystem defragment -t 128K $MNT/foo umount $MNT echo -n "Number of file extent items after defrag with 128K threshold: " btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l mount $DEV $MNT # Read the whole file in order to load and merge extent maps. 
cat $MNT/foo > /dev/null btrfs filesystem defragment -t 256K $MNT/foo umount $MNT echo -n "Number of file extent items after defrag with 256K threshold: " btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l Running it: $ ./test.sh Initial number of file extent items: 4 Number of file extent items after defrag with 128K threshold: 4 Number of file extent items after defrag with 256K threshold: 4 The 4 extents don't get merged because we have an extent map with a size of 256K that is the result of merging the individual extent maps for each of the four 64K extents and at defrag_lookup_extent() we have a value of zero for the generation threshold ('newer_than' argument) since this is a manual defrag. As a consequence we don't call defrag_get_extent() to get an extent map representing a single file extent item in the inode's subvolume tree, so we end up using the merged extent map at defrag_collect_targets() and decide not to defrag. Fix this by updating defrag_lookup_extent() to always discard extent maps that were merged and call defrag_get_extent() regardless of the minimum generation threshold ('newer_than' argument). A test case for fstests will be sent along soon. CC: stable@vger.kernel.org # 6.1+ Fixes: 199257a78bb0 ("btrfs: defrag: don't use merged extent map for their generation check") Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/defrag.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c index b95ef44c326b..968dae953948 100644 --- a/fs/btrfs/defrag.c +++ b/fs/btrfs/defrag.c @@ -763,12 +763,12 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start, * We can get a merged extent, in that case, we need to re-search * tree to get the original em for defrag. * - * If @newer_than is 0 or em::generation < newer_than, we can trust - * this em, as either we don't care about the generation, or the - * merged extent map will be rejected anyway. + * This is because even if we have adjacent extents that are contiguous + * and compatible (same type and flags), we still want to defrag them + * so that we use less metadata (extent items in the extent tree and + * file extent items in the inode's subvolume tree). */ - if (em && (em->flags & EXTENT_FLAG_MERGED) && - newer_than && em->generation >= newer_than) { + if (em && (em->flags & EXTENT_FLAG_MERGED)) { free_extent_map(em); em = NULL; } -- cgit From 3e8b7238b427e05498034c240451af5f5495afda Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Mon, 28 Oct 2024 13:49:58 +0100 Subject: gpiolib: fix debugfs newline separators The gpiolib debugfs interface exports a list of all gpio chips in a system and the state of their pins. The gpio chip sections are supposed to be separated by a newline character, but a long-standing bug prevents the separator from being included when output is generated in multiple sessions, making the output inconsistent and hard to read. Make sure to only suppress the newline separator at the beginning of the file as intended. 
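The pattern is generic to seq_file users: ->start() can be re-entered in a new read session with a non-zero *pos, so any "an entry has already been printed" state must be derived from *pos rather than assumed to start out false. A minimal sketch of the idea (the demo_* names and struct demo_priv are hypothetical, not the gpiolib code; only the seq_file interfaces are real):

  struct demo_priv {
          bool newline;
  };

  static void *demo_seq_start(struct seq_file *s, loff_t *pos)
  {
          struct demo_priv *priv = s->private;

          /* entries were already printed by an earlier read session */
          if (*pos > 0)
                  priv->newline = true;

          return demo_lookup_entry(*pos);         /* hypothetical lookup */
  }

  static int demo_seq_show(struct seq_file *s, void *v)
  {
          struct demo_priv *priv = s->private;

          /* separator before every entry except the very first in the file */
          seq_printf(s, "%s%s:\n", priv->newline ? "\n" : "", demo_name(v));
          priv->newline = true;

          return 0;
  }

The fix below seeds exactly this state from *pos when a new session starts.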
Fixes: f9c4a31f6150 ("gpiolib: Use seq_file's iterator interface") Cc: stable@vger.kernel.org # 3.7 Cc: Thierry Reding Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20241028125000.24051-2-johan+linaro@kernel.org Signed-off-by: Bartosz Golaszewski --- drivers/gpio/gpiolib.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index d5952ab7752c..e27488a90bc9 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -4926,6 +4926,8 @@ static void *gpiolib_seq_start(struct seq_file *s, loff_t *pos) return NULL; s->private = priv; + if (*pos > 0) + priv->newline = true; priv->idx = srcu_read_lock(&gpio_devices_srcu); list_for_each_entry_srcu(gdev, &gpio_devices, list, -- cgit From 604888f8c3d01fddd9366161efc65cb3182831f1 Mon Sep 17 00:00:00 2001 From: Johan Hovold Date: Mon, 28 Oct 2024 13:49:59 +0100 Subject: gpiolib: fix debugfs dangling chip separator Add the missing newline after entries for recently removed gpio chips so that the chip sections are separated by a newline as intended. Fixes: e348544f7994 ("gpio: protect the list of GPIO devices with SRCU") Cc: stable@vger.kernel.org # 6.9 Cc: Bartosz Golaszewski Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20241028125000.24051-3-johan+linaro@kernel.org Signed-off-by: Bartosz Golaszewski --- drivers/gpio/gpiolib.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c index e27488a90bc9..2b02655abb56 100644 --- a/drivers/gpio/gpiolib.c +++ b/drivers/gpio/gpiolib.c @@ -4971,7 +4971,7 @@ static int gpiolib_seq_show(struct seq_file *s, void *v) gc = srcu_dereference(gdev->chip, &gdev->srcu); if (!gc) { - seq_printf(s, "%s%s: (dangling chip)", + seq_printf(s, "%s%s: (dangling chip)\n", priv->newline ? "\n" : "", dev_name(&gdev->dev)); return 0; -- cgit From 85d16bceaf5d8112c9ffcfedd2f1bb9d0a1c1578 Mon Sep 17 00:00:00 2001 From: Jarkko Sakkinen Date: Fri, 25 Oct 2024 21:15:28 +0300 Subject: mailmap: update Jarkko's email addresses Remove my previous work email, and the new one. The previous was never used in the commit log, so there's no good reason to spare it. Link: https://lkml.kernel.org/r/20241025181530.6151-1-jarkko@kernel.org Signed-off-by: Jarkko Sakkinen Cc: Alex Elder Cc: David S. Miller Cc: Geliang Tang Cc: Jiri Kosina Cc: Kees Cook Cc: Matthieu Baerts (NGI0) Cc: Matt Ranostay Cc: Neeraj Upadhyay Cc: Quentin Monnet Signed-off-by: Andrew Morton --- .mailmap | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index 9a94c514e32c..5cf2fab5fe83 100644 --- a/.mailmap +++ b/.mailmap @@ -282,7 +282,7 @@ Jan Glauber Jan Kuliga Jarkko Sakkinen Jarkko Sakkinen -Jarkko Sakkinen +Jarkko Sakkinen Jason Gunthorpe Jason Gunthorpe Jason Gunthorpe -- cgit From 35e41024c4c2b02ef8207f61b9004f6956cf037b Mon Sep 17 00:00:00 2001 From: Gregory Price Date: Fri, 25 Oct 2024 10:17:24 -0400 Subject: vmscan,migrate: fix page count imbalance on node stats when demoting pages When numa balancing is enabled with demotion, vmscan will call migrate_pages when shrinking LRUs. migrate_pages will decrement the the node's isolated page count, leading to an imbalanced count when invoked from (MG)LRU code. The result is dmesg output like such: $ cat /proc/sys/vm/stat_refresh [77383.088417] vmstat_refresh: nr_isolated_anon -103212 [77383.088417] vmstat_refresh: nr_isolated_file -899642 This negative value may impact compaction and reclaim throttling. 
The following path produces the decrement: shrink_folio_list demote_folio_list migrate_pages migrate_pages_batch migrate_folio_move migrate_folio_done mod_node_page_state(-ve) <- decrement This path happens for SUCCESSFUL migrations, not failures. Typically callers to migrate_pages are required to handle putback/accounting for failures, but this is already handled in the shrink code. When accounting for migrations, instead do not decrement the count when the migration reason is MR_DEMOTION. As of v6.11, this demotion logic is the only source of MR_DEMOTION. Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Gregory Price Reviewed-by: Yang Shi Reviewed-by: Davidlohr Bueso Reviewed-by: Shakeel Butt Reviewed-by: "Huang, Ying" Reviewed-by: Oscar Salvador Cc: Dave Hansen Cc: Wei Xu Cc: Signed-off-by: Andrew Morton --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 7e520562d421..fab84a776088 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1178,7 +1178,7 @@ static void migrate_folio_done(struct folio *src, * not accounted to NR_ISOLATED_*. They can be recognized * as __folio_test_movable */ - if (likely(!__folio_test_movable(src))) + if (likely(!__folio_test_movable(src)) && reason != MR_DEMOTION) mod_node_page_state(folio_pgdat(src), NR_ISOLATED_ANON + folio_is_file_lru(src), -folio_nr_pages(src)); -- cgit From 0173471d21ec964921f97ba4eca71af74beb29f7 Mon Sep 17 00:00:00 2001 From: Eugen Hristev Date: Fri, 25 Oct 2024 11:58:48 +0300 Subject: .mailmap: update e-mail address for Eugen Hristev Update e-mail address. Link: https://lkml.kernel.org/r/20241025085848.483149-1-eugen.hristev@linaro.org Signed-off-by: Eugen Hristev Signed-off-by: Andrew Morton --- .mailmap | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index 5cf2fab5fe83..5378f04b2566 100644 --- a/.mailmap +++ b/.mailmap @@ -199,7 +199,8 @@ Elliot Berman Enric Balletbo i Serra Enric Balletbo i Serra Erik Kaneda -Eugen Hristev +Eugen Hristev +Eugen Hristev Evgeniy Polyakov Ezequiel Garcia Faith Ekstrand -- cgit From 15e8156713cc38031642fafc8baf7d53f19f2e83 Mon Sep 17 00:00:00 2001 From: Chen Ridong Date: Fri, 25 Oct 2024 06:09:42 +0000 Subject: mm: shrinker: avoid memleak in alloc_shrinker_info A memleak was found as below: unreferenced object 0xffff8881010d2a80 (size 32): comm "mkdir", pid 1559, jiffies 4294932666 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 @............... backtrace (crc 2e7ef6fa): [] __kmalloc_node_noprof+0x394/0x470 [] alloc_shrinker_info+0x7b/0x1a0 [] mem_cgroup_css_online+0x11a/0x3b0 [] online_css+0x29/0xa0 [] cgroup_apply_control_enable+0x20d/0x360 [] cgroup_mkdir+0x168/0x5f0 [] kernfs_iop_mkdir+0x5e/0x90 [] vfs_mkdir+0x144/0x220 [] do_mkdirat+0x87/0x130 [] __x64_sys_mkdir+0x49/0x70 [] do_syscall_64+0x68/0x140 [] entry_SYSCALL_64_after_hwframe+0x76/0x7e alloc_shrinker_info(), when shrinker_unit_alloc() returns an errer, the info won't be freed. Just fix it. Link: https://lkml.kernel.org/r/20241025060942.1049263-1-chenridong@huaweicloud.com Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") Signed-off-by: Chen Ridong Acked-by: Qi Zheng Acked-by: Roman Gushchin Acked-by: Vlastimil Babka Acked-by: Kirill A. 
Shutemov Reviewed-by: Dave Chinner Cc: Anshuman Khandual Cc: Muchun Song Cc: Wang Weiyang Cc: Signed-off-by: Andrew Morton --- mm/shrinker.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/shrinker.c b/mm/shrinker.c index dc5d2a6fcfc4..4a93fd433689 100644 --- a/mm/shrinker.c +++ b/mm/shrinker.c @@ -76,19 +76,21 @@ void free_shrinker_info(struct mem_cgroup *memcg) int alloc_shrinker_info(struct mem_cgroup *memcg) { - struct shrinker_info *info; int nid, ret = 0; int array_size = 0; mutex_lock(&shrinker_mutex); array_size = shrinker_unit_size(shrinker_nr_max); for_each_node(nid) { - info = kvzalloc_node(sizeof(*info) + array_size, GFP_KERNEL, nid); + struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size, + GFP_KERNEL, nid); if (!info) goto err; info->map_nr_max = shrinker_nr_max; - if (shrinker_unit_alloc(info, NULL, nid)) + if (shrinker_unit_alloc(info, NULL, nid)) { + kvfree(info); goto err; + } rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info); } mutex_unlock(&shrinker_mutex); -- cgit From d4148aeab412432bf928f311eca8a2ba52bb05df Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 24 Oct 2024 17:12:29 +0200 Subject: mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") a mmap() of anonymous memory without a specific address hint and of at least PMD_SIZE will be aligned to PMD so that it can benefit from a THP backing page. However this change has been shown to regress some workloads significantly. [1] reports regressions in various spec benchmarks, with up to 600% slowdown of the cactusBSSN benchmark on some platforms. The benchmark seems to create many mappings of 4632kB, which would have merged to a large THP-backed area before commit efa7df3e3bb5 and now they are fragmented to multiple areas each aligned to PMD boundary with gaps between. The regression then seems to be caused mainly due to the benchmark's memory access pattern suffering from TLB or cache aliasing due to the aligned boundaries of the individual areas. Another known regression bisected to commit efa7df3e3bb5 is darktable [2] [3] and early testing suggests this patch fixes the regression there as well. To fix the regression but still try to benefit from THP-friendly anonymous mapping alignment, add a condition that the size of the mapping must be a multiple of PMD size instead of at least PMD size. In case of many odd-sized mapping like the cactusBSSN creates, those will stop being aligned and with gaps between, and instead naturally merge again. Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Vlastimil Babka Reported-by: Michael Matz Debugged-by: Gabriel Krisman Bertazi Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1] Reported-by: Matthias Bodenbinder Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2] Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3] Reviewed-by: Lorenzo Stoakes Reviewed-by: Yang Shi Cc: Rik van Riel Cc: Jann Horn Cc: Liam R. 
Howlett Cc: Petr Tesarik Cc: Thorsten Leemhuis Cc: Signed-off-by: Andrew Morton --- mm/mmap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/mmap.c b/mm/mmap.c index 1e0e34cb993f..9841b41e3c76 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, if (get_area) { addr = get_area(file, addr, len, pgoff, flags); - } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { + } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) + && IS_ALIGNED(len, PMD_SIZE)) { /* Ensures that larger anonymous mappings are THP aligned. */ addr = thp_get_unmapped_area_vmflags(file, addr, len, pgoff, flags, vm_flags); -- cgit From 071b24b54d2d05fbf39ddbb27dee08abd1d713f3 Mon Sep 17 00:00:00 2001 From: Dmitry Torokhov Date: Sun, 27 Oct 2024 22:31:15 -0700 Subject: Input: fix regression when re-registering input handlers Commit d469647bafd9 ("Input: simplify event handling logic") introduced code that would set handler->events() method to either input_handler_events_filter() or input_handler_events_default() or input_handler_events_null(), depending on the kind of input handler (a filter or a regular one) we are dealing with. Unfortunately this breaks cases when we try to re-register the same filter (as is the case with sysrq handler): after initial registration the handler will have 2 event handling methods defined, and will run afoul of the check in input_handler_check_methods(): input: input_handler_check_methods: only one event processing method can be defined (sysrq) sysrq: Failed to register input handler, error -22 Fix this by adding handle_events() method to input_handle structure and setting it up when registering a new input handle according to event handling methods defined in associated input_handler structure, thus avoiding modifying the input_handler structure. Reported-by: "Ned T. Crigler" Reported-by: Christian Heusel Tested-by: "Ned T. Crigler" Tested-by: Peter Seiderer Fixes: d469647bafd9 ("Input: simplify event handling logic") Link: https://lore.kernel.org/r/Zx2iQp6csn42PJA7@xavtug Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov --- drivers/input/input.c | 134 +++++++++++++++++++++++++++----------------------- include/linux/input.h | 10 +++- 2 files changed, 82 insertions(+), 62 deletions(-) diff --git a/drivers/input/input.c b/drivers/input/input.c index 3c321671793f..3d2cc13e1f32 100644 --- a/drivers/input/input.c +++ b/drivers/input/input.c @@ -119,12 +119,12 @@ static void input_pass_values(struct input_dev *dev, handle = rcu_dereference(dev->grab); if (handle) { - count = handle->handler->events(handle, vals, count); + count = handle->handle_events(handle, vals, count); } else { list_for_each_entry_rcu(handle, &dev->h_list, d_node) if (handle->open) { - count = handle->handler->events(handle, vals, - count); + count = handle->handle_events(handle, vals, + count); if (!count) break; } @@ -2537,57 +2537,6 @@ static int input_handler_check_methods(const struct input_handler *handler) return 0; } -/* - * An implementation of input_handler's events() method that simply - * invokes handler->event() method for each event one by one. 
- */ -static unsigned int input_handler_events_default(struct input_handle *handle, - struct input_value *vals, - unsigned int count) -{ - struct input_handler *handler = handle->handler; - struct input_value *v; - - for (v = vals; v != vals + count; v++) - handler->event(handle, v->type, v->code, v->value); - - return count; -} - -/* - * An implementation of input_handler's events() method that invokes - * handler->filter() method for each event one by one and removes events - * that were filtered out from the "vals" array. - */ -static unsigned int input_handler_events_filter(struct input_handle *handle, - struct input_value *vals, - unsigned int count) -{ - struct input_handler *handler = handle->handler; - struct input_value *end = vals; - struct input_value *v; - - for (v = vals; v != vals + count; v++) { - if (handler->filter(handle, v->type, v->code, v->value)) - continue; - if (end != v) - *end = *v; - end++; - } - - return end - vals; -} - -/* - * An implementation of input_handler's events() method that does nothing. - */ -static unsigned int input_handler_events_null(struct input_handle *handle, - struct input_value *vals, - unsigned int count) -{ - return count; -} - /** * input_register_handler - register a new input handler * @handler: handler to be registered @@ -2607,13 +2556,6 @@ int input_register_handler(struct input_handler *handler) INIT_LIST_HEAD(&handler->h_list); - if (handler->filter) - handler->events = input_handler_events_filter; - else if (handler->event) - handler->events = input_handler_events_default; - else if (!handler->events) - handler->events = input_handler_events_null; - error = mutex_lock_interruptible(&input_mutex); if (error) return error; @@ -2687,6 +2629,75 @@ int input_handler_for_each_handle(struct input_handler *handler, void *data, } EXPORT_SYMBOL(input_handler_for_each_handle); +/* + * An implementation of input_handle's handle_events() method that simply + * invokes handler->event() method for each event one by one. + */ +static unsigned int input_handle_events_default(struct input_handle *handle, + struct input_value *vals, + unsigned int count) +{ + struct input_handler *handler = handle->handler; + struct input_value *v; + + for (v = vals; v != vals + count; v++) + handler->event(handle, v->type, v->code, v->value); + + return count; +} + +/* + * An implementation of input_handle's handle_events() method that invokes + * handler->filter() method for each event one by one and removes events + * that were filtered out from the "vals" array. + */ +static unsigned int input_handle_events_filter(struct input_handle *handle, + struct input_value *vals, + unsigned int count) +{ + struct input_handler *handler = handle->handler; + struct input_value *end = vals; + struct input_value *v; + + for (v = vals; v != vals + count; v++) { + if (handler->filter(handle, v->type, v->code, v->value)) + continue; + if (end != v) + *end = *v; + end++; + } + + return end - vals; +} + +/* + * An implementation of input_handle's handle_events() method that does nothing. + */ +static unsigned int input_handle_events_null(struct input_handle *handle, + struct input_value *vals, + unsigned int count) +{ + return count; +} + +/* + * Sets up appropriate handle->event_handler based on the input_handler + * associated with the handle. 
+ */ +static void input_handle_setup_event_handler(struct input_handle *handle) +{ + struct input_handler *handler = handle->handler; + + if (handler->filter) + handle->handle_events = input_handle_events_filter; + else if (handler->event) + handle->handle_events = input_handle_events_default; + else if (handler->events) + handle->handle_events = handler->events; + else + handle->handle_events = input_handle_events_null; +} + /** * input_register_handle - register a new input handle * @handle: handle to register @@ -2704,6 +2715,7 @@ int input_register_handle(struct input_handle *handle) struct input_dev *dev = handle->dev; int error; + input_handle_setup_event_handler(handle); /* * We take dev->mutex here to prevent race with * input_release_device(). diff --git a/include/linux/input.h b/include/linux/input.h index 89a0be6ee0e2..cd866b020a01 100644 --- a/include/linux/input.h +++ b/include/linux/input.h @@ -339,12 +339,16 @@ struct input_handler { * @name: name given to the handle by handler that created it * @dev: input device the handle is attached to * @handler: handler that works with the device through this handle + * @handle_events: event sequence handler. It is set up by the input core + * according to event handling method specified in the @handler. See + * input_handle_setup_event_handler(). + * This method is being called by the input core with interrupts disabled + * and dev->event_lock spinlock held and so it may not sleep. * @d_node: used to put the handle on device's list of attached handles * @h_node: used to put the handle on handler's list of handles from which * it gets events */ struct input_handle { - void *private; int open; @@ -353,6 +357,10 @@ struct input_handle { struct input_dev *dev; struct input_handler *handler; + unsigned int (*handle_events)(struct input_handle *handle, + struct input_value *vals, + unsigned int count); + struct list_head d_node; struct list_head h_node; }; -- cgit From 2e766a1f5f94a142d9a906c9411d0f6101c4c721 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 3 Nov 2024 21:46:50 +0900 Subject: modpost: fix acpi MODULE_DEVICE_TABLE built with mismatched endianness When CONFIG_SATA_AHCI_PLATFORM=m, modpost outputs incorect acpi MODULE_ALIAS() if the endianness of the target and the build machine do not match. When the endianness of the target kernel and the build machine match, the output is correct: $ grep 'MODULE_ALIAS("acpi' drivers/ata/ahci_platform.mod.c MODULE_ALIAS("acpi*:APMC0D33:*"); MODULE_ALIAS("acpi*:010601:*"); However, when building a little-endian kernel on a big-endian machine (or vice versa), the output is incorrect: $ grep 'MODULE_ALIAS("acpi' drivers/ata/ahci_platform.mod.c MODULE_ALIAS("acpi*:APMC0D33:*"); MODULE_ALIAS("acpi*:0601??:*"); The 'cls' and 'cls_msk' fields are 32-bit. DEF_FIELD() must be used instead of DEF_FIELD_ADDR() to correctly handle endianness of these 32-bit fields. The check 'if (cls)' was unnecessary; it never became NULL, as it was the pointer to 'symval' plus the offset to the 'cls' field. 
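The underlying issue is independent of modpost: a 32-bit field stored in the target's byte order has to be converted to host byte order before its value or individual bytes are inspected. A standalone illustration (this is not the modpost code; to_native_le32() is a hand-rolled stand-in for what DEF_FIELD/TO_NATIVE provide, assuming a little-endian target):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* interpret 4 bytes stored in little-endian (target) order */
  static uint32_t to_native_le32(const unsigned char *p)
  {
          return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                 (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
  }

  int main(void)
  {
          /* class code 0x010601 as laid out in a little-endian module image */
          unsigned char raw[4] = { 0x01, 0x06, 0x01, 0x00 };
          uint32_t direct, converted = to_native_le32(raw);

          /* raw pointer-style read, analogous to using DEF_FIELD_ADDR */
          memcpy(&direct, raw, sizeof(direct));

          /* a little-endian host prints 010601 twice; a big-endian host
           * prints 1060100 for the raw read and 010601 for the converted one */
          printf("raw: %06x  converted: %06x\n", direct, converted);
          return 0;
  }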
Fixes: 26095a01d359 ("ACPI / scan: Add support for ACPI _CLS device matching") Signed-off-by: Masahiro Yamada --- scripts/mod/file2alias.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c index 99dce93a4188..16154449dde1 100644 --- a/scripts/mod/file2alias.c +++ b/scripts/mod/file2alias.c @@ -567,12 +567,12 @@ static int do_acpi_entry(const char *filename, void *symval, char *alias) { DEF_FIELD_ADDR(symval, acpi_device_id, id); - DEF_FIELD_ADDR(symval, acpi_device_id, cls); - DEF_FIELD_ADDR(symval, acpi_device_id, cls_msk); + DEF_FIELD(symval, acpi_device_id, cls); + DEF_FIELD(symval, acpi_device_id, cls_msk); if (id && strlen((const char *)*id)) sprintf(alias, "acpi*:%s:*", *id); - else if (cls) { + else { int i, byte_shift, cnt = 0; unsigned int msk; @@ -580,10 +580,10 @@ static int do_acpi_entry(const char *filename, cnt = 6; for (i = 1; i <= 3; i++) { byte_shift = 8 * (3-i); - msk = (*cls_msk >> byte_shift) & 0xFF; + msk = (cls_msk >> byte_shift) & 0xFF; if (msk) sprintf(&alias[cnt], "%02x", - (*cls >> byte_shift) & 0xFF); + (cls >> byte_shift) & 0xFF); else sprintf(&alias[cnt], "??"); cnt += 2; -- cgit From 77dc55a978e69625f9718460012e5ef0172dc4de Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 3 Nov 2024 21:52:57 +0900 Subject: modpost: fix input MODULE_DEVICE_TABLE() built for 64-bit on 32-bit host When building a 64-bit kernel on a 32-bit build host, incorrect input MODULE_ALIAS() entries may be generated. For example, when compiling a 64-bit kernel with CONFIG_INPUT_MOUSEDEV=m on a 64-bit build machine, you will get the correct output: $ grep MODULE_ALIAS drivers/input/mousedev.mod.c MODULE_ALIAS("input:b*v*p*e*-e*1,*2,*k*110,*r*0,*1,*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*2,*k*r*8,*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*3,*k*14A,*r*a*0,*1,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*3,*k*145,*r*a*0,*1,*18,*1C,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*3,*k*110,*r*a*0,*1,*m*l*s*f*w*"); However, building the same kernel on a 32-bit machine results in incorrect output: $ grep MODULE_ALIAS drivers/input/mousedev.mod.c MODULE_ALIAS("input:b*v*p*e*-e*1,*2,*k*110,*130,*r*0,*1,*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*2,*k*r*8,*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*3,*k*14A,*16A,*r*a*0,*1,*20,*21,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*3,*k*145,*165,*r*a*0,*1,*18,*1C,*20,*21,*38,*3C,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*3,*k*110,*130,*r*a*0,*1,*20,*21,*m*l*s*f*w*"); A similar issue occurs with CONFIG_INPUT_JOYDEV=m. 
On a 64-bit build machine, the output is: $ grep MODULE_ALIAS drivers/input/joydev.mod.c MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*0,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*2,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*8,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*6,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*k*120,*r*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*k*130,*r*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*k*2C0,*r*a*m*l*s*f*w*"); However, on a 32-bit machine, the output is incorrect: $ grep MODULE_ALIAS drivers/input/joydev.mod.c MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*0,*20,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*2,*22,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*8,*28,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*3,*k*r*a*6,*26,*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*k*11F,*13F,*r*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*k*11F,*13F,*r*a*m*l*s*f*w*"); MODULE_ALIAS("input:b*v*p*e*-e*1,*k*2C0,*2E0,*r*a*m*l*s*f*w*"); When building a 64-bit kernel, BITS_PER_LONG is defined as 64. However, on a 32-bit build machine, the constant 1L is a signed 32-bit value. Left-shifting it beyond 32 bits causes wraparound, and shifting by 31 or 63 bits makes it a negative value. The fix in commit e0e92632715f ("[PATCH] PATCH: 1 line 2.6.18 bugfix: modpost-64bit-fix.patch") is incorrect; it only addresses cases where a 64-bit kernel is built on a 64-bit build machine, overlooking cases on a 32-bit build machine. Using 1ULL ensures a 64-bit width on both 32-bit and 64-bit machines, avoiding the wraparound issue. Fixes: e0e92632715f ("[PATCH] PATCH: 1 line 2.6.18 bugfix: modpost-64bit-fix.patch") Signed-off-by: Masahiro Yamada --- scripts/mod/file2alias.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c index 16154449dde1..c4cc11aa558f 100644 --- a/scripts/mod/file2alias.c +++ b/scripts/mod/file2alias.c @@ -743,7 +743,7 @@ static void do_input(char *alias, for (i = min / BITS_PER_LONG; i < max / BITS_PER_LONG + 1; i++) arr[i] = TO_NATIVE(arr[i]); for (i = min; i < max; i++) - if (arr[i / BITS_PER_LONG] & (1L << (i%BITS_PER_LONG))) + if (arr[i / BITS_PER_LONG] & (1ULL << (i%BITS_PER_LONG))) sprintf(alias + strlen(alias), "%X,*", i); } -- cgit From ddd6d8e975b171ea3f63a011a75820883ff0d479 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Sat, 19 Oct 2024 01:29:38 +0000 Subject: mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats Patch series "mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK". Today, the MM_WALK capability causes MGLRU to clear the young bit from PMDs and PTEs during the page table walk before eviction, but MGLRU does not call the clear_young() MMU notifier in this case. By not calling this notifier, the MM walk takes less time/CPU, but it causes pages that are accessed mostly through KVM / secondary MMUs to appear younger than they should be. We do call the clear_young() notifier today, but only when attempting to evict the page, so we end up clearing young/accessed information less frequently for secondary MMUs than for mm PTEs, and therefore they appear younger and are less likely to be evicted. Therefore, memory that is *not* being accessed mostly by KVM will be evicted *more* frequently, worsening performance. 
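The distinction at the heart of the series, as a condensed sketch (the wrapper and its name are illustrative; ptep_clear_young_notify() is the existing kernel primitive the second patch switches MGLRU's walk and look-around paths to):

  /* Conceptual sketch, not a real kernel function: age one PTE so that both
   * the primary MMU and any secondary MMUs forget the access. */
  static bool pte_age_everywhere(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *pte)
  {
          /* clears the accessed bit in the page table entry *and* invokes the
           * clear_young MMU notifier, so KVM's accessed state is cleared too */
          return ptep_clear_young_notify(vma, addr, pte);
  }

ptep_test_and_clear_young(), by contrast, only ages the primary page tables, so secondary-MMU accessed state is cleared far less often and VM-backed memory keeps looking younger than it really is.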
ChromeOS observed a tab-open latency regression when enabling MGLRU with a setup that involved running a VM: Tab-open latency histogram (ms) Version p50 mean p95 p99 max base 1315 1198 2347 3454 10319 mglru 2559 1311 7399 12060 43758 fix 1119 926 2470 4211 6947 This series replaces the final non-selftest patchs from this series[1], which introduced a similar change (and a new MMU notifier) with KVM optimizations. I'll send a separate series (to Sean and Paolo) for the KVM optimizations. This series also makes proactive reclaim with MGLRU possible for KVM memory. I have verified that this functions correctly with the selftest from [1], but given that that test is a KVM selftest, I'll send it with the rest of the KVM optimizations later. Andrew, let me know if you'd like to take the test now anyway. [1]: https://lore.kernel.org/linux-mm/20240926013506.860253-18-jthoughton@google.com/ This patch (of 2): The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very helpful and become more complicated to properly compute when adding test/clear_young() notifiers in MGLRU's mm walk. Link: https://lkml.kernel.org/r/20241019012940.3656292-1-jthoughton@google.com Link: https://lkml.kernel.org/r/20241019012940.3656292-2-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao Signed-off-by: James Houghton Cc: Axel Rasmussen Cc: David Matlack Cc: David Rientjes Cc: David Stevens Cc: Oliver Upton Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Wei Xu Cc: Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 2 -- mm/vmscan.c | 14 +++++--------- 2 files changed, 5 insertions(+), 11 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 17506e4a2835..9342e5692dab 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -458,9 +458,7 @@ struct lru_gen_folio { enum { MM_LEAF_TOTAL, /* total leaf entries */ - MM_LEAF_OLD, /* old leaf entries */ MM_LEAF_YOUNG, /* young leaf entries */ - MM_NONLEAF_TOTAL, /* total non-leaf entries */ MM_NONLEAF_FOUND, /* non-leaf entries found in Bloom filters */ MM_NONLEAF_ADDED, /* non-leaf entries added to Bloom filters */ NR_MM_STATS diff --git a/mm/vmscan.c b/mm/vmscan.c index eb4e8440c507..4f1d33e4b360 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3399,7 +3399,6 @@ restart: continue; if (!pte_young(ptent)) { - walk->mm_stats[MM_LEAF_OLD]++; continue; } @@ -3552,7 +3551,6 @@ restart: walk->mm_stats[MM_LEAF_TOTAL]++; if (!pmd_young(val)) { - walk->mm_stats[MM_LEAF_OLD]++; continue; } @@ -3564,8 +3562,6 @@ restart: continue; } - walk->mm_stats[MM_NONLEAF_TOTAL]++; - if (!walk->force_scan && should_clear_pmd_young()) { if (!pmd_young(val)) continue; @@ -5254,11 +5250,11 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, for (tier = 0; tier < MAX_NR_TIERS; tier++) { seq_printf(m, " %10d", tier); for (type = 0; type < ANON_AND_FILE; type++) { - const char *s = " "; + const char *s = "xxx"; unsigned long n[3] = {}; if (seq == max_seq) { - s = "RT "; + s = "RTx"; n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]); n[1] = READ_ONCE(lrugen->avg_total[type][tier]); } else if (seq == min_seq[type] || NR_HIST_GENS > 1) { @@ -5280,14 +5276,14 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec, seq_puts(m, " "); for (i = 0; i < NR_MM_STATS; i++) { - const char *s = " "; + const char *s = "xxxx"; unsigned long n = 0; if (seq == max_seq && NR_HIST_GENS == 1) { - s = "LOYNFA"; + s = "TYFA"; n = READ_ONCE(mm_state->stats[hist][i]); } 
else if (seq != max_seq && NR_HIST_GENS > 1) { - s = "loynfa"; + s = "tyfa"; n = READ_ONCE(mm_state->stats[hist][i]); } -- cgit From 1d4832becdc2cdb2cffe2a6050c9d9fd8ff1c58c Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Sat, 19 Oct 2024 01:29:39 +0000 Subject: mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() When the MM_WALK capability is enabled, memory that is mostly accessed by a VM appears younger than it really is, therefore this memory will be less likely to be evicted. Therefore, the presence of a running VM can significantly increase swap-outs for non-VM memory, regressing the performance for the rest of the system. Fix this regression by always calling {ptep,pmdp}_clear_young_notify() whenever we clear the young bits on PMDs/PTEs. [jthoughton@google.com: fix link-time error] Link: https://lkml.kernel.org/r/20241019012940.3656292-3-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao Signed-off-by: James Houghton Reported-by: David Stevens Cc: Axel Rasmussen Cc: David Matlack Cc: David Rientjes Cc: Oliver Upton Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Wei Xu Cc: Cc: kernel test robot Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 5 +-- mm/rmap.c | 9 ++---- mm/vmscan.c | 88 ++++++++++++++++++++++++++++---------------------- 3 files changed, 55 insertions(+), 47 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 9342e5692dab..5b1c984daf45 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -555,7 +555,7 @@ struct lru_gen_memcg { void lru_gen_init_pgdat(struct pglist_data *pgdat); void lru_gen_init_lruvec(struct lruvec *lruvec); -void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw); void lru_gen_init_memcg(struct mem_cgroup *memcg); void lru_gen_exit_memcg(struct mem_cgroup *memcg); @@ -574,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec) { } -static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { + return false; } static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) diff --git a/mm/rmap.c b/mm/rmap.c index a8797d1b3d49..73d5998677d4 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -885,13 +885,10 @@ static bool folio_referenced_one(struct folio *folio, return false; } - if (pvmw.pte) { - if (lru_gen_enabled() && - pte_young(ptep_get(pvmw.pte))) { - lru_gen_look_around(&pvmw); + if (lru_gen_enabled() && pvmw.pte) { + if (lru_gen_look_around(&pvmw)) referenced++; - } - + } else if (pvmw.pte) { if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) referenced++; diff --git a/mm/vmscan.c b/mm/vmscan.c index 4f1d33e4b360..ddaaff67642e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -56,6 +56,7 @@ #include #include #include +#include #include #include @@ -3294,7 +3295,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk return false; } -static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr) +static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr, + struct pglist_data *pgdat) { unsigned long pfn = pte_pfn(pte); @@ -3306,13 +3308,20 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte))) return -1; + if (!pte_young(pte) && !mm_has_notifiers(vma->vm_mm)) + return -1; + if 
(WARN_ON_ONCE(!pfn_valid(pfn))) return -1; + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + return -1; + return pfn; } -static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr) +static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr, + struct pglist_data *pgdat) { unsigned long pfn = pmd_pfn(pmd); @@ -3324,9 +3333,15 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned if (WARN_ON_ONCE(pmd_devmap(pmd))) return -1; + if (!pmd_young(pmd) && !mm_has_notifiers(vma->vm_mm)) + return -1; + if (WARN_ON_ONCE(!pfn_valid(pfn))) return -1; + if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) + return -1; + return pfn; } @@ -3335,10 +3350,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, { struct folio *folio; - /* try to avoid unnecessary memory loads */ - if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) - return NULL; - folio = pfn_folio(pfn); if (folio_nid(folio) != pgdat->node_id) return NULL; @@ -3394,20 +3405,16 @@ restart: total++; walk->mm_stats[MM_LEAF_TOTAL]++; - pfn = get_pte_pfn(ptent, args->vma, addr); + pfn = get_pte_pfn(ptent, args->vma, addr, pgdat); if (pfn == -1) continue; - if (!pte_young(ptent)) { - continue; - } - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); if (!folio) continue; - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) - VM_WARN_ON_ONCE(true); + if (!ptep_clear_young_notify(args->vma, addr, pte + i)) + continue; young++; walk->mm_stats[MM_LEAF_YOUNG]++; @@ -3473,21 +3480,25 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area /* don't round down the first address */ addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first; - pfn = get_pmd_pfn(pmd[i], vma, addr); - if (pfn == -1) + if (!pmd_present(pmd[i])) goto next; if (!pmd_trans_huge(pmd[i])) { - if (!walk->force_scan && should_clear_pmd_young()) + if (!walk->force_scan && should_clear_pmd_young() && + !mm_has_notifiers(args->mm)) pmdp_test_and_clear_young(vma, addr, pmd + i); goto next; } + pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat); + if (pfn == -1) + goto next; + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); if (!folio) goto next; - if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + if (!pmdp_clear_young_notify(vma, addr, pmd + i)) goto next; walk->mm_stats[MM_LEAF_YOUNG]++; @@ -3545,24 +3556,18 @@ restart: } if (pmd_trans_huge(val)) { - unsigned long pfn = pmd_pfn(val); struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec); + unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat); walk->mm_stats[MM_LEAF_TOTAL]++; - if (!pmd_young(val)) { - continue; - } - - /* try to avoid unnecessary memory loads */ - if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) - continue; - - walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first); + if (pfn != -1) + walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first); continue; } - if (!walk->force_scan && should_clear_pmd_young()) { + if (!walk->force_scan && should_clear_pmd_young() && + !mm_has_notifiers(args->mm)) { if (!pmd_young(val)) continue; @@ -4036,13 +4041,13 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) * the PTE table to the Bloom filter. This forms a feedback loop between the * eviction and the aging. 
*/ -void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { int i; unsigned long start; unsigned long end; struct lru_gen_mm_walk *walk; - int young = 0; + int young = 1; pte_t *pte = pvmw->pte; unsigned long addr = pvmw->address; struct vm_area_struct *vma = pvmw->vma; @@ -4058,12 +4063,15 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) lockdep_assert_held(pvmw->ptl); VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio); + if (!ptep_clear_young_notify(vma, addr, pte)) + return false; + if (spin_is_contended(pvmw->ptl)) - return; + return true; /* exclude special VMAs containing anon pages from COW */ if (vma->vm_flags & VM_SPECIAL) - return; + return true; /* avoid taking the LRU lock under the PTL when possible */ walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL; @@ -4071,6 +4079,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) start = max(addr & PMD_MASK, vma->vm_start); end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1; + if (end - start == PAGE_SIZE) + return true; + if (end - start > MIN_LRU_BATCH * PAGE_SIZE) { if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2) end = start + MIN_LRU_BATCH * PAGE_SIZE; @@ -4084,7 +4095,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) /* folio_update_gen() requires stable folio_memcg() */ if (!mem_cgroup_trylock_pages(memcg)) - return; + return true; arch_enter_lazy_mmu_mode(); @@ -4094,19 +4105,16 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) unsigned long pfn; pte_t ptent = ptep_get(pte + i); - pfn = get_pte_pfn(ptent, vma, addr); + pfn = get_pte_pfn(ptent, vma, addr, pgdat); if (pfn == -1) continue; - if (!pte_young(ptent)) - continue; - folio = get_pfn_folio(pfn, memcg, pgdat, can_swap); if (!folio) continue; - if (!ptep_test_and_clear_young(vma, addr, pte + i)) - VM_WARN_ON_ONCE(true); + if (!ptep_clear_young_notify(vma, addr, pte + i)) + continue; young++; @@ -4136,6 +4144,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) /* feedback from rmap walkers to page table walkers */ if (mm_state && suitable_to_scan(i, young)) update_bloom_filter(mm_state, max_seq, pvmw->pmd); + + return true; } /****************************************************************************** -- cgit From 59b723cd2adbac2a34fc8e12c74ae26ae45bf230 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 3 Nov 2024 14:05:52 -1000 Subject: Linux 6.12-rc6 --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index 5e04e4abffd8..b8efbfe9da94 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ VERSION = 6 PATCHLEVEL = 12 SUBLEVEL = 0 -EXTRAVERSION = -rc5 +EXTRAVERSION = -rc6 NAME = Baby Opossum Posse # *DOCUMENTATION* -- cgit From 2d0f2a648147d6bbf0655e03500586a6712a7281 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Fri, 4 Oct 2024 18:01:53 -0400 Subject: KVM: selftests: memslot_perf_test: increase guest sync timeout When memslot_perf_test is run nested, first iteration of test_memslot_rw_loop testcase, sometimes takes more than 2 seconds due to build of shadow page tables. Following iterations are fast. To be on the safe side, bump the timeout to 10 seconds. 
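For reference, the timeout in question is the alarm() watchdog armed around the host/guest sync loop. A simplified standalone sketch of that pattern (loosely mirroring host_perform_sync() in the diff below; the guest side, which clears the flag, is omitted):

  #include <stdatomic.h>
  #include <unistd.h>

  static atomic_bool sync_flag;

  void host_wait_for_guest(void)
  {
          /* if the guest never responds, SIGALRM terminates the test instead
           * of letting it hang; the timeout must cover the slowest legitimate
           * iteration, e.g. nested shadow page table construction */
          alarm(10);

          atomic_store_explicit(&sync_flag, true, memory_order_release);
          while (atomic_load_explicit(&sync_flag, memory_order_acquire))
                  ;                               /* guest clears the flag */

          alarm(0);                               /* disarm the watchdog */
  }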
Signed-off-by: Maxim Levitsky Tested-by: Liam Merwick Reviewed-by: Liam Merwick Link: https://lore.kernel.org/r/20241004220153.287459-1-mlevitsk@redhat.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/memslot_perf_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/memslot_perf_test.c b/tools/testing/selftests/kvm/memslot_perf_test.c index 989ffe0d047f..e3711beff7f3 100644 --- a/tools/testing/selftests/kvm/memslot_perf_test.c +++ b/tools/testing/selftests/kvm/memslot_perf_test.c @@ -417,7 +417,7 @@ static bool _guest_should_exit(void) */ static noinline void host_perform_sync(struct sync_area *sync) { - alarm(2); + alarm(10); atomic_store_explicit(&sync->sync_flag, true, memory_order_release); while (atomic_load_explicit(&sync->sync_flag, memory_order_acquire)) -- cgit From 945bdae20be5a13f1fcdcb14ec356dcbeee35839 Mon Sep 17 00:00:00 2001 From: Patrick Roy Date: Thu, 24 Oct 2024 10:59:53 +0100 Subject: KVM: selftests: fix unintentional noop test in guest_memfd_test.c The loop in test_create_guest_memfd_invalid() that is supposed to test that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was initializing `flag` as 0 instead of BIT(0). This caused the loop to immediately exit instead of iterating over BIT(0), BIT(1), ... . Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()") Signed-off-by: Patrick Roy Reviewed-by: James Gowans Reviewed-by: Muhammad Usama Anjum Link: https://lore.kernel.org/r/20241024095956.3668818-1-roypat@amazon.co.uk Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/guest_memfd_test.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index ba0c8e996035..ce687f8d248f 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm) size); } - for (flag = 0; flag; flag <<= 1) { + for (flag = BIT(0); flag; flag <<= 1) { fd = __vm_create_guest_memfd(vm, page_size, flag); TEST_ASSERT(fd == -1 && errno == EINVAL, "guest_memfd() with flag '0x%lx' should fail with EINVAL", -- cgit From 5b188cc4866aaf712e896f92ac42c7802135e507 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Wed, 9 Oct 2024 08:49:41 -0700 Subject: KVM: selftests: Disable strict aliasing Disable strict aliasing, as has been done in the kernel proper for decades (literally since before git history) to fix issues where gcc will optimize away loads in code that looks 100% correct, but is _technically_ undefined behavior, and thus can be thrown away by the compiler. E.g. arm64's vPMU counter access test casts a uint64_t (unsigned long) pointer to a u64 (unsigned long long) pointer when setting PMCR.N via u64p_replace_bits(), which gcc-13 detects and optimizes away, i.e. ignores the result and uses the original PMCR. The issue is most easily observed by making set_pmcr_n() noinline and wrapping the call with printf(), e.g. 
sans comments, for this code: printf("orig = %lx, next = %lx, want = %lu\n", pmcr_orig, pmcr, pmcr_n); set_pmcr_n(&pmcr, pmcr_n); printf("orig = %lx, next = %lx, want = %lu\n", pmcr_orig, pmcr, pmcr_n); gcc-13 generates: 0000000000401c90 : 401c90: f9400002 ldr x2, [x0] 401c94: b3751022 bfi x2, x1, #11, #5 401c98: f9000002 str x2, [x0] 401c9c: d65f03c0 ret 0000000000402660 : 402724: aa1403e3 mov x3, x20 402728: aa1503e2 mov x2, x21 40272c: aa1603e0 mov x0, x22 402730: aa1503e1 mov x1, x21 402734: 940060ff bl 41ab30 <_IO_printf> 402738: aa1403e1 mov x1, x20 40273c: 910183e0 add x0, sp, #0x60 402740: 97fffd54 bl 401c90 402744: aa1403e3 mov x3, x20 402748: aa1503e2 mov x2, x21 40274c: aa1503e1 mov x1, x21 402750: aa1603e0 mov x0, x22 402754: 940060f7 bl 41ab30 <_IO_printf> with the value stored in [sp + 0x60] ignored by both printf() above and in the test proper, resulting in a false failure due to vcpu_set_reg() simply storing the original value, not the intended value. $ ./vpmu_counter_access Random seed: 0x6b8b4567 orig = 3040, next = 3040, want = 0 orig = 3040, next = 3040, want = 0 ==== Test Assertion Failure ==== aarch64/vpmu_counter_access.c:505: pmcr_n == get_pmcr_n(pmcr) pid=71578 tid=71578 errno=9 - Bad file descriptor 1 0x400673: run_access_test at vpmu_counter_access.c:522 2 (inlined by) main at vpmu_counter_access.c:643 3 0x4132d7: __libc_start_call_main at libc-start.o:0 4 0x413653: __libc_start_main at ??:0 5 0x40106f: _start at ??:0 Failed to update PMCR.N to 0 (received: 6) Somewhat bizarrely, gcc-11 also exhibits the same behavior, but only if set_pmcr_n() is marked noinline, whereas gcc-13 fails even if set_pmcr_n() is inlined in its sole caller. Cc: stable@vger.kernel.org Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116912 Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 156fbfae940f..1896ef383cae 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -241,10 +241,10 @@ CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \ -Wno-gnu-variable-sized-type-not-at-end -MD -MP -DCONFIG_64BIT \ -fno-builtin-memcmp -fno-builtin-memcpy \ -fno-builtin-memset -fno-builtin-strnlen \ - -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \ - -I$(LINUX_TOOL_ARCH_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude \ - -I$( Date: Wed, 30 Oct 2024 21:53:33 -0700 Subject: KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported Force -march=x86-64-v2 to avoid SSE/AVX instructions if and only if the uarch definition is supported by the compiler, e.g. gcc 7.5 only supports x86-64. 
Fixes: 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") Cc: Vitaly Kuznetsov Reviewed-and-tested-by: Vitaly Kuznetsov Link: https://lore.kernel.org/r/20241031045333.1209195-1-seanjc@google.com Signed-off-by: Sean Christopherson --- tools/testing/selftests/kvm/Makefile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 1896ef383cae..48645a2e29da 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -249,8 +249,10 @@ ifeq ($(ARCH),s390) CFLAGS += -march=z10 endif ifeq ($(ARCH),x86) +ifeq ($(shell echo "void foo(void) { }" | $(CC) -march=x86-64-v2 -x c - -c -o /dev/null 2>/dev/null; echo "$$?"),0) CFLAGS += -march=x86-64-v2 endif +endif ifeq ($(ARCH),arm64) tools_dir := $(top_srcdir)/tools arm64_tools_dir := $(tools_dir)/arch/arm64/tools/ -- cgit From 2657b82a78f18528bef56dc1b017158490970873 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 31 Oct 2024 13:20:11 -0700 Subject: KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled When getting the current VPID, e.g. to emulate a guest TLB flush, return vpid01 if L2 is running but with VPID disabled, i.e. if VPID is disabled in vmcs12. Architecturally, if VPID is disabled, then the guest and host effectively share VPID=0. KVM emulates this behavior by using vpid01 when running an L2 with VPID disabled (see prepare_vmcs02_early_rare()), and so KVM must also treat vpid01 as the current VPID while L2 is active. Unconditionally treating vpid02 as the current VPID when L2 is active causes KVM to flush TLB entries for vpid02 instead of vpid01, which results in TLB entries from L1 being incorrectly preserved across nested VM-Enter to L2 (L2=>L1 isn't problematic, because the TLB flush after nested VM-Exit flushes vpid01). The bug manifests as failures in the vmx_apicv_test KVM-Unit-Test, as KVM incorrectly retains TLB entries for the APIC-access page across a nested VM-Enter. Opportunisticaly add comments at various touchpoints to explain the architectural requirements, and also why KVM uses vpid01 instead of vpid02. All credit goes to Chao, who root caused the issue and identified the fix. Link: https://lore.kernel.org/all/ZwzczkIlYGX+QXJz@intel.com Fixes: 2b4a5a5d5688 ("KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST") Cc: stable@vger.kernel.org Cc: Like Xu Debugged-by: Chao Gao Reviewed-by: Chao Gao Tested-by: Chao Gao Link: https://lore.kernel.org/r/20241031202011.1580522-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/vmx/nested.c | 30 +++++++++++++++++++++++++----- arch/x86/kvm/vmx/vmx.c | 2 +- 2 files changed, 26 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index a8e7bc04d9bf..931a7361c30f 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -1197,11 +1197,14 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu, kvm_hv_nested_transtion_tlb_flush(vcpu, enable_ept); /* - * If vmcs12 doesn't use VPID, L1 expects linear and combined mappings - * for *all* contexts to be flushed on VM-Enter/VM-Exit, i.e. it's a - * full TLB flush from the guest's perspective. This is required even - * if VPID is disabled in the host as KVM may need to synchronize the - * MMU in response to the guest TLB flush. + * If VPID is disabled, then guest TLB accesses use VPID=0, i.e. 
the + * same VPID as the host, and so architecturally, linear and combined + * mappings for VPID=0 must be flushed at VM-Enter and VM-Exit. KVM + * emulates L2 sharing L1's VPID=0 by using vpid01 while running L2, + * and so KVM must also emulate TLB flush of VPID=0, i.e. vpid01. This + * is required if VPID is disabled in KVM, as a TLB flush (there are no + * VPIDs) still occurs from L1's perspective, and KVM may need to + * synchronize the MMU in response to the guest TLB flush. * * Note, using TLB_FLUSH_GUEST is correct even if nested EPT is in use. * EPT is a special snowflake, as guest-physical mappings aren't @@ -2315,6 +2318,17 @@ static void prepare_vmcs02_early_rare(struct vcpu_vmx *vmx, vmcs_write64(VMCS_LINK_POINTER, INVALID_GPA); + /* + * If VPID is disabled, then guest TLB accesses use VPID=0, i.e. the + * same VPID as the host. Emulate this behavior by using vpid01 for L2 + * if VPID is disabled in vmcs12. Note, if VPID is disabled, VM-Enter + * and VM-Exit are architecturally required to flush VPID=0, but *only* + * VPID=0. I.e. using vpid02 would be ok (so long as KVM emulates the + * required flushes), but doing so would cause KVM to over-flush. E.g. + * if L1 runs L2 X with VPID12=1, then runs L2 Y with VPID12 disabled, + * and then runs L2 X again, then KVM can and should retain TLB entries + * for VPID12=1. + */ if (enable_vpid) { if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02) vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02); @@ -5950,6 +5964,12 @@ static int handle_invvpid(struct kvm_vcpu *vcpu) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); + /* + * Always flush the effective vpid02, i.e. never flush the current VPID + * and never explicitly flush vpid01. INVVPID targets a VPID, not a + * VMCS, and so whether or not the current vmcs12 has VPID enabled is + * irrelevant (and there may not be a loaded vmcs12). + */ vpid02 = nested_get_vpid02(vcpu); switch (type) { case VMX_VPID_EXTENT_INDIVIDUAL_ADDR: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 81ed596e4454..9886d67d9512 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -3216,7 +3216,7 @@ void vmx_flush_tlb_all(struct kvm_vcpu *vcpu) static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu) { - if (is_guest_mode(vcpu)) + if (is_guest_mode(vcpu) && nested_cpu_has_vpid(get_vmcs12(vcpu))) return nested_get_vpid02(vcpu); return to_vmx(vcpu)->vpid; } -- cgit From e5d253c60e9627a22940e00a05a6115d722f07ed Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 31 Oct 2024 13:32:14 -0700 Subject: KVM: SVM: Propagate error from snp_guest_req_init() to userspace If snp_guest_req_init() fails, return the provided error code up the stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2 failed, as opposed to some random operation later in VM setup failing because SNP wasn't actually enabled for the VM. Note, KVM itself doesn't consult the return value from __sev_guest_init(), i.e. the fallout is purely that userspace may be confused. 
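The change described here boils down to a standard pattern: forward the helper's errno-style return code up the stack instead of collapsing it into a generic failure. A minimal C sketch of that pattern, with hypothetical names rather than the actual KVM/SEV code:

    #include <errno.h>

    struct vm { int snp_active; };

    /* Stand-in for a setup helper that may fail with a specific errno. */
    static int guest_req_init(struct vm *vm)
    {
            return vm->snp_active ? 0 : -ENODEV;
    }

    /* Lossy: the specific error is dropped, the caller only ever sees -EINVAL. */
    static int vm_setup_lossy(struct vm *vm)
    {
            if (guest_req_init(vm))
                    return -EINVAL;
            return 0;
    }

    /* Propagating: return the helper's error code to the caller unchanged. */
    static int vm_setup(struct vm *vm)
    {
            int ret = guest_req_init(vm);

            if (ret)
                    return ret;
            return 0;
    }

With the propagating form, userspace sees the real reason (e.g. -ENODEV vs. -ENOMEM) when the command fails, which is the behavior the diff below implements for snp_guest_req_init().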
Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com Signed-off-by: Sean Christopherson --- arch/x86/kvm/svm/sev.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 0b851ef937f2..03c7db3b4cc3 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -450,8 +450,11 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp, goto e_free; /* This needs to happen after SEV/SNP firmware initialization. */ - if (vm_type == KVM_X86_SNP_VM && snp_guest_req_init(kvm)) - goto e_free; + if (vm_type == KVM_X86_SNP_VM) { + ret = snp_guest_req_init(kvm); + if (ret) + goto e_free; + } INIT_LIST_HEAD(&sev->regions_list); INIT_LIST_HEAD(&sev->mirror_vms); -- cgit From 10299cdde869abab7a42fb5ab905a47a4e2cd24e Mon Sep 17 00:00:00 2001 From: John Sperbeck Date: Tue, 5 Nov 2024 19:40:31 -0800 Subject: KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB In 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the kernel sources"), VMX_BASIC_MEM_TYPE_WB was removed. Use X86_MEMTYPE_WB instead. Fixes: 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the kernel sources") Signed-off-by: John Sperbeck Message-ID: <20241106034031.503291-1-jsperbeck@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index 089b8925b6b2..d7ac122820bf 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -200,7 +200,7 @@ static inline void init_vmcs_control_fields(struct vmx_pages *vmx) if (vmx->eptp_gpa) { uint64_t ept_paddr; struct eptPageTablePointer eptp = { - .memory_type = VMX_BASIC_MEM_TYPE_WB, + .memory_type = X86_MEMTYPE_WB, .page_walk_length = 3, /* + 1 */ .ad_enabled = ept_vpid_cap_supported(VMX_EPT_VPID_CAP_AD_BITS), .address = vmx->eptp_gpa >> PAGE_SHIFT_4K, -- cgit From e3a7792d96765ff435f3000e94619fcef2f6bfec Mon Sep 17 00:00:00 2001 From: Dionna Glaze Date: Tue, 5 Nov 2024 01:05:48 +0000 Subject: kvm: svm: Fix gctx page leak on invalid inputs Ensure that snp gctx page allocation is adequately deallocated on failure during snp_launch_start. Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command") CC: Sean Christopherson CC: Paolo Bonzini CC: Thomas Gleixner CC: Ingo Molnar CC: Borislav Petkov CC: Dave Hansen CC: Ashish Kalra CC: Tom Lendacky CC: John Allen CC: Herbert Xu CC: "David S. Miller" CC: Michael Roth CC: Luis Chamberlain CC: Russ Weight CC: Danilo Krummrich CC: Greg Kroah-Hartman CC: "Rafael J. 
Wysocki" CC: Tianfei zhang CC: Alexey Kardashevskiy Signed-off-by: Dionna Glaze Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 03c7db3b4cc3..fb854cf20ac3 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2215,10 +2215,6 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) if (sev->snp_context) return -EINVAL; - sev->snp_context = snp_context_create(kvm, argp); - if (!sev->snp_context) - return -ENOTTY; - if (params.flags) return -EINVAL; @@ -2233,6 +2229,10 @@ static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp) if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET) return -EINVAL; + sev->snp_context = snp_context_create(kvm, argp); + if (!sev->snp_context) + return -ENOTTY; + start.gctx_paddr = __psp_pa(sev->snp_context); start.policy = params.policy; memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw)); -- cgit From d3ddef46f22e8c3124e0df1f325bc6a18dadff39 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 5 Nov 2024 17:51:35 -0800 Subject: KVM: x86: Unconditionally set irr_pending when updating APICv state Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimate results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case. The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as the shortlog suggests, deleted code that updated irr_pending. Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with with the crucial difference that kvm_apic_update_apicv() did the scan even when APICv was being *disabled*, e.g. due to an AVIC inhibition. struct kvm_lapic *apic = vcpu->arch.apic; if (vcpu->arch.apicv_active) { /* irr_pending is always true when apicv is activated. */ apic->irr_pending = true; apic->isr_count = 1; } else { apic->irr_pending = (apic_search_irr(apic) != -1); apic->isr_count = count_vectors(apic->regs + APIC_ISR); } And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm: lapic: Introduce APICv update helper function"), prior to which KVM unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed that the new virtual APIC state could have a pending IRQ. Furthermore, in addition to introducing this issue, commit 755c2bf87860 also papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv as disabled prior to searching the IRR. Waiting until KVM emulates an EOI to update irr_pending "works", but only because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of memory barriers in between. I.e. leaving irr_pending set is basically hacking around bad ordering. 
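As background for the surrounding paragraphs, "searching the IRR" simply means scanning the 256-bit pending-interrupt bitmap for the highest set vector, and irr_pending caches whether that scan would find anything. A rough C sketch over a flat bitmap (KVM actually reads the vIRR out of the virtual-APIC page, so the name and layout here are illustrative only):

    #include <stdint.h>

    /* One bit per interrupt vector, 256 vectors total. */
    static int search_irr(const uint32_t irr[8])
    {
            int word, bit;

            /* Scan from the highest vector down: highest pending IRQ wins. */
            for (word = 7; word >= 0; word--) {
                    if (!irr[word])
                            continue;
                    for (bit = 31; bit >= 0; bit--) {
                            if (irr[word] & (1u << bit))
                                    return word * 32 + bit; /* pending vector */
                    }
            }
            return -1; /* nothing pending */
    }

Setting irr_pending to true unconditionally just tells KVM to redo this scan the next time it looks for an injectable interrupt, rather than trusting a cached "nothing pending" that may be stale.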
So, effectively revert to the pre-b26a695a1d78 behavior for state restore, even though it's sub-optimal if no IRQs are pending, in order to provide a minimal fix, but leave behind a FIXME to document the ugliness. With luck, the ordering issue will be fixed and the mess will be cleaned up in the not-too-distant future. Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it") Cc: stable@vger.kernel.org Cc: Maxim Levitsky Reported-by: Yong He Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com Signed-off-by: Sean Christopherson Message-ID: <20241106015135.2462147-1-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 2098dc689088..95c6beb8ce27 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; - if (apic->apicv_active) { - /* irr_pending is always true when apicv is activated. */ - apic->irr_pending = true; + /* + * When APICv is enabled, KVM must always search the IRR for a pending + * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU + * isn't running. If APICv is disabled, KVM _should_ search the IRR + * for a pending IRQ. But KVM currently doesn't ensure *all* hardware, + * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching + * the IRR at this time could race with IRQ delivery from hardware that + * still sees APICv as being enabled. + * + * FIXME: Ensure other vCPUs and devices observe the change in APICv + * state prior to updating KVM's metadata caches, so that KVM + * can safely search the IRR and set irr_pending accordingly. + */ + apic->irr_pending = true; + + if (apic->apicv_active) apic->isr_count = 1; - } else { - /* - * Don't clear irr_pending, searching the IRR can race with - * updates from the CPU as APICv is still active from hardware's - * perspective. The flag will be cleared as appropriate when - * KVM injects the interrupt. - */ + else apic->isr_count = count_vectors(apic->regs + APIC_ISR); - } + apic->highest_isr_cache = -1; } -- cgit From aa0d42cacf093a6fcca872edc954f6f812926a17 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 1 Nov 2024 11:50:30 -0700 Subject: KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are myriad bugs in the implementation, some of which are fatal to the guest, and others which put the stability and health of the host at risk. For guest fatalities, the most glaring issue is that KVM fails to ensure tracing is disabled, and *stays* disabled prior to VM-Enter, which is necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing is enabled (enforced via a VMX consistency check). Per the SDM: If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load IA32_RTIT_CTL" VM-entry control must be 0. On the host side, KVM doesn't validate the guest CPUID configuration provided by userspace, and even worse, uses the guest configuration to decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. 
configuring guest CPUID to enumerate more address ranges than are supported in hardware will result in KVM trying to passthrough, save, and load non-existent MSRs, which generates a variety of WARNs, ToPA ERRORs in the host, a potential deadlock, etc. Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode") Cc: stable@vger.kernel.org Cc: Adrian Hunter Signed-off-by: Sean Christopherson Reviewed-by: Xiaoyao Li Tested-by: Adrian Hunter Message-ID: <20241101185031.1799556-2-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9886d67d9512..d28618e9277e 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -217,9 +217,11 @@ module_param(ple_window_shrink, uint, 0444); static unsigned int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; module_param(ple_window_max, uint, 0444); -/* Default is SYSTEM mode, 1 for host-guest mode */ +/* Default is SYSTEM mode, 1 for host-guest mode (which is BROKEN) */ int __read_mostly pt_mode = PT_MODE_SYSTEM; +#ifdef CONFIG_BROKEN module_param(pt_mode, int, S_IRUGO); +#endif struct x86_pmu_lbr __ro_after_init vmx_lbr_caps; -- cgit From e16e018e8283cceda36b5e61ab454fdccbd516ca Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 23 Oct 2024 14:45:03 +0200 Subject: KVM: powerpc: remove remaining traces of KVM_CAP_PPC_RMA This was only needed for PPC970 support, which is long gone: the implementation was removed in 2014. Signed-off-by: Paolo Bonzini Message-ID: <20241023124507.280382-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/api.rst | 36 ------------------------------------ arch/powerpc/kvm/powerpc.c | 3 --- 2 files changed, 39 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index edc070c6e19b..16fe2fc6348c 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -2170,42 +2170,6 @@ userspace update the TCE table directly which is useful in some circumstances. -4.63 KVM_ALLOCATE_RMA ---------------------- - -:Capability: KVM_CAP_PPC_RMA -:Architectures: powerpc -:Type: vm ioctl -:Parameters: struct kvm_allocate_rma (out) -:Returns: file descriptor for mapping the allocated RMA - -This allocates a Real Mode Area (RMA) from the pool allocated at boot -time by the kernel. An RMA is a physically-contiguous, aligned region -of memory used on older POWER processors to provide the memory which -will be accessed by real-mode (MMU off) accesses in a KVM guest. -POWER processors support a set of sizes for the RMA that usually -includes 64MB, 128MB, 256MB and some larger powers of two. - -:: - - /* for KVM_ALLOCATE_RMA */ - struct kvm_allocate_rma { - __u64 rma_size; - }; - -The return value is a file descriptor which can be passed to mmap(2) -to map the allocated RMA into userspace. The mapped area can then be -passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the -RMA for a virtual machine. The size of the RMA in bytes (which is -fixed at host kernel boot time) is returned in the rma_size field of -the argument structure. - -The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl -is supported; 2 if the processor requires all virtual machines to have -an RMA, or 1 if the processor can use an RMA but doesn't require it, -because it supports the Virtual RMA (VRMA) facility. 
- - 4.64 KVM_NMI ------------ diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index f14329989e9a..76446604332c 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -612,9 +612,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = 8 | 4 | 2 | 1; } break; - case KVM_CAP_PPC_RMA: - r = 0; - break; case KVM_CAP_PPC_HWRNG: r = kvmppc_hwrng_present(); break; -- cgit From aae7527ea91a83db47880c967cdceb6d1fcfdc05 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 23 Oct 2024 14:45:04 +0200 Subject: Documentation: kvm: fix a few mistakes The only occurrence "Capability: none" actually meant the same as "basic". Fix that and a few more aesthetic or content issues in the document. Signed-off-by: Paolo Bonzini Message-ID: <20241023124507.280382-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/api.rst | 48 +++++++++++++++++++++++++++++------------- 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 16fe2fc6348c..0a2052c5acf7 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -96,12 +96,9 @@ description: Capability: which KVM extension provides this ioctl. Can be 'basic', which means that is will be provided by any kernel that supports - API version 12 (see section 4.1), a KVM_CAP_xyz constant, which + API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which means availability needs to be checked with KVM_CHECK_EXTENSION - (see section 4.4), or 'none' which means that while not all kernels - support this ioctl, there's no capability bit to check its - availability: for kernels that don't support the ioctl, - the ioctl returns -ENOTTY. + (see section 4.4). Architectures: which instruction set architectures provide this ioctl. @@ -338,8 +335,8 @@ KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual cpu's hardware control block. -4.8 KVM_GET_DIRTY_LOG (vm ioctl) --------------------------------- +4.8 KVM_GET_DIRTY_LOG +--------------------- :Capability: basic :Architectures: all @@ -1298,7 +1295,7 @@ See KVM_GET_VCPU_EVENTS for the data structure. :Capability: KVM_CAP_DEBUGREGS :Architectures: x86 -:Type: vm ioctl +:Type: vcpu ioctl :Parameters: struct kvm_debugregs (out) :Returns: 0 on success, -1 on error @@ -1320,7 +1317,7 @@ Reads debug registers from the vcpu. :Capability: KVM_CAP_DEBUGREGS :Architectures: x86 -:Type: vm ioctl +:Type: vcpu ioctl :Parameters: struct kvm_debugregs (in) :Returns: 0 on success, -1 on error @@ -2116,8 +2113,8 @@ TLB, prior to calling KVM_RUN on the associated vcpu. The "bitmap" field is the userspace address of an array. This array consists of a number of bits, equal to the total number of TLB entries as -determined by the last successful call to KVM_CONFIG_TLB, rounded up to the -nearest multiple of 64. +determined by the last successful call to ``KVM_ENABLE_CAP(KVM_CAP_SW_TLB)``, +rounded up to the nearest multiple of 64. Each bit corresponds to one TLB entry, ordered the same as in the shared TLB array. @@ -3557,6 +3554,27 @@ Errors: This ioctl returns the guest registers that are supported for the KVM_GET_ONE_REG/KVM_SET_ONE_REG calls. +Note that s390 does not support KVM_GET_REG_LIST for historical reasons +(read: nobody cared). 
The set of registers in kernels 4.x and newer is: + +- KVM_REG_S390_TODPR + +- KVM_REG_S390_EPOCHDIFF + +- KVM_REG_S390_CPU_TIMER + +- KVM_REG_S390_CLOCK_COMP + +- KVM_REG_S390_PFTOKEN + +- KVM_REG_S390_PFCOMPARE + +- KVM_REG_S390_PFSELECT + +- KVM_REG_S390_PP + +- KVM_REG_S390_GBEA + 4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated) ----------------------------------------- @@ -4920,8 +4938,8 @@ Coalesced pio is based on coalesced mmio. There is little difference between coalesced mmio and pio except that coalesced pio records accesses to I/O ports. -4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl) ------------------------------------- +4.117 KVM_CLEAR_DIRTY_LOG +------------------------- :Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 :Architectures: x86, arm64, mips @@ -5230,7 +5248,7 @@ the cpu reset definition in the POP (Principles Of Operation). 4.123 KVM_S390_INITIAL_RESET ---------------------------- -:Capability: none +:Capability: basic :Architectures: s390 :Type: vcpu ioctl :Parameters: none @@ -6169,7 +6187,7 @@ applied. .. _KVM_ARM_GET_REG_WRITABLE_MASKS: 4.139 KVM_ARM_GET_REG_WRITABLE_MASKS -------------------------------------------- +------------------------------------ :Capability: KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES :Architectures: arm64 -- cgit From badd5372ecccc5735009857357f8ec0069f25a8b Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 23 Oct 2024 14:45:05 +0200 Subject: Documentation: kvm: replace section numbers with links In order to simplify further introduction of hyperlinks, replace explicit section numbers with rST hyperlinks. The section numbers could actually be removed now, but I'm not going to do a huge change throughout the file for an RFC... Signed-off-by: Paolo Bonzini Message-ID: <20241023124507.280382-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++++++++------------ 1 file changed, 28 insertions(+), 12 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 0a2052c5acf7..7b42bb4d420d 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -96,9 +96,9 @@ description: Capability: which KVM extension provides this ioctl. Can be 'basic', which means that is will be provided by any kernel that supports - API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which - means availability needs to be checked with KVM_CHECK_EXTENSION - (see section 4.4). + API version 12 (see :ref:`KVM_GET_API_VERSION `), + or a KVM_CAP_xyz constant that can be checked with + :ref:`KVM_CHECK_EXTENSION `. Architectures: which instruction set architectures provide this ioctl. @@ -115,6 +115,8 @@ description: are not detailed, but errors with specific meanings are. +.. _KVM_GET_API_VERSION: + 4.1 KVM_GET_API_VERSION ----------------------- @@ -243,6 +245,8 @@ This list also varies by kvm version and host processor, but does not change otherwise. +.. _KVM_CHECK_EXTENSION: + 4.4 KVM_CHECK_EXTENSION ----------------------- @@ -285,7 +289,7 @@ the VCPU file descriptor can be mmap-ed, including: - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on - KVM_CAP_DIRTY_LOG_RING, see section 8.3. + KVM_CAP_DIRTY_LOG_RING, see :ref:`KVM_CAP_DIRTY_LOG_RING`. 4.7 KVM_CREATE_VCPU @@ -1426,6 +1430,8 @@ because of a quirk in the virtualization implementation (see the internals documentation when it pops into existence). +.. 
_KVM_ENABLE_CAP: + 4.37 KVM_ENABLE_CAP ------------------- @@ -2563,7 +2569,7 @@ Specifically: ======================= ========= ===== ======================================= .. [1] These encodings are not accepted for SVE-enabled vcpus. See - KVM_ARM_VCPU_INIT. + :ref:`KVM_ARM_VCPU_INIT`. The equivalent register content can be accessed via bits [127:0] of the corresponding SVE Zn registers instead for vcpus that have SVE @@ -5075,8 +5081,8 @@ Recognised values for feature: Finalizes the configuration of the specified vcpu feature. The vcpu must already have been initialised, enabling the affected feature, by -means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in -features[]. +means of a successful :ref:`KVM_ARM_VCPU_INIT ` call with the +appropriate flag set in features[]. For affected vcpu features, this is a mandatory step that must be performed before the vcpu is fully usable. @@ -6425,6 +6431,8 @@ the capability to be present. `flags` must currently be zero. +.. _kvm_run: + 5. The kvm_run structure ======================== @@ -7144,11 +7152,15 @@ primary storage for certain register types. Therefore, the kernel may use the values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set. +.. _cap_enable: + 6. Capabilities that can be enabled on vCPUs ============================================ There are certain capabilities that change the behavior of the virtual CPU or -the virtual machine when enabled. To enable them, please see section 4.37. +the virtual machine when enabled. To enable them, please see +:ref:`KVM_ENABLE_CAP`. + Below you can find a list of capabilities and what their effect on the vCPU or the virtual machine is when enabling them. @@ -7357,7 +7369,7 @@ KVM API and also from the guest. sets are supported (bitfields defined in arch/x86/include/uapi/asm/kvm.h). -As described above in the kvm_sync_regs struct info in section 5 (kvm_run): +As described above in the kvm_sync_regs struct info in section :ref:`kvm_run`, KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers without having to call SET/GET_*REGS". This reduces overhead by eliminating repeated ioctl calls for setting and/or getting register values. This is @@ -7403,13 +7415,15 @@ Unused bitfields in the bitarrays must be set to zero. This capability connects the vcpu to an in-kernel XIVE device. +.. _cap_enable_vm: + 7. Capabilities that can be enabled on VMs ========================================== There are certain capabilities that change the behavior of the virtual -machine when enabled. To enable them, please see section 4.37. Below -you can find a list of capabilities and what their effect on the VM -is when enabling them. +machine when enabled. To enable them, please see section +:ref:`KVM_ENABLE_CAP`. Below you can find a list of capabilities and +what their effect on the VM is when enabling them. The following information is provided along with the description: @@ -8570,6 +8584,8 @@ guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf (0x40000001). Otherwise, a guest may use the paravirtual features regardless of what has actually been exposed through the CPUID leaf. +.. 
_KVM_CAP_DIRTY_LOG_RING: + 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL ---------------------------------------------------------- -- cgit From 5b47f5a72574237ba171e795dcaa173abc9d6d9d Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 23 Oct 2024 14:45:06 +0200 Subject: Documentation: kvm: reorganize introduction Reorganize the text to mention file descriptors as early as possible. Also mention capabilities early as they are a central part of KVM's API. Signed-off-by: Paolo Bonzini Message-ID: <20241023124507.280382-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/api.rst | 38 +++++++++++++++++++++++++------------- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 7b42bb4d420d..497528fe91db 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -7,8 +7,19 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation 1. General description ====================== -The kvm API is a set of ioctls that are issued to control various aspects -of a virtual machine. The ioctls belong to the following classes: +The kvm API is centered around different kinds of file descriptors +and ioctls that can be issued to these file descriptors. An initial +open("/dev/kvm") obtains a handle to the kvm subsystem; this handle +can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this +handle will create a VM file descriptor which can be used to issue VM +ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will +create a virtual cpu or device and return a file descriptor pointing to +the new resource. + +In other words, the kvm API is a set of ioctls that are issued to +different kinds of file descriptor in order to control various aspects of +a virtual machine. Depending on the file descriptor that accepts them, +ioctls belong to the following classes: - System ioctls: These query and set global attributes which affect the whole kvm subsystem. In addition a system ioctl is used to create @@ -35,18 +46,19 @@ of a virtual machine. The ioctls belong to the following classes: device ioctls must be issued from the same process (address space) that was used to create the VM. -2. File descriptors -=================== +While most ioctls are specific to one kind of file descriptor, in some +cases the same ioctl can belong to more than one class. -The kvm API is centered around file descriptors. An initial -open("/dev/kvm") obtains a handle to the kvm subsystem; this handle -can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this -handle will create a VM file descriptor which can be used to issue VM -ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will -create a virtual cpu or device and return a file descriptor pointing to -the new resource. Finally, ioctls on a vcpu or device fd can be used -to control the vcpu or device. For vcpus, this includes the important -task of actually running guest code. +The KVM API grew over time. For this reason, KVM defines many constants +of the form ``KVM_CAP_*``, each corresponding to a set of functionality +provided by one or more ioctls. Availability of these "capabilities" can +be checked with :ref:`KVM_CHECK_EXTENSION `. Some +capabilities also need to be enabled for VMs or VCPUs where their +functionality is desired (see :ref:`cap_enable` and :ref:`cap_enable_vm`). + + +2. 
Restrictions +=============== In general file descriptors can be migrated among processes by means of fork() and the SCM_RIGHTS facility of unix domain socket. These -- cgit
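Taken together, the introduction above describes a three-level file-descriptor hierarchy. A minimal C sketch of that flow (error handling mostly elided; assumes a host that exposes /dev/kvm and grants the caller access to it):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    int main(void)
    {
            /* System fd: handle to the KVM subsystem, target of system ioctls. */
            int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
            if (kvm < 0 || ioctl(kvm, KVM_GET_API_VERSION, 0) != 12)
                    return 1;

            /* Capabilities are probed with KVM_CHECK_EXTENSION. */
            if (!ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY))
                    return 1;

            /* VM fd: target of VM ioctls (memory regions, device creation, ...). */
            int vm = ioctl(kvm, KVM_CREATE_VM, 0);
            if (vm < 0)
                    return 1;

            /* vCPU fd: target of vCPU ioctls, including KVM_RUN. */
            int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
            if (vcpu < 0)
                    return 1;

            printf("kvm fd=%d, vm fd=%d, vcpu fd=%d\n", kvm, vm, vcpu);

            close(vcpu);
            close(vm);
            close(kvm);
            return 0;
    }

The VM and vCPU descriptors returned here are the objects whose ioctls the rest of the document groups into the "VM ioctl" and "vcpu ioctl" classes.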