summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-09-09perf trace augmented_syscalls.bpf: Move the renameat aumenter to renameat2, ↵Arnaldo Carvalho de Melo
temporarily While trying to shape Howard Chu's generic BPF augmenter transition into the codebase I got stuck with the renameat2 syscall. Until I noticed that the attempt at reusing augmenters were making it use the 'openat' syscall augmenter, that collect just one string syscall arg, for the 'renameat2' syscall, that takes two strings. So, for the moment, just to help in this transition period, since 'renameat2' is what is used these days in the 'mv' utility, just make the BPF collector be associated with the more widely used syscall, hopefully the transition to Howard's generic BPF augmenter will cure this, so get this out of the way for now! So now we still have that odd "reuse", but for something we're not testing so won't get in the way anymore: root@number:~# rm -f 987654 ; touch 123456 ; perf trace -vv -e rename* mv 123456 987654 |& grep renameat Reusing "openat" BPF sys_enter augmenter for "renameat" 0.000 ( 0.079 ms): mv/1158612 renameat2(olddfd: CWD, oldname: "123456", newdfd: CWD, newname: "987654", flags: NOREPLACE) = 0 root@number:~# Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/CAP-5=fXjGYs=tpBgETK-P9U-CuXssytk9pSnTXpfphrmmOydWA@mail.gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-09-09mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu ↵Liam R. Howlett
read lock Traversing VMAs of a given maple tree should be protected by rcu read lock. However, __damon_va_three_regions() is not doing the protection. Hold the lock. Link: https://lkml.kernel.org/r/20240905001204.1481-1-sj@kernel.org Fixes: d0cf3dd47f0d ("damon: convert __damon_va_three_regions to use the VMA iterator") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/b83651a0-5b24-4206-b860-cb54ffdf209b@roeck-us.net Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09mm: vmscan.c: fix OOM on swap stress testChris Li
I found a regression on mm-unstable during my swap stress test, using tmpfs to compile linux. The test OOM very soon after the make spawns many cc processes. It bisects down to this change: 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9 (mm/gup: clear the LRU flag of a page before adding to LRU batch) Yu Zhao propose the fix: "I think this is one of the potential side effects -- Huge mentioned earlier about isolate_lru_folios():" I test that with it the swap stress test no longer OOM. Link: https://lore.kernel.org/r/CAOUHufYi9h0kz5uW3LHHS3ZrVwEq-kKp8S6N-MZUmErNAXoXmw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240905-lru-flag-v2-1-8a2d9046c594@kernel.org Fixes: 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch") Signed-off-by: Chris Li <chrisl@kernel.org> Suggested-by: Yu Zhao <yuzhao@google.com> Suggested-by: Hugh Dickins <hughd@google.com> Closes: https://lore.kernel.org/all/CAF8kJuNP5iTj2p07QgHSGOJsiUfYpJ2f4R1Q5-3BN9JiD9W_KA@mail.gmail.com/ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09ocfs2: cancel dqi_sync_work before freeing oinfoJoseph Qi
ocfs2_global_read_info() will initialize and schedule dqi_sync_work at the end, if error occurs after successfully reading global quota, it will trigger the following warning with CONFIG_DEBUG_OBJECTS_* enabled: ODEBUG: free active (active state 0) object: 00000000d8b0ce28 object type: timer_list hint: qsync_work_fn+0x0/0x16c This reports that there is an active delayed work when freeing oinfo in error handling, so cancel dqi_sync_work first. BTW, return status instead of -1 when .read_file_info fails. Link: https://syzkaller.appspot.com/bug?extid=f7af59df5d6b25f0febd Link: https://lkml.kernel.org/r/20240904071004.2067695-1-joseph.qi@linux.alibaba.com Fixes: 171bf93ce11f ("ocfs2: Periodic quota syncing") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Reported-by: syzbot+f7af59df5d6b25f0febd@syzkaller.appspotmail.com Tested-by: syzbot+f7af59df5d6b25f0febd@syzkaller.appspotmail.com Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodateLizhi Xu
When doing cleanup, if flags without OCFS2_BH_READAHEAD, it may trigger NULL pointer dereference in the following ocfs2_set_buffer_uptodate() if bh is NULL. Link: https://lkml.kernel.org/r/20240902023636.1843422-3-joseph.qi@linux.alibaba.com Fixes: cf76c78595ca ("ocfs2: don't put and assigning null to bh allocated outside") Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Heming Zhao <heming.zhao@suse.com> Suggested-by: Heming Zhao <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> [4.20+] Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09ocfs2: remove unreasonable unlock in ocfs2_read_blocksLizhi Xu
Patch series "Misc fixes for ocfs2_read_blocks", v5. This series contains 2 fixes for ocfs2_read_blocks(). The first patch fix the issue reported by syzbot, which detects bad unlock balance in ocfs2_read_blocks(). The second patch fixes an issue reported by Heming Zhao when reviewing above fix. This patch (of 2): There was a lock release before exiting, so remove the unreasonable unlock. Link: https://lkml.kernel.org/r/20240902023636.1843422-1-joseph.qi@linux.alibaba.com Link: https://lkml.kernel.org/r/20240902023636.1843422-2-joseph.qi@linux.alibaba.com Fixes: cf76c78595ca ("ocfs2: don't put and assigning null to bh allocated outside") Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: syzbot+ab134185af9ef88dfed5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ab134185af9ef88dfed5 Tested-by: syzbot+ab134185af9ef88dfed5@syzkaller.appspotmail.com Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> [4.20+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09ocfs2: fix null-ptr-deref when journal load failed.Julian Sun
During the mounting process, if journal_reset() fails because of too short journal, then lead to jbd2_journal_load() fails with NULL j_sb_buffer. Subsequently, ocfs2_journal_shutdown() calls jbd2_journal_flush()->jbd2_cleanup_journal_tail()-> __jbd2_update_log_tail()->jbd2_journal_update_sb_log_tail() ->lock_buffer(journal->j_sb_buffer), resulting in a null-pointer dereference error. To resolve this issue, we should check the JBD2_LOADED flag to ensure the journal was properly loaded. Additionally, use journal instead of osb->journal directly to simplify the code. Link: https://syzkaller.appspot.com/bug?extid=05b9b39d8bdfe1a0861f Link: https://lkml.kernel.org/r/20240902030844.422725-1-sunjunchao2870@gmail.com Fixes: f6f50e28f0cb ("jbd2: Fail to load a journal if it is too short") Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reported-by: syzbot+05b9b39d8bdfe1a0861f@syzkaller.appspotmail.com Suggested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09clk: rockchip: remove unused mclk_pdm0_p/pdm0_p definitionsArnd Bergmann
When -Wunused-const-variable is enabled (not the default), there is a warning about two definitions in this file: In file included from drivers/clk/rockchip/clk-rk3576.c:14: drivers/clk/rockchip/clk-rk3576.c:334:7: error: 'mclk_pdm0_p' defined but not used [-Werror=unused-const-variable=] 334 | PNAME(mclk_pdm0_p) = { "mclk_pdm0_src_top", "xin24m" }; | ^~~~~~~~~~~ drivers/clk/rockchip/clk.h:564:43: note: in definition of macro 'PNAME' 564 | #define PNAME(x) static const char *const x[] __initconst | ^ drivers/clk/rockchip/clk-rk3576.c:333:7: error: 'pdm0_p' defined but not used [-Werror=unused-const-variable=] 333 | PNAME(pdm0_p) = { "clk_pdm0_src_top", "xin24m" }; | ^~~~~~ drivers/clk/rockchip/clk.h:564:43: note: in definition of macro 'PNAME' 564 | #define PNAME(x) static const char *const x[] __initconst | ^ Remove them for the moment. If they are needed later, they can be added back at that point. Fixes: cc40f5baa91b ("clk: rockchip: Add clock controller for the RK3576") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20240909121116.254036-1-arnd@kernel.org Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-09-09clk: qcom: clk-alpha-pll: Simplify the zonda_pll_adjust_l_val()Satya Priya Kakitapalli
In zonda_pll_adjust_l_val() replace the divide operator with comparison operator to fix below build error and smatch warning. drivers/clk/qcom/clk-alpha-pll.o: In function `clk_zonda_pll_set_rate': clk-alpha-pll.c:(.text+0x45dc): undefined reference to `__aeabi_uldivmod' smatch warnings: drivers/clk/qcom/clk-alpha-pll.c:2129 zonda_pll_adjust_l_val() warn: replace divide condition '(remainder * 2) / prate' with '(remainder * 2) >= prate' Fixes: f4973130d255 ("clk: qcom: clk-alpha-pll: Update set_rate for Zonda PLL") Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202408110724.8pqbpDiD-lkp@intel.com/ Signed-off-by: Satya Priya Kakitapalli <quic_skakitap@quicinc.com> Link: https://lore.kernel.org/r/20240906113905.641336-1-quic_skakitap@quicinc.com Reviewed-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2024-09-09ASoC: mt8365: Fix -Werror buildsMark Brown
Merge series from Mark Brown <broonie@kernel.org>: Nathan reported that the newly added mt8365 drivers were causing a number of warnings which break -Werror builds, these were only visible on arm64 since the drivers did not have COMPILE_TEST enabled. Fix this and some other minor stuff I noticed while doing so.
2024-09-09ASoC: loongson: Simplify code formattingMark Brown
Merge series from Binbin Zhou <zhoubinbin@loongson.cn>: This patchset attempts to improve code readability by simplifying code formatting. No functional changes.
2024-09-09Merge tag 'v6.12-rockchip-clk1' of ↵Stephen Boyd
git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into clk-rockchip Pull Rockchip clk driver updates from Heiko Stübner: - Get rid of CLK_NR_CLKS defines in Rockchip DT binding headers - New clock controller driver for the rk3576 - Some fixes for rk3228 and rk3588 * tag 'v6.12-rockchip-clk1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: clk: rockchip: fix error for unknown clocks clk: rockchip: rk3588: drop unused code clk: rockchip: Add clock controller for the RK3576 clk: rockchip: Add new pll type pll_rk3588_ddr dt-bindings: clock, reset: Add support for rk3576 dt-bindings: clock: rockchip,rk3588-cru: drop unneeded assigned-clocks clk: rockchip: rk3588: Fix 32k clock name for pmu_24m_32k_100m_src_p dt-bindings: clock: rockchip: remove CLK_NR_CLKS and CLKPMU_NR_CLKS clk: rockchip: rk3399: Drop CLK_NR_CLKS CLKPMU_NR_CLKS usage clk: rockchip: rk3368: Drop CLK_NR_CLKS usage clk: rockchip: rk3328: Drop CLK_NR_CLKS usage clk: rockchip: rk3308: Drop CLK_NR_CLKS usage clk: rockchip: rk3288: Drop CLK_NR_CLKS usage clk: rockchip: rk3228: Drop CLK_NR_CLKS usage clk: rockchip: rk3036: Drop CLK_NR_CLKS usage clk: rockchip: px30: Drop CLK_NR_CLKS CLKPMU_NR_CLKS usage clk: rockchip: Set parent rate for DCLK_VOP clock on RK3228
2024-09-09idpf: enable WB_ON_ITRJoshua Hay
Tell hardware to write back completed descriptors even when interrupts are disabled. Otherwise, descriptors might not be written back until the hardware can flush a full cacheline of descriptors. This can cause unnecessary delays when traffic is light (or even trigger Tx queue timeout). The example scenario to reproduce the Tx timeout if the fix is not applied: - configure at least 2 Tx queues to be assigned to the same q_vector, - generate a huge Tx traffic on the first Tx queue - try to send a few packets using the second Tx queue. In such a case Tx timeout will appear on the second Tx queue because no completion descriptors are written back for that queue while interrupts are disabled due to NAPI polling. Fixes: c2d548cad150 ("idpf: add TX splitq napi poll support") Fixes: a5ab9ee0df0b ("idpf: add singleq start_xmit and napi poll") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09idpf: fix netdev Tx queue stop/wakeMichal Kubiak
netif_txq_maybe_stop() returns -1, 0, or 1, while idpf_tx_maybe_stop_common() says it returns 0 or -EBUSY. As a result, there sometimes are Tx queue timeout warnings despite that the queue is empty or there is at least enough space to restart it. Make idpf_tx_maybe_stop_common() inline and returning true or false, handling the return of netif_txq_maybe_stop() properly. Use a correct goto in idpf_tx_maybe_stop_splitq() to avoid stopping the queue or incrementing the stops counter twice. Fixes: 6818c4d5b3c2 ("idpf: add splitq start_xmit") Fixes: a5ab9ee0df0b ("idpf: add singleq start_xmit and napi poll") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09idpf: refactor Tx completion routinesJoshua Hay
Add a mechanism to guard against stashing partial packets into the hash table to make the driver more robust, with more efficient decision making when cleaning. Don't stash partial packets. This can happen when an RE (Report Event) completion is received in flow scheduling mode, or when an out of order RS (Report Status) completion is received. The first buffer with the skb is stashed, but some or all of its frags are not because the stack is out of reserve buffers. This leaves the ring in a weird state since the frags are still on the ring. Use the field libeth_sqe::nr_frags to track the number of fragments/tx_bufs representing the packet. The clean routines check to make sure there are enough reserve buffers on the stack before stashing any part of the packet. If there are not, next_to_clean is left pointing to the first buffer of the packet that failed to be stashed. This leaves the whole packet on the ring, and the next time around, cleaning will start from this packet. An RS completion is still expected for this packet in either case. So instead of being cleaned from the hash table, it will be cleaned from the ring directly. This should all still be fine since the DESC_UNUSED and BUFS_UNUSED will reflect the state of the ring. If we ever fall below the thresholds, the TxQ will still be stopped, giving the completion queue time to catch up. This may lead to stopping the queue more frequently, but it guarantees the Tx ring will always be in a good state. Also, always use the idpf_tx_splitq_clean function to clean descriptors, i.e. use it from clean_buf_ring as well. This way we avoid duplicating the logic and make sure we're using the same reserve buffers guard rail. This does require a switch from the s16 next_to_clean overflow descriptor ring wrap calculation to u16 and the normal ring size check. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09netdevice: add netdev_tx_reset_subqueue() shorthandAlexander Lobakin
Add a shorthand similar to other net*_subqueue() helpers for resetting the queue by its index w/o obtaining &netdev_tx_queue beforehand manually. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09idpf: convert to libeth Tx buffer completionAlexander Lobakin
&idpf_tx_buffer is almost identical to the previous generations, as well as the way it's handled. Moreover, relying on dma_unmap_addr() and !!buf->skb instead of explicit defining of buffer's type was never good. Use the newly added libeth helpers to do it properly and reduce the copy-paste around the Tx code. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09libeth: add Tx buffer completion helpersAlexander Lobakin
Software-side Tx buffers for storing DMA, frame size, skb pointers etc. are pretty much generic and every driver defines them the same way. The same can be said for software Tx completions -- same napi_consume_skb()s and all that... Add a couple simple wrappers for doing that to stop repeating the old tale at least within the Intel code. Drivers are free to use 'priv' member at the end of the structure. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-09-09regulator: tps6287x: Constify struct regulator_descChristophe JAILLET
'struct regulator_desc' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 4974 736 16 5726 165e drivers/regulator/tps6287x-regulator.o After: ===== text data bss dec hex filename 5294 416 16 5726 165e drivers/regulator/tps6287x-regulator.o -- Compile tested only Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/7727e493490d37775a653905dfe0cc1d8478f8e0.1725908163.git.christophe.jaillet@wanadoo.fr Signed-off-by: Mark Brown <broonie@kernel.org>
2024-09-09regulator: wm8400: Constify struct regulator_descChristophe JAILLET
'struct regulator_desc' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 4419 2512 0 6931 1b13 drivers/regulator/wm8400-regulator.o After: ===== text data bss dec hex filename 6307 624 0 6931 1b13 drivers/regulator/wm8400-regulator.o -- Compile tested only Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://patch.msgid.link/fde33ecfd9bbdbdc1da1620c9f3b1b7a72f9d805.1725906876.git.christophe.jaillet@wanadoo.fr Signed-off-by: Mark Brown <broonie@kernel.org>
2024-09-09tracing: Drop unused helper function to fix the buildAndy Shevchenko
A helper function defined but not used. This, in particular, prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function] 2229 | static inline int run_tracer_selftest(struct tracer *type) | ^~~~~~~~~~~~~~~~~~~ Fix this by dropping unused functions. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Link: https://lore.kernel.org/20240909105314.928302-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09tracing/osnoise: Fix build when timerlat is not enabledSteven Rostedt
To fix some critical section races, the interface_lock was added to a few locations. One of those locations was above where the interface_lock was declared, so the declaration was moved up before that usage. Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER ifdef block. As the interface_lock is used outside that config, this broke the build when CONFIG_OSNOISE_TRACER was enabled but CONFIG_TIMERLAT_TRACER was not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Helena Anna" <helena.anna.dubel@intel.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Link: https://lore.kernel.org/20240909103231.23a289e2@gandalf.local.home Fixes: e6a53481da29 ("tracing/timerlat: Only clear timer if a kthread exists") Reported-by: "Bityutskiy, Artem" <artem.bityutskiy@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-09-09net/mlx5: Fix bridge mode operations when there are no VFsBenjamin Poirier
Currently, trying to set the bridge mode attribute when numvfs=0 leads to a crash: bridge link set dev eth2 hwmode vepa [ 168.967392] BUG: kernel NULL pointer dereference, address: 0000000000000030 [...] [ 168.969989] RIP: 0010:mlx5_add_flow_rules+0x1f/0x300 [mlx5_core] [...] [ 168.976037] Call Trace: [ 168.976188] <TASK> [ 168.978620] _mlx5_eswitch_set_vepa_locked+0x113/0x230 [mlx5_core] [ 168.979074] mlx5_eswitch_set_vepa+0x7f/0xa0 [mlx5_core] [ 168.979471] rtnl_bridge_setlink+0xe9/0x1f0 [ 168.979714] rtnetlink_rcv_msg+0x159/0x400 [ 168.980451] netlink_rcv_skb+0x54/0x100 [ 168.980675] netlink_unicast+0x241/0x360 [ 168.980918] netlink_sendmsg+0x1f6/0x430 [ 168.981162] ____sys_sendmsg+0x3bb/0x3f0 [ 168.982155] ___sys_sendmsg+0x88/0xd0 [ 168.985036] __sys_sendmsg+0x59/0xa0 [ 168.985477] do_syscall_64+0x79/0x150 [ 168.987273] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 168.987773] RIP: 0033:0x7f8f7950f917 (esw->fdb_table.legacy.vepa_fdb is null) The bridge mode is only relevant when there are multiple functions per port. Therefore, prevent setting and getting this setting when there are no VFs. Note that after this change, there are no settings to change on the PF interface using `bridge link` when there are no VFs, so the interface no longer appears in the `bridge link` output. Fixes: 4b89251de024 ("net/mlx5: Support ndo bridge_setlink and getlink") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Verify support for scheduling element and TSAR typeCarolina Jubran
Before creating a scheduling element in a NIC or E-Switch scheduler, ensure that the requested element type is supported. If the element is of type Transmit Scheduling Arbiter (TSAR), also verify that the specific TSAR type is supported. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Fixes: 85c5f7c9200e ("net/mlx5: E-switch, Create QoS on demand") Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Add missing masks and QoS bit masks for scheduling elementsCarolina Jubran
Add the missing masks for supported element types and Transmit Scheduling Arbiter (TSAR) types in scheduling elements. Also, add the corresponding bit masks for these types in the QoS capabilities of a NIC scheduler. Fixes: 214baf22870c ("net/mlx5e: Support HTB offload") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Explicitly set scheduling element and TSAR typeCarolina Jubran
Ensure the scheduling element type and TSAR type are explicitly initialized in the QoS rate group creation. This prevents potential issues due to default values. Fixes: 1ae258f8b343 ("net/mlx5: E-switch, Introduce rate limiting groups API") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5e: Add missing link mode to ptys2ext_ethtool_mapShahar Shitrit
Add MLX5E_400GAUI_8_400GBASE_CR8 to the extended modes in ptys2ext_ethtool_table, since it was missing. Fixes: 6a897372417e ("net/mlx5: ethtool, Add ethtool support for 50Gbps per lane link modes") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5e: Add missing link modes to ptys2ethtool_mapShahar Shitrit
Add MLX5E_1000BASE_T and MLX5E_100BASE_TX to the legacy modes in ptys2legacy_ethtool_table, since they were missing. Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: Update the list of the PCI supported devicesMaher Sanalla
Add the upcoming ConnectX-9 device ID to the table of supported PCI device IDs. Fixes: f908a35b2218 ("net/mlx5: Update the list of the PCI supported devices") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09RDMA/bnxt_re: Fix the max WQE size for static WQE supportSelvin Xavier
When variable size WQE is supported, max_qp_sges reported is more than 6. For devices that supports variable size WQE, the Send WQE size calculation is wrong when an an older library that doesn't support variable size WQE is used. Set the WQE size to 128 when static WQE is supported. Fixes: de1d364c3815 ("RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters") Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1725444253-13221-3-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/bnxt_re: Fix the compatibility flag for variable size WQESelvin Xavier
For older adapters that doesn't support variable size WQE, driver is wrongly reporting that variable WQE is supported, when the latest library is used. Report the variable WQE capability only if the driver supports it. Fixes: 10a104c0debb ("RDMA/bnxt_re: Enable variable size WQEs for user space applications") Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1725444253-13221-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/mlx5: Fix MR cache temp entries cleanupMichael Guralnik
Fix the cleanup of the temp cache entries that are dynamically created in the MR cache. The cleanup of the temp cache entries is currently scheduled only when a new entry is created. Since in the cleanup of the entries only the mkeys are destroyed and the cache entry stays in the cache, subsequent registrations might reuse the entry and it will eventually be filled with new mkeys without cleanup ever getting scheduled again. On workloads that register and deregister MRs with a wide range of properties we see the cache ends up holding many cache entries, each holding the max number of mkeys that were ever used through it. Additionally, as the cleanup work is scheduled to run over the whole cache, any mkey that is returned to the cache after the cleanup was scheduled will be held for less than the intended 30 seconds timeout. Solve both issues by dropping the existing remove_ent_work and reusing the existing per-entry work to also handle the temp entries cleanup. Schedule the work to run with a 30 seconds delay every time we push an mkey to a clean temp entry. This ensures the cleanup runs on each entry only 30 seconds after the first mkey was pushed to an empty entry. As we have already been distinguishing between persistent and temp entries when scheduling the cache_work_func, it is not being scheduled in any other flows for the temp entries. Another benefit from moving to a per-entry cleanup is we now not required to hold the rb_tree mutex, thus enabling other flow to run concurrently. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/mlx5: Limit usage of over-sized mkeys from the MR cacheMichael Guralnik
When searching the MR cache for suitable cache entries, don't use mkeys larger than twice the size required for the MR. This should ensure the usage of mkeys closer to the minimal required size and reduce memory waste. On driver init we create entries for mkeys with clear attributes and powers of 2 sizes from 4 to the max supported size. This solves the issue for anyone using mkeys that fit these requirements. In the use case where an MR is registered with different attributes, like an access flag we can't UMR, we'll create a new cache entry to store it upon dereg. Without this fix, any later registration with same attributes and smaller size will use the newly created cache entry and it's mkeys, disregarding the memory waste of using mkeys larger than required. For example, one worst-case scenario can be when registering and deregistering a 1GB mkey with ATS enabled which will cause the creation of a new cache entry to hold those type of mkeys. A user registering a 4k MR with ATS will end up using the new cache entry and an mkey that can support a 1GB MR, thus wasting x250k memory than actually needed in the HW. Additionally, allow all small registration to use the smallest size cache entry that is initialized on driver load even if size is larger than twice the required size. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/mlx5: Fix counter update on MR cache mkey creationMichael Guralnik
After an mkey is created, update the counter for pending mkeys before reshceduling the work that is filling the cache. Rescheduling the work with a full MR cache entry and a wrong 'pending' counter will cause us to miss disabling the fill_to_high_water flag. Thus leaving the cache full but with an indication that it's still needs to be filled up to it's full size (2 * limit). Next time an mkey will be taken from the cache, we'll unnecessarily continue the process of filling the cache to it's full size. Fixes: 57e7071683ef ("RDMA/mlx5: Implement mkeys management via LIFO queue") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/0f44f462ba22e45f72cb3d0ec6a748634086b8d0.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/mlx5: Drop redundant work canceling from clean_keys()Michael Guralnik
The canceling of dealyed work in clean_keys() is a leftover from years back and was added to prevent races in the cleanup process of MR cache. The cleanup process was rewritten a few years ago and the canceling of delayed work and flushing of workqueue was added before the call to clean_keys(). Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/943d21f5a9dba7b98a3e1d531e3561ffe9745d71.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/erdma: Return QP state in erdma_query_qpCheng Xu
Fix qp_state and cur_qp_state to return correct values in struct ib_qp_attr. Fixes: 155055771704 ("RDMA/erdma: Add verbs implementation") Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com> Link: https://patch.msgid.link/20240902112920.58749-4-chengyou@linux.alibaba.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/erdma: Add disassociate ucontext supportCheng Xu
All IO pages mapped to user space are handled by rdma_user_mmap_io, so add empty stub for disassociate ucontext. Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com> Link: https://patch.msgid.link/20240902112920.58749-3-chengyou@linux.alibaba.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/erdma: Refactor the initialization and destruction of EQCheng Xu
We extracted the common parts of the initialization/destruction process to make the code cleaner. Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com> Link: https://patch.msgid.link/20240902112920.58749-2-chengyou@linux.alibaba.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09RDMA/mlx5: Enable ATS when allocating kernel MRsMaher Sanalla
When creating kernel MRs, it is not definitive whether they will be used for peer-to-peer transactions or for other usecases, since address mapping is performed only after the MR is created. Since peer-to-peer transactions benefit significantly from ATS performance-wise, enable ATS on newly-allocated kernel MRs when supported. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Gal Shalom <galshalom@nvidia.com> Link: https://patch.msgid.link/fafd4c9f14cf438d2882d88649c2947e1d05d0b4.1725273403.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09net/mlx5: HWS, added API and enabled HWS supportYevgeny Kliteynik
Enabling HWS support in the mlx5 driver: - added HWS API header - added HWS files in the mlx5 driver makefile - added kconfig flag that enables HWS compilation Reviewed-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added send engine and context handlingYevgeny Kliteynik
Added implementation of send engine and handling of HWS context. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added debug dump and internal headersYevgeny Kliteynik
Added debug dump of the existing HWS state, and all the required internal definitions. To dump the HWS state, cat the following debugfs node: cat /sys/kernel/debug/mlx5/<PCI>/steering/fdb/ctx_<ctx_id> Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added backward-compatible API handlingYevgeny Kliteynik
Added implementation of backward-compatible (BWC) steering API. Native HWS API is very different from SWS API: - SWS is synchronous (rule creation/deletion API call returns when the rule is created/deleted), while HWS is asynchronous (it requires polling for completion in order to know when the rule creation/deletion happened) - SWS manages its own memory (it allocates/frees all the needed memory for steering rules, while HWS requires the rules memory to be allocated/freed outside the API In order to make HWS fit the existing fs-core steering API paradigm, this patch adds implementation of backward-compatible (BWC) steering API that has the bahaviour similar to SWS: among others, it encompasses all the rules' memory management and completion polling, presenting the usual synchronous API for the upper layer. A user that wishes to utilize the full speed potential of HWS can call the HWS async API and have rule insertion/deletion batching, lower memory management overhead, and lower CPU utilization. Such approach will be taken by the future Connection Tracking. Note that BWC steering doesn't support yet rules that require more than one match STE - complex rules. This support will be added later on. Reviewed-by: Erez Shitrit <erezsh@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added memory management handlingYevgeny Kliteynik
Added object pools and buddy allocator functionality. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added vport handlingYevgeny Kliteynik
Vport is a virtual eswitch port that is associated with its virtual function (VF), physical function (PF) or sub-function (SF). This patch adds handling of vports in HWS. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added modify header pattern and args handlingYevgeny Kliteynik
Packet headers/metadta manipulations are split into two parts: - Header Modify Pattern: an object that describes which fields will be modified and in which way - Header Modify Argument: an object that provides the values to be used for header modification Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added FW commands handlingYevgeny Kliteynik
This patch adds implementation of FW object handling, such as creation/destruction, modification, and querying. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added matchers functionalityYevgeny Kliteynik
Matcher object encompasses all the building blocks that are needed in order to perform flow steering of a given flow: - flow table that serves as entering point of this matcher - Rule Table Context (RTC) objects to hold ll the Steering Table Entries (STEs), both for matching the flow and for performing actions - rules that describe the set of matching parameters for a flow and actions to perform in case of a hit. This patch adds implementation of matchers handling in HWS. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added definers handlingYevgeny Kliteynik
The Match Definer combines packet fields and a mask, creating a key which can be used for packet matching during steering flow processing. This patch adds handling of definer objects in HWS. Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-09-09net/mlx5: HWS, added rules handlingYevgeny Kliteynik
Steering rule is a concept that includes match parameters for a flow, and actions to perform on the flows that match these parameters. This patch adds rules handling part of HW Steering. Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>