2024-01-04net: wangxun: add ethtool_ops for channel numberJiawen Wu
Add support to get RX/TX queue number with ethtool -l, and set RX/TX queue number with ethtool -L. Since interrupts need to be rescheduled, adjust the allocation of MSI-X entries. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
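As a rough illustration of the ethtool_ops plumbing such a change adds (a sketch only; struct wx exists in libwx, but wx->max_queues, wx->num_queues and wx_reinit_interrupts() are assumed names, not the driver's actual code):

    static void wx_get_channels(struct net_device *netdev,
                                struct ethtool_channels *ch)
    {
            struct wx *wx = netdev_priv(netdev);

            ch->max_combined = wx->max_queues;      /* upper bound shown by ethtool -l */
            ch->combined_count = wx->num_queues;    /* currently active RX/TX queues */
    }

    static int wx_set_channels(struct net_device *netdev,
                               struct ethtool_channels *ch)
    {
            struct wx *wx = netdev_priv(netdev);

            if (!ch->combined_count || ch->combined_count > wx->max_queues)
                    return -EINVAL;

            wx->num_queues = ch->combined_count;
            /* queue count changed, so MSI-X entries must be re-allocated */
            return wx_reinit_interrupts(wx);
    }

    static const struct ethtool_ops wx_ethtool_ops = {
            .get_channels = wx_get_channels,
            .set_channels = wx_set_channels,
    };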
2024-01-04net: wangxun: add coalesce options supportJiawen Wu
Support to show RX/TX coalesce with ethtool -c and set RX/TX coalesce with ethtool -C. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04net: wangxun: add ethtool_ops for ring parametersJiawen Wu
Support to query RX/TX depth with ethtool -g, and change RX/TX depth with ethtool -G. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04net: wangxun: add flow control supportJiawen Wu
Add support to set pause params with ethtool -A and get pause params with ethtool -a, for ethernet driver txgbe and ngbe. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04net: ngbe: convert phylib to phylinkJiawen Wu
Implement phylink in ngbe driver, to handle phy uniformly for Wangxun ethernet devices. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04net: txgbe: use phylink bits added in libwxJiawen Wu
Convert txgbe to use phylink and phylink_config added in libwx. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04net: libwx: add phylink to libwxJiawen Wu
For the following implementation, add struct phylink and phylink_config to the wx structure. Add the helper function for converting phylink to wx, and implement ethtool ksettings and nway reset in libwx. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queueDavid S. Miller
Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-01-02 (ice) This series contains updates to the ice driver only. Karol adds support for capable devices to receive timestamps via interrupt rather than polling to allow for less delay. Andrii adds support for switchdev hardware packet mirroring. Jake reworks VF rebuild to avoid destroying objects that do not need to be. Jan S removes reporting of rx_len_errors as they are incorrectly reported by hardware. Jan G adds the const modifier to some uses where applicable. Kunwu Chan adds some checks for failed memory allocations. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04octeontx2-af: Re-enable MAC TX in otx2_stop processingNaveen Mamindlapalli
During QoS scheduling testing with multiple strict priority flows, the netdev tx watchdog timeout routine is invoked when a low priority QoS queue doesn't get a chance to transmit the packets because other high priority flows are completely subscribing the transmit link. The netdev tx watchdog timeout routine will stop MAC RX and TX functionality in the otx2_stop() routine before cleanup of HW TX queues, which results in SMQ flush errors because the packets belonging to low priority queues will never get flushed since MAC TX is disabled. This patch fixes the issue by re-enabling MAC TX to ensure the packets in the HW pipeline get flushed properly. Fixes: a7faa68b4e7f ("octeontx2-af: Start/Stop traffic in CGX along with NPC") Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04octeontx2-af: Always configure NIX TX link credits based on max frame sizeNaveen Mamindlapalli
Currently the NIX TX link credits are initialized based on the max frame size that can be transmitted on a link, but when the MTU is changed, the NIX TX link credits are reprogrammed by the SW based on the new MTU value. Since the SMQ max packet length is programmed to the max frame size by default, there is a chance that NIX TX may stall while sending a max frame sized packet on the link with insufficient credits to send the packet all at once. This patch avoids the stall issue by not changing the link credits dynamically when the MTU is changed. Fixes: 1c74b89171c3 ("octeontx2-af: Wait for TX link idle for credits change") Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: Nithin Kumar Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04sctp: fix busy pollingEric Dumazet
Busy polling while holding the socket lock makes little sense, because incoming packets won't reach our receive queue. Fixes: 8465a5fcd1ce ("sctp: add support for busy polling to sctp protocol") Reported-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04x86/tools: objdump_reformat.awk: Skip bad instructions from llvm-objdumpNathan Chancellor
When running the instruction decoder selftest with LLVM=1 and CONFIG_PVH=y, there is a series of warnings:

arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
arch/x86/tools/insn_decoder_test: warning: ffffffff81000050 ea <unknown>
arch/x86/tools/insn_decoder_test: warning: objdump says 1 bytes, but insn_get_length() says 7
arch/x86/tools/insn_decoder_test: warning: Decoded and checked 7214721 instructions with 1 failures

GNU objdump outputs "(bad)" instead of "<unknown>", which is already handled in the bad_expr regex, so there is no warning.

$ objdump -d arch/x86/platform/pvh/head.o | grep -E '50:\s+ea'
  50: ea (bad)

$ llvm-objdump -d arch/x86/platform/pvh/head.o | grep -E '50:\s+ea'
  50: ea <unknown>

Add "<unknown>" to the bad_expr regex to clear up the warning, allowing the instruction decoder selftest to fully pass with llvm-objdump. Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20231205-objdump_reformat-awk-handle-llvm-objdump-bad_expr-v1-1-b4a74f39396f@kernel.org
2024-01-04ALSA: hda/realtek: Fix mute and mic-mute LEDs for HP ProBook 440 G6Siddhesh Dharme
LEDs in 'HP ProBook 440 G6' laptop are controlled by ALC236 codec. Enable already existing quirk 'ALC236_FIXUP_HP_MUTE_LED_MICMUTE_VREF' to fix mute and mic-mute LEDs. Signed-off-by: Siddhesh Dharme <siddheshdharme18@gmail.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20240104060736.5149-1-siddheshdharme18@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-01-04Merge tag 'asoc-fix-v6.7-rc8' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linusTakashi Iwai
ASoC: Fixes for v6.7 I recently got a LibreTech Sapphire board for my CI and while integrating it I found and fixed some issues, including crashes for the enum validation. There's also a couple of patches adding quirks for another x86 laptop from Hans and an error handling fix for the Freescale rpmsg driver.
2024-01-03Merge branch 'libbpf-side-__arg_ctx-fallback-support'Alexei Starovoitov
Andrii Nakryiko says: ==================== Libbpf-side __arg_ctx fallback support Support __arg_ctx global function argument tag semantics even on older kernels that don't natively support it through btf_decl_tag("arg:ctx"). Patches #2-#6 are preparatory work that allows postponing BTF loading into the kernel until after all the BPF program relocations (including global func appending to main programs) are done. Patch #4 is perhaps the most important and establishes pre-created stable placeholder FDs, so that relocations can embed valid map FDs into ldimm64 instructions. Once BTF is done after relocation, what's left is to adjust BTF information to have each main program's copy of each used global subprog point to its own adjusted FUNC -> FUNC_PROTO type chain (if they use __arg_ctx) in such a way as to satisfy the type expectations of the BPF verifier regarding the PTR_TO_CTX argument definition. See patch #8 for details. Patch #8 adds a few more __arg_ctx use cases (edge cases like multiple arguments having __arg_ctx, etc.) to test_global_func_ctx_args.c, to make it simple to validate that this logic indeed works on old kernels. It does. But just to be 100% sure, patch #9 adds a test validating that libbpf uploads func_info with properly modified BTF data. v2->v3: - drop renaming patch (Alexei, Eduard); - use memfd_create() instead of /dev/null for placeholder FD (Eduard); - add one more test for validating BTF rewrite logic (Eduard); - fixed wrong -errno usage, reshuffled some BTF rewrite bits (Eduard); v1->v2: - do internal functions renaming in patch #1 (Alexei); - extract cloning of FUNC -> FUNC_PROTO information into separate function (Alexei); ==================== Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240104013847.3875810-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03selftests/bpf: add __arg_ctx BTF rewrite testAndrii Nakryiko
Add a test validating that libbpf uploads BTF and func_info with rewritten type information for arguments of global subprogs that are marked with __arg_ctx tag. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03selftests/bpf: add arg:ctx cases to test_global_funcs testsAndrii Nakryiko
Add a few extra cases of global funcs with context arguments. This time rely on the "arg:ctx" decl_tag (the __arg_ctx macro), but put it next to "classic" cases where the context argument has to be of the exact type that the BPF verifier expects (e.g., bpf_user_pt_regs_t for kprobe/uprobe). Colocating all these cases separately from other global func args that rely on arg:xxx decl tags (in verifier_global_subprogs.c) allows for simpler backwards compatibility testing on old kernels. All the cases in test_global_func_ctx_args.c are supposed to work on older kernels, which was manually validated during development. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03libbpf: implement __arg_ctx fallback logicAndrii Nakryiko
Out of all special global func arg tag annotations, __arg_ctx is practically the most immediately useful and most critical to have working across a multitude of kernel versions, if possible. This would allow end users to write much simpler code if __arg_ctx semantics worked for older kernels that don't natively understand btf_decl_tag("arg:ctx") in verifier logic. Luckily, it is possible to ensure __arg_ctx works on old kernels through a bit of extra work done by libbpf, at least in a lot of common cases. To explain the overall idea, we need to go back to how the context argument was supported in global funcs before __arg_ctx support was added. This was done based on special struct name checks in the kernel. E.g., for BPF_PROG_TYPE_PERF_EVENT the expectation is that the argument type `struct bpf_perf_event_data *` marks that argument as PTR_TO_CTX. This is all good as long as a global function is used from the same BPF program type only, which is often not the case. If the same subprog has to be called from, say, kprobe and perf_event program types, there is no single definition that would satisfy the BPF verifier. The subprog will have a context argument either for kprobe (if using the bpf_user_pt_regs_t struct name) or perf_event (with the bpf_perf_event_data struct name), but not both. This limitation was the reason to add btf_decl_tag("arg:ctx"), making the actual argument type unimportant, so that the user can just define a "generic" signature: __noinline int global_subprog(void *ctx __arg_ctx) { ... } I won't belabor how libbpf implements subprograms; see the huge comment next to the bpf_object_relocate_calls() function. The idea is that each main/entry BPF program gets its own copy of global_subprog's code appended. This per-program copy of global subprog code *and* associated func_info .BTF.ext information, pointing to the FUNC -> FUNC_PROTO BTF type chain, allows libbpf to simulate __arg_ctx behavior transparently, even if the kernel doesn't yet support __arg_ctx annotation natively. The idea is straightforward: each time we append global subprog's code and func_info information, we adjust its FUNC -> FUNC_PROTO type information, if necessary (that is, libbpf can detect the presence of btf_decl_tag("arg:ctx") just like the BPF verifier would do it). The rest is just mechanical and somewhat painful BTF manipulation code. It's painful because we need to clone FUNC -> FUNC_PROTO, instead of reusing it, as the same FUNC -> FUNC_PROTO chain might be used by another main BPF program within the same BPF object, so we can't just modify it in-place (and cloning BTF types within the same struct btf object is painful due to constant memory invalidation, see comments in the code). The uploaded BPF object's BTF information has to work for all BPF programs at the same time. Once we have the FUNC -> FUNC_PROTO clones, we make sure that instead of using some `void *ctx` parameter definition, we have the expected `struct bpf_perf_event_data *ctx` definition (as far as the BPF verifier and kernel are concerned), which will mark it as context for the BPF verifier. The same global subprog relocated and copied into another main BPF program will get different type information according to the main program's type. It all works out in the end in a completely transparent way for the end user. Libbpf maintains the program type -> expected context struct name mapping internally. Note, not all BPF program types have a named context struct, so this approach won't work for such programs (just like it didn't before __arg_ctx). So native __arg_ctx support is still important to have in the kernel for generic context support across all BPF program types. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
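For context, a minimal sketch of the user-side pattern this fallback enables (assumes the __arg_ctx macro from bpf_helpers.h, i.e. btf_decl_tag("arg:ctx"); section names and function bodies are illustrative):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    /* One "generic" global subprog callable from several program types.
     * On old kernels, libbpf rewrites each main program's copy of the
     * FUNC -> FUNC_PROTO BTF chain so the verifier sees the context
     * struct it expects (e.g. bpf_user_pt_regs_t for kprobes). */
    __noinline int global_subprog(void *ctx __arg_ctx)
    {
            return ctx ? 0 : 1; /* illustrative body */
    }

    SEC("kprobe/do_nanosleep")
    int kprobe_prog(void *ctx)
    {
            return global_subprog(ctx);
    }

    SEC("perf_event")
    int perf_prog(void *ctx)
    {
            return global_subprog(ctx);
    }

    char LICENSE[] SEC("license") = "GPL";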
2024-01-03libbpf: move BTF loading step after relocation stepAndrii Nakryiko
With all the preparations in previous patches done, we are ready to postpone the BTF loading and sanitization step until after all the relocations are performed. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03libbpf: move exception callbacks assignment logic into relocation stepAndrii Nakryiko
Move the logic of finding and assigning exception callback indices from the BTF sanitization step to the program relocations step, which seems more logical and will unblock moving BTF loading to after the relocation step. Exception callback discovery and assignment has no dependency on BTF being loaded into the kernel; it only uses BTF information. It does need to happen before subprogram relocations, though, which is why the split. No functional changes. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03libbpf: use stable map placeholder FDsAndrii Nakryiko
Move map creation to later during BPF object loading by pre-creating stable placeholder FDs (utilizing memfd_create()). Use the dup2() syscall to then atomically make those placeholder FDs point to real kernel BPF map objects. This change allows delaying BPF map creation to after all the BPF program relocations. That, in turn, allows delaying BTF finalization and loading into the kernel to after all the relocations as well. We'll take advantage of the latter in subsequent patches to allow libbpf to adjust BTF in a way that helps with BPF global function usage. Clean up a few places where we close map->fd, which now shouldn't happen, because map->fd should be a valid FD regardless of whether the map was created or not. Surprisingly and nicely, it simplifies a bunch of error handling code. If this change doesn't backfire, I'm tempted to pre-create such stable FDs for other entities (progs, maybe even BTF). We previously did some manipulations to make gen_loader work with fake map FDs; with stable map FDs this hack is not necessary for maps (we still have it for BTF, but I left it as is for now). Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
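The placeholder trick in userspace terms, as a minimal standalone sketch (not libbpf's actual code; the function names are invented for illustration, while memfd_create() and dup2() are the syscalls named in the commit):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Reserve a stable FD number early: memfd_create() yields a real,
     * valid FD with no BPF object behind it yet, so relocations can
     * safely embed this FD number into ldimm64 instructions. */
    static int reserve_placeholder_fd(void)
    {
            return memfd_create("libbpf-placeholder", MFD_CLOEXEC);
    }

    /* Later, once the real BPF map exists, atomically repoint the
     * placeholder at it; every instruction that embedded the FD number
     * stays valid. */
    static int materialize_map_fd(int placeholder_fd, int real_map_fd)
    {
            if (dup2(real_map_fd, placeholder_fd) < 0)
                    return -1;
            close(real_map_fd); /* the placeholder now owns the map reference */
            return 0;
    }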
2024-01-03libbpf: don't rely on map->fd as an indicator of map being createdAndrii Nakryiko
With the upcoming switch to preallocated placeholder FDs for maps, switch various getters/setters away from checking map->fd. Use the map_is_created() helper that detects whether a BPF map can be modified, based on map->obj->loaded state, with a special provision for maps set up with bpf_map__reuse_fd(). For backwards compatibility, we take map_is_created() into account in the bpf_map__fd() getter as well. This way, before the bpf_object__load() phase bpf_map__fd() will always return -1, just as before the changes in subsequent patches adding stable map->fd placeholders. We also get rid of all internal uses of the bpf_map__fd() getter, as it's more oriented for uses external to libbpf. The above map_is_created() check actually interferes with some of the internal uses, if the map FD is fetched through bpf_map__fd(). Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03libbpf: use explicit map reuse flag to skip map creation stepsAndrii Nakryiko
Instead of inferring whether a map already points to a previously created/pinned BPF map (which the user can specify with the bpf_map__reuse_fd() API), use the explicit map->reused flag that is set in such a case. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03libbpf: make uniform use of btf__fd() accessor inside libbpfAndrii Nakryiko
It makes future grepping and code analysis a bit easier. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240104013847.3875810-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-04x86/kprobes: fix incorrect return address calculation in kprobe_emulate_call_indirectJinghao Jia
kprobe_emulate_call_indirect kprobe_emulate_call_indirect currently uses int3_emulate_call to emulate indirect calls. However, int3_emulate_call always assumes the size of the call to be 5 bytes when calculating the return address. This is incorrect for register-based indirect calls in x86, which can be either 2 or 3 bytes depending on whether REX prefix is used. At kprobe runtime, the incorrect return address causes control flow to land onto the wrong place after return -- possibly not a valid instruction boundary. This can lead to a panic like the following: [ 7.308204][ C1] BUG: unable to handle page fault for address: 000000000002b4d8 [ 7.308883][ C1] #PF: supervisor read access in kernel mode [ 7.309168][ C1] #PF: error_code(0x0000) - not-present page [ 7.309461][ C1] PGD 0 P4D 0 [ 7.309652][ C1] Oops: 0000 [#1] SMP [ 7.309929][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.7.0-rc5-trace-for-next #6 [ 7.310397][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 7.311068][ C1] RIP: 0010:__common_interrupt+0x52/0xc0 [ 7.311349][ C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3 [ 7.312512][ C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046 [ 7.312899][ C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001 [ 7.313334][ C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4 [ 7.313702][ C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482 [ 7.314146][ C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023 [ 7.314509][ C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000 [ 7.314951][ C1] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 [ 7.315396][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7.315691][ C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0 [ 7.316153][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7.316508][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 7.316948][ C1] Call Trace: [ 7.317123][ C1] <IRQ> [ 7.317279][ C1] ? __die_body+0x64/0xb0 [ 7.317482][ C1] ? page_fault_oops+0x248/0x370 [ 7.317712][ C1] ? __wake_up+0x96/0xb0 [ 7.317964][ C1] ? exc_page_fault+0x62/0x130 [ 7.318211][ C1] ? asm_exc_page_fault+0x22/0x30 [ 7.318444][ C1] ? __cfi_native_send_call_func_single_ipi+0x10/0x10 [ 7.318860][ C1] ? default_idle+0xb/0x10 [ 7.319063][ C1] ? __common_interrupt+0x52/0xc0 [ 7.319330][ C1] common_interrupt+0x78/0x90 [ 7.319546][ C1] </IRQ> [ 7.319679][ C1] <TASK> [ 7.319854][ C1] asm_common_interrupt+0x22/0x40 [ 7.320082][ C1] RIP: 0010:default_idle+0xb/0x10 [ 7.320309][ C1] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 66 90 0f 00 2d 09 b9 3b 00 fb f4 <fa> c3 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 e9 [ 7.321449][ C1] RSP: 0018:ffffc9000009bee8 EFLAGS: 00000256 [ 7.321808][ C1] RAX: ffff88813bca8b68 RBX: 0000000000000001 RCX: 000000000001ef0c [ 7.322227][ C1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000001ef0c [ 7.322656][ C1] RBP: ffffc9000009bef8 R08: 8000000000000000 R09: 00000000000008c2 [ 7.323083][ C1] R10: 0000000000000000 R11: ffffffff81058e70 R12: 0000000000000000 [ 7.323530][ C1] R13: ffff8881002b30c0 R14: 0000000000000000 R15: 0000000000000000 [ 7.323948][ C1] ? 
__cfi_lapic_next_deadline+0x10/0x10 [ 7.324239][ C1] default_idle_call+0x31/0x50 [ 7.324464][ C1] do_idle+0xd3/0x240 [ 7.324690][ C1] cpu_startup_entry+0x25/0x30 [ 7.324983][ C1] start_secondary+0xb4/0xc0 [ 7.325217][ C1] secondary_startup_64_no_verify+0x179/0x17b [ 7.325498][ C1] </TASK> [ 7.325641][ C1] Modules linked in: [ 7.325906][ C1] CR2: 000000000002b4d8 [ 7.326104][ C1] ---[ end trace 0000000000000000 ]--- [ 7.326354][ C1] RIP: 0010:__common_interrupt+0x52/0xc0 [ 7.326614][ C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3 [ 7.327570][ C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046 [ 7.327910][ C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001 [ 7.328273][ C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4 [ 7.328632][ C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482 [ 7.329223][ C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023 [ 7.329780][ C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000 [ 7.330193][ C1] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 [ 7.330632][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7.331050][ C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0 [ 7.331454][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7.331854][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 7.332236][ C1] Kernel panic - not syncing: Fatal exception in interrupt [ 7.332730][ C1] Kernel Offset: disabled [ 7.333044][ C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- The relevant assembly code is (from objdump, faulting address highlighted): ffffffff8102ed9d: 41 ff d3 call *%r11 ffffffff8102eda0: 65 48 <8b> 05 30 c7 ff mov %gs:0x7effc730(%rip),%rax The emulation incorrectly sets the return address to be ffffffff8102ed9d + 0x5 = ffffffff8102eda2, which is the 8b byte in the middle of the next mov. This in turn causes incorrect subsequent instruction decoding and eventually triggers the page fault above. Instead of invoking int3_emulate_call, perform push and jmp emulation directly in kprobe_emulate_call_indirect. At this point we can obtain the instruction size from p->ainsn.size so that we can calculate the correct return address. Link: https://lore.kernel.org/all/20240102233345.385475-1-jinghao7@illinois.edu/ Fixes: 6256e668b7af ("x86/kprobes: Use int3 instead of debug trap for single-step") Cc: stable@vger.kernel.org Signed-off-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
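The shape of the fix, per the description above (a sketch: int3_emulate_push(), int3_emulate_jmp(), regs_get_register() and INT3_INSN_SIZE are real x86 helpers, but surrounding details such as the addrmode_regoffs lookup are reproduced from memory and may differ from the tree):

    static void kprobe_emulate_call_indirect(struct kprobe *p, struct pt_regs *regs)
    {
            unsigned long offs = addrmode_regoffs[p->ainsn.indirect.reg];

            /* Return address = probed insn address + its real size (2 or 3
             * bytes), instead of the 5 bytes int3_emulate_call() assumes. */
            int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + p->ainsn.size);
            int3_emulate_jmp(regs, regs_get_register(regs, offs));
    }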
2024-01-03Merge branch 'bpf-reduce-memory-usage-for-bpf_global_percpu_ma'Alexei Starovoitov
Yonghong Song says: ==================== bpf: Reduce memory usage for bpf_global_percpu_ma Currently when a bpf program intends to allocate memory for a percpu kptr, the verifier will call bpf_mem_alloc_init() to prefill all supported unit sizes, and this causes very large memory consumption for systems with a large number of cpus. For example, for a 128-cpu system, the total memory consumption with the initial prefill is ~175MB. Things will become worse for systems with even more cpus. Patch 1 avoids unnecessary extra percpu memory allocation. Patch 2 adds objcg to bpf_mem_alloc at init stage so objcg can be associated with the root cgroup and objcg can be passed to later bpf_mem_alloc_percpu_unit_init(). Patch 3 addresses the memory consumption issue by avoiding prefilling with all unit sizes, i.e. only prefilling with the user specified size. Patch 4 further reduces memory consumption by limiting the number of prefill entries for percpu memory allocation. Patch 5 uses much smaller low/high watermarks for percpu allocation to reduce memory consumption. Patch 6 rejects percpu memory allocation with bpf_global_percpu_ma when the allocation size is greater than 512 bytes. Patch 7 fixes the test_bpf_ma test due to Patch 5. Patch 8 adds one test to show the verification failure log message. Changelogs: v5 -> v6: . Change bpf_mem_alloc_percpu_init() to add objcg as one of the parameters. For bpf_global_percpu_ma, the objcg is NULL, corresponding to the root memcg. v4 -> v5: . Do not do bpf_global_percpu_ma initialization at init stage, instead doing initialization when the verifier knows it is going to be used by a bpf prog. . Use much smaller low/high watermarks for percpu allocation. v3 -> v4: . Add objcg to bpf_mem_alloc during init stage. . Initialize objcg at init stage but use it in bpf_mem_alloc_percpu_unit_init(). . Remove check_obj_size() in bpf_mem_alloc_percpu_unit_init(). v2 -> v3: . Clear the bpf_mem_cache if prefill fails. . Change test_bpf_ma percpu allocation tests to use bucket_size as the allocation size instead of bucket_size - 8. . Remove the __GFP_ZERO flag from the __alloc_percpu_gfp() call. v1 -> v2: . Avoid unnecessary extra percpu memory allocation. . Add a separate function to do bpf_global_percpu_ma initialization. . Promote function static 'sizes' array to file static. . Add comments to explain why only one item is refilled for percpu alloc. ==================== Link: https://lore.kernel.org/r/20231222031729.1287957-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03selftests/bpf: Add a selftest with > 512-byte percpu allocation sizeYonghong Song
Add a selftest to capture the verification failure when the allocation size is greater than 512. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031812.1293190-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03selftests/bpf: Cope with 512 bytes limit with bpf_global_percpu_maYonghong Song
In the previous patch, the maximum data size for bpf_global_percpu_ma is 512 bytes. This breaks the selftest test_bpf_ma. The test is adjusted in two aspects: - Since the maximum allowed data size for bpf_global_percpu_ma is 512, remove all tests beyond that, namely sizes 1024, 2048 and 4096. - Previously the percpu data size was bucket_size - 8 in order to avoid percpu allocation spilling into the next bucket. This patch removes that data size adjustment, thanks to Patch 1. Also, a better way to generate the BTF type is used than adding a member to the value struct. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031807.1292853-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03bpf: Limit up to 512 bytes for bpf_global_percpu_ma allocationYonghong Song
For percpu data structure allocation with bpf_global_percpu_ma, the maximum data size is 4K. But for a system with a large number of cpus, a bigger data size (e.g., 2K, 4K) might consume a lot of memory. For example, the percpu memory consumption with unit size 2K and 1024 cpus will be 2K * 1K * 1K = 2GB memory. We should discourage such usage. Let us limit the maximum data size to 512 bytes for bpf_global_percpu_ma allocation. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031801.1290841-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
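Conceptually, the verifier-side cap amounts to something like this sketch (only the 512-byte limit comes from the commit; the macro and helper names here are illustrative):

    #define BPF_GLOBAL_PERCPU_MA_MAX_SIZE 512

    /* Reject bpf_percpu_obj_new() for types larger than 512 bytes. */
    static int check_percpu_alloc_size(u32 size)
    {
            if (size > BPF_GLOBAL_PERCPU_MA_MAX_SIZE)
                    return -E2BIG;
            return 0;
    }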
2024-01-03bpf: Use smaller low/high marks for percpu allocationYonghong Song
Currently, refill low/high marks are set with the assumption of normal non-percpu memory allocation. For example, for an allocation size 256, for non-percpu memory allocation, low mark is 32 and high mark is 96, resulting in the batch allocation of 48 elements and the allocated memory will be 48 * 256 = 12KB for this particular cpu. Assuming an 128-cpu system, the total memory consumption across all cpus will be 12K * 128 = 1.5MB memory. This might be okay for non-percpu allocation, but may not be good for percpu allocation, which will consume 1.5MB * 128 = 192MB memory in the worst case if every cpu has a chance of memory allocation. In practice, percpu allocation is very rare compared to non-percpu allocation. So let us have smaller low/high marks which can avoid unnecessary memory consumption. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03bpf: Refill only one percpu element in memallocYonghong Song
Typically for a percpu map element or data structure, once allocated, most operations are lookup or in-place update. Deletions are really rare. Currently, for percpu data structures, 4 elements will be refilled if the size is <= 256. Let us just refill one element for percpu data. For example, for size 256 and 128 cpus, the potential saving will be 3 * 256 * 128 * 128 = 12MB. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03bpf: Allow per unit prefill for non-fix-size percpu memory allocatorYonghong Song
Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") added support for non-fix-size percpu memory allocation. Such an allocation will allocate percpu memory for all buckets on all cpus, and the memory consumption is on the order of quadratic. For example, let us say, 4 cpus, unit size 16 bytes, so each cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. Then let us say, 8 cpus with the same unit size, each cpu has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. So if the number of cpus doubles, the memory consumption will be 4 times as large. So for a system with a large number of cpus, the memory consumption goes up quickly with quadratic order. For example, for 4KB percpu allocation and 128 cpus, the total memory consumption will be 4KB * 128 * 128 = 64MB. Things will become worse if the number of cpus is bigger (e.g., 512, 1024, etc.) In commit 41a5db8d8161, the non-fix-size percpu memory allocation was done at boot time, so for a system with a large number of cpus, the initial percpu memory consumption is very visible. For example, for a 128 cpu system, the total percpu memory allocation will be at least (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) * 128 * 128 = ~138MB, which is pretty big. It will be even bigger for a larger number of cpus. Note that the current prefill also allocates 4 entries if the unit size is less than 256. So on top of the 138MB memory consumption, this will add more consumption with 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB. The next patch will try to reduce this memory consumption. Later on, commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init stage") moved the non-fix-size percpu memory allocation to the bpf verification stage. Once a particular bpf_percpu_obj_new() is called by a bpf program, the memory allocator will try to fill in the cache with all sizes, causing the same amount of percpu memory consumption as in the boot stage. To reduce the initial percpu memory consumption for non-fix-size percpu memory allocation, instead of filling the cache with all supported allocation sizes, this patch intends to fill the cache only for the requested size. As typically users will not use large percpu data structures, this can save memory significantly. For example, with an allocation size of 64 bytes and 128 cpus, the total percpu memory amount will be 64 * 128 * 128 = 1MB, much less than the previous 138MB. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03bpf: Add objcg to bpf_mem_allocYonghong Song
The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's share the same objcg. This patch makes such a property explicit. The next patch will use this property to save and restore the objcg for the percpu unit allocator. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-03bpf: Avoid unnecessary extra percpu memory allocationYonghong Song
Currently, for percpu memory allocation, say if the user requests an allocation size of 32 bytes, the actually calculated size will be 40 bytes, which is further rounded up to 64 bytes, and eventually 64 bytes are allocated, wasting 32 bytes of memory. Change bpf_mem_alloc() to calculate the cache index based on the user-provided allocation size so the unnecessary extra memory can be avoided. Suggested-by: Hou Tao <houtao1@huawei.com> Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
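Roughly, the change indexes the size buckets by the user-requested size instead of the metadata-padded size; a standalone sketch over the unit sizes mentioned in this series (the helper name is invented for illustration):

    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    static const int sizes[] = {16, 32, 64, 96, 128, 196, 256,
                                512, 1024, 2048, 4096};

    static int size_to_cache_idx(size_t req)
    {
            int i;

            /* A 32-byte percpu request now maps to the 32-byte cache;
             * previously 8 bytes of metadata were added first, pushing it
             * into the 64-byte bucket. */
            for (i = 0; i < ARRAY_SIZE(sizes); i++)
                    if (req <= (size_t)sizes[i])
                            return i;
            return -1;
    }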
2024-01-03net: kcm: fix direct access to bv_lenMina Almasry
Minor fix for kcm: code wanting to access the fields inside an skb frag should use the skb_frag_*() helpers, instead of accessing the fields directly. Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20240102205959.794513-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
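Illustratively, the accessor change looks like this (not the exact kcm hunk; skb_frag_size() is the real helper, the wrapper around it is just for demonstration):

    static unsigned int frag_len(const struct sk_buff *skb, int fragidx)
    {
            const skb_frag_t *frag = &skb_shinfo(skb)->frags[fragidx];

            /* before: frag->bv_len reached into the bio_vec directly and
             * assumed page-backed memory */
            return skb_frag_size(frag); /* after: backing-agnostic helper */
    }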
2024-01-03vsock/virtio: use skb_frag_*() helpersMina Almasry
Minor fix for virtio: code wanting to access the fields inside an skb frag should use the skb_frag_*() helpers, instead of accessing the fields directly. This allows for extensions where the underlying memory is not a page. Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20240102205905.793738-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net/sched: sch_api: conditional netlink notificationsPedro Tammela
Implement conditional netlink notifications for Qdiscs and classes, which were missing in the initial patches that targeted tc filters and actions. Notifications will only be built after passing a check for 'rtnl_notify_needed()'. For both Qdiscs and classes, 'get' operations now call a dedicated notification function, as it was not possible to distinguish between 'create' and 'get' before. This distinction is necessary because 'get' always sends a notification. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231229132642.1489088-2-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net/sched: introduce ACT_P_BOUND return codePedro Tammela
Bound actions always return '0' and as of today we rely on '0' being returned in order to properly skip bound actions in tcf_idr_insert_many. In order to further improve maintainability, introduce the ACT_P_BOUND return code. Actions are updated to return 'ACT_P_BOUND' instead of plain '0'. tcf_idr_insert_many is then updated to check for 'ACT_P_BOUND'. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231229132642.1489088-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
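Conceptually (only the ACT_P_BOUND name is from the commit; its numeric value and the surrounding loop are illustrative of tcf_idr_insert_many's skip logic, not the actual patch):

    /* alongside the existing ACT_P_CREATED/ACT_P_DELETED codes */
    #define ACT_P_BOUND 3 /* numeric value illustrative */

    static void insert_many_sketch(struct tc_action *actions[],
                                   int init_res[], int n)
    {
            int i;

            for (i = 0; i < n; i++) {
                    /* bound actions now return ACT_P_BOUND instead of 0 */
                    if (init_res[i] == ACT_P_BOUND)
                            continue;
                    /* ... insert the newly created action into the idr ... */
            }
    }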
2024-01-03net-device: move xdp_prog to net_device_read_rxEric Dumazet
xdp_prog is used in the receive path, both from XDP-enabled drivers and from netif_elide_gro(). This patch also removes two 4-byte holes. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240102162220.750823-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03Merge branch '10GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-01-02 (ixgbe, i40e) This series contains updates to ixgbe and i40e drivers. Ovidiu Panait adds reporting of VF link state to ixgbe. Jedrzej removes uses of IXGBE_ERR* codes to instead use standard error codes. Andrii modifies behavior of VF disable to properly shut down queues on i40e. Simon Horman removes, undesired, use of comma operator for i40e. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: i40e: Avoid unnecessary use of comma operator i40e: Fix VF disable behavior to block all traffic ixgbe: Refactor returning internal error codes ixgbe: Refactor overtemp event handling ixgbe: report link state for VF devices ==================== Link: https://lore.kernel.org/r/20240102222429.699129-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03Merge tag 'nf-24-01-03' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix nat packets in the related state in OVS, from Brad Cowie. 2) Drop chain reference counter on error path in case chain binding fails. * tag 'nf-24-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_immediate: drop chain reference counter on error netfilter: nf_nat: fix action not being set for all ct states ==================== Link: https://lore.kernel.org/r/20240103113001.137936-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03Merge branch 'ena-driver-xdp-changes'Jakub Kicinski
David Arinzon says: ==================== ENA driver XDP changes This patchset contains multiple XDP-related changes in the ENA driver, including moving the XDP code to dedicated files. ==================== Link: https://lore.kernel.org/r/20240101190855.18739-1-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Take xdp packets stats into account in ena_get_stats64()David Arinzon
Queue stats using ifconfig and ip are retrieved via ena_get_stats64(). This function currently does not take the xdp sent or dropped packets stats into account. This commit adds the following xdp stats to ena_get_stats64():

tx bytes sent
tx packets sent
rx dropped packets

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-12-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Make queue stats code cleaner by removing the if blockDavid Arinzon
Also shorten comment related to it. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-11-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Always register RX queue infoDavid Arinzon
The RX queue info contains information about the RX queue which might be relevant to the kernel. To avoid having to configure this queue differently for different scenarios, this patch moves the RX queue configuration to the ena_up()/ena_down() functions so that it is reconfigured on every interface state toggle. Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-10-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
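Assuming the "RX queue info" here is the kernel's xdp_rxq_info, the up/down symmetry looks roughly like this (xdp_rxq_info_reg()/xdp_rxq_info_unreg() are the real APIs; the placement and surrounding fields are illustrative):

    /* in ena_up(), for each RX queue: registered on every ifup */
    rc = xdp_rxq_info_reg(&rx_ring->xdp_rxq, netdev,
                          rx_ring->qid, 0 /* napi_id */);
    if (rc)
            goto err_setup_rx;

    /* in ena_down(), the mirror image: */
    xdp_rxq_info_unreg(&rx_ring->xdp_rxq);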
2024-01-03net: ena: Add more debug prints to XDP related functionDavid Arinzon
Used for better readability and debugging of the XDP flow. Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-9-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Refactor napi functionsDavid Arinzon
This patch focuses on changes to the XDP part of the napi polling routine.

1. Update the `napi_comp` stat only when napi is actually complete.
2. Simplify the code by using a function pointer to the right napi routine (XDP vs non-XDP path).
3. Remove unnecessary local variables.
4. Adjust a debug print to show the processed XDP frame index rather than the pointer.

Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-8-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Don't check if XDP program is loaded in ena_xdp_execute()David Arinzon
This check is already done in ena_clean_rx_irq(), which indirectly calls it. This function is called in napi context, and the driver doesn't allow changing the XDP program without destroying and reinitializing the napi context (part of the ena_down/ena_up sequence). Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-7-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Use tx_ring instead of xdp_ring for XDP channel TXDavid Arinzon
When an XDP program is loaded, the existing channels in the driver are split into two halves:

- The first half of the channels contain RX and TX rings; these queues are used for receiving traffic and sending packets originating from the kernel.
- The second half of the channels contain only a TX ring. These queues are used for sending packets that were redirected using XDP_TX or XDP_REDIRECT.

Referring to the queues in the second half of the channels as "xdp_ring" can be confusing and may give the impression that ENA has the capability to generate an additional special queue. This patch ensures that the xdp_ring field is exclusively used to describe the XDP TX queue that a specific RX queue needs to utilize when forwarding packets with XDP TX and XDP REDIRECT, preserving the integrity of the xdp_ring field in ena_ring. Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-6-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net: ena: Introduce total_tx_size field in ena_tx_buffer structDavid Arinzon
To avoid de-referencing skb or xdp_frame when we poll for TX completion (where they might not be in the cache), save the total TX packet size in the ena_tx_buffer object representing the packet. Also the 'print_once' field's type was changed from u32 to u8 to allow adding the 'total_tx_size' without changing the total size of the struct. Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Link: https://lore.kernel.org/r/20240101190855.18739-5-darinzon@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
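The layout change described reads roughly like this (field names per the commit text; surrounding fields and placement are illustrative, not the verbatim driver struct):

    struct ena_tx_buffer {
            struct sk_buff *skb;
            /* ... */
            u8 print_once;          /* shrunk from u32 so the struct size
                                     * stays the same */
            /* ... */
            u32 total_tx_size;      /* cached packet length, read at TX
                                     * completion without dereferencing the
                                     * (likely cache-cold) skb/xdp_frame */
            /* ... */
    };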