Age | Commit message (Collapse) | Author |
|
There are 3 paths that finally call do_setlink(), and validate_linkmsg()
is called in each path.
1. RTM_NEWLINK
1-1. dev is found in __rtnl_newlink()
1-2. dev isn't found, but IFLA_GROUP is specified in
rtnl_group_changelink()
2. RTM_SETLINK
The next patch factorises 1-1 to a separate function.
As a preparation, let's move validate_linkmsg() calls to do_setlink().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will move linkinfo to rtnl_newlink() and pass it down to other
functions.
Let's pack it into rtnl_newlink_tbs.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
objpool intends to use vmalloc for default (non-atomic) allocations of
percpu slots and objects. However, the condition checking if GFP flags
set any bit of GFP_ATOMIC is wrong b/c GFP_ATOMIC is a combination of bits
(__GFP_HIGH|__GFP_KSWAPD_RECLAIM) and so `pool->gfp & GFP_ATOMIC` will
be true if either bit is set. Since GFP_ATOMIC and GFP_KERNEL share the
___GFP_KSWAPD_RECLAIM bit, kmalloc will be used in cases when GFP_KERNEL
is specified, i.e. in all current usages of objpool.
This may lead to unexpected OOM errors since kmalloc cannot allocate
large amounts of memory.
For instance, objpool is used by fprobe rethook which in turn is used by
BPF kretprobe.multi and kprobe.session probe types. Trying to attach
these to all kernel functions with libbpf using
SEC("kprobe.session/*")
int kprobe(struct pt_regs *ctx)
{
[...]
}
fails on objpool slot allocation with ENOMEM.
Fix the condition to truly use vmalloc by default.
Link: https://lore.kernel.org/all/20240826060718.267261-1-vmalik@redhat.com/
Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC")
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Matt Wu <wuqiang.matt@bytedance.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Commit 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX
instructions") unconditionally added -march=x86-64-v2 to the CFLAGS used
to build the KVM selftests which does not work on non-x86 architectures:
cc1: error: unknown value ‘x86-64-v2’ for ‘-march’
Fix this by making the addition of this x86 specific command line flag
conditional on building for x86.
Fixes: 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions")
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was attempted by using the dev_name in the slab cache name, but as
Omar Sandoval pointed out, that can be an arbitrary string, eg something
like "/dev/root". Which in turn trips verify_dirent_name(), which fails
if a filename contains a slash.
So just make it use a sequence counter, and make it an atomic_t to avoid
any possible races or locking issues.
Reported-and-tested-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/all/ZxafcO8KWMlXaeWE@telecaster.dhcp.thefacebook.com/
Fixes: 79efebae4afc ("9p: Avoid creating multiple slab caches with the same name")
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix the guest view of the ID registers, making the relevant fields
writable from userspace (affecting ID_AA64DFR0_EL1 and
ID_AA64PFR1_EL1)
- Correcly expose S1PIE to guests, fixing a regression introduced in
6.12-rc1 with the S1POE support
- Fix the recycling of stage-2 shadow MMUs by tracking the context
(are we allowed to block or not) as well as the recycling state
- Address a couple of issues with the vgic when userspace
misconfigures the emulation, resulting in various splats. Headaches
courtesy of our Syzkaller friends
- Stop wasting space in the HYP idmap, as we are dangerously close to
the 4kB limit, and this has already exploded in -next
- Fix another race in vgic_init()
- Fix a UBSAN error when faking the cache topology with MTE enabled
RISCV:
- RISCV: KVM: use raw_spinlock for critical section in imsic
x86:
- A bandaid for lack of XCR0 setup in selftests, which causes trouble
if the compiler is configured to have x86-64-v3 (with AVX) as the
default ISA. Proper XCR0 setup will come in the next merge window.
- Fix an issue where KVM would not ignore low bits of the nested CR3
and potentially leak up to 31 bytes out of the guest memory's
bounds
- Fix case in which an out-of-date cached value for the segments
could by returned by KVM_GET_SREGS.
- More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL
- Override MTRR state for KVM confidential guests, making it WB by
default as is already the case for Hyper-V guests.
Generic:
- Remove a couple of unused functions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
RISCV: KVM: use raw_spinlock for critical section in imsic
KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups
KVM: selftests: x86: Avoid using SSE/AVX instructions
KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
x86/kvm: Override default caching mode for SEV-SNP and TDX
KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic
KVM: Remove unused kvm_vcpu_gfn_to_pfn
KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration
KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
KVM: arm64: Fix shift-out-of-bounds bug
KVM: arm64: Shave a few bytes from the EL2 idmap code
KVM: arm64: Don't eagerly teardown the vgic on init error
KVM: arm64: Expose S1PIE to guests
KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull uprobe fix from Masami Hiramatsu:
- uprobe: avoid out-of-bounds memory access of fetching args
Uprobe trace events can cause out-of-bounds memory access when
fetching user-space data which is bigger than one page, because it
does not check the local CPU buffer size when reading the data. This
checks the read data size and cut it down to the local CPU buffer
size.
* tag 'probes-fixes-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
uprobe: avoid out-of-bounds memory access of fetching args
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"afs:
- Fix a lock recursion in afs_wake_up_async_call() on ->notify_lock
netfs:
- Drop the references to a folio immediately after the folio has been
extracted to prevent races with future I/O collection
- Fix a documenation build error
- Downgrade the i_rwsem for buffered writes to fix a cifs reported
performance regression when switching to netfslib
vfs:
- Explicitly return -E2BIG from openat2() if the specified size is
unexpectedly large. This aligns openat2() with other extensible
struct based system calls
- When copying a mount namespace ensure that we only try to remove
the new copy from the mount namespace rbtree if it has already been
added to it
nilfs:
- Clear the buffer delay flag when clearing the buffer state clags
when a buffer head is discarded to prevent a kernel OOPs
ocfs2:
- Fix an unitialized value warning in ocfs2_setattr()
proc:
- Fix a kernel doc warning"
* tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
proc: Fix W=1 build kernel-doc warning
afs: Fix lock recursion
fs: Fix uninitialized value issue in from_kuid and from_kgid
fs: don't try and remove empty rbtree node
netfs: Downgrade i_rwsem for a buffered write
nilfs2: fix kernel bug due to missing clearing of buffer delay flag
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
netfs: fix documentation build error
netfs: In readahead, put the folio refs as soon extracted
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"Fix a regression in mpi that broke RSA"
* tag 'v6.12-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: lib/mpi - Fix an "Uninitialized scalar variable" issue
|
|
There are two pages in one TLB entry on LoongArch system. For kernel
space, it requires both two pte entries (buddies) with PAGE_GLOBAL bit
set, otherwise HW treats it as non-global tlb, there will be potential
problems if tlb entry for kernel space is not global. Such as fail to
flush kernel tlb with the function local_flush_tlb_kernel_range() which
supposed only flush tlb with global bit.
Kernel address space areas include percpu, vmalloc, vmemmap, fixmap and
kasan areas. For these areas both two consecutive page table entries
should be enabled with PAGE_GLOBAL bit. So with function set_pte() and
pte_clear(), pte buddy entry is checked and set besides its own pte
entry. However it is not atomic operation to set both two pte entries,
there is problem with test_vmalloc test case.
So function kernel_pte_init() is added to init a pte table when it is
created for kernel address space, and the default initial pte value is
PAGE_GLOBAL rather than zero at beginning. Then only its own pte entry
need update with function set_pte() and pte_clear(), nothing to do with
the pte buddy entry.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Not all tasks have a vDSO mapped, for example kthreads never do. If such
a task ever ends up calling stack_top(), it will derefence the NULL vdso
pointer and crash.
This can for example happen when using kunit:
[<9000000000203874>] stack_top+0x58/0xa8
[<90000000002956cc>] arch_pick_mmap_layout+0x164/0x220
[<90000000003c284c>] kunit_vm_mmap_init+0x108/0x12c
[<90000000003c1fbc>] __kunit_add_resource+0x38/0x8c
[<90000000003c2704>] kunit_vm_mmap+0x88/0xc8
[<9000000000410b14>] usercopy_test_init+0xbc/0x25c
[<90000000003c1db4>] kunit_try_run_case+0x5c/0x184
[<90000000003c3d54>] kunit_generic_run_threadfn_adapter+0x24/0x48
[<900000000022e4bc>] kthread+0xc8/0xd4
[<9000000000200ce8>] ret_from_kernel_thread+0xc/0xa4
Fixes: 803b0fc5c3f2 ("LoongArch: Add process management")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
The current size of vDSO code mapping is hardcoded to PAGE_SIZE. This
cannot work for 4KB page size after commit 18efd0b10e0fd77 ("LoongArch:
vDSO: Wire up getrandom() vDSO implementation") because the code size
increases to 8KB. Thus set the code mapping size to its real size, i.e.
PAGE_ALIGN(vdso_end - vdso_start).
Fixes: 18efd0b10e0fd77 ("LoongArch: vDSO: Wire up getrandom() vDSO implementation")
Reviewed-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Unaligned access exception can be triggered in irq-enabled context such
as user mode, in this case do_ale() may call get_user() which may cause
sleep. Then we will get:
BUG: sleeping function called from invalid context at arch/loongarch/kernel/access-helper.h:7
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 129, name: modprobe
preempt_count: 0, expected: 0
RCU nest depth: 0, expected: 0
CPU: 0 UID: 0 PID: 129 Comm: modprobe Tainted: G W 6.12.0-rc1+ #1723
Tainted: [W]=WARN
Stack : 9000000105e0bd48 0000000000000000 9000000003803944 9000000105e08000
9000000105e0bc70 9000000105e0bc78 0000000000000000 0000000000000000
9000000105e0bc78 0000000000000001 9000000185e0ba07 9000000105e0b890
ffffffffffffffff 9000000105e0bc78 73924b81763be05b 9000000100194500
000000000000020c 000000000000000a 0000000000000000 0000000000000003
00000000000023f0 00000000000e1401 00000000072f8000 0000007ffbb0e260
0000000000000000 0000000000000000 9000000005437650 90000000055d5000
0000000000000000 0000000000000003 0000007ffbb0e1f0 0000000000000000
0000005567b00490 0000000000000000 9000000003803964 0000007ffbb0dfec
00000000000000b0 0000000000000007 0000000000000003 0000000000071c1d
...
Call Trace:
[<9000000003803964>] show_stack+0x64/0x1a0
[<9000000004c57464>] dump_stack_lvl+0x74/0xb0
[<9000000003861ab4>] __might_resched+0x154/0x1a0
[<900000000380c96c>] emulate_load_store_insn+0x6c/0xf60
[<9000000004c58118>] do_ale+0x78/0x180
[<9000000003801bc8>] handle_ale+0x128/0x1e0
So enable IRQ if unaligned access exception is triggered in irq-enabled
context to fix it.
Cc: stable@vger.kernel.org
Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
In loongson_sysconf, The "core" of cores_per_node and cores_per_package
stands for a logical core, which means in a SMT system it stands for a
thread indeed. This information is gotten from SMBIOS Type4 Structure,
so in order to get a correct cores_per_package for both SMT and non-SMT
systems in parse_cpu_table() we should use SMBIOS_THREAD_PACKAGE_OFFSET
instead of SMBIOS_CORE_PACKAGE_OFFSET.
Cc: stable@vger.kernel.org
Reported-by: Chao Li <lichao@loongson.cn>
Tested-by: Chao Li <lichao@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
The information contained in the comment for LOONGARCH_CSR_ERA is even
less informative than the macro itself, which can cause confusion for
junior developers. Let's use the full English term.
Signed-off-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Tariq Toukan says:
====================
net/mlx5: Refactor esw QoS to support generalized operations
This patch series from the team to mlx5 core driver consists of one main
QoS part followed by small misc patches.
This main part (patches 1 to 11) by Carolina refactors the QoS handling
to generalize operations on scheduling groups and vports. These changes
are necessary to support new features that will extend group
functionality, introduce new group types, and support deeper
hierarchies.
Additionally, this refactor updates the terminology from "group" to
"node" to better reflect the hardware’s rate hierarchy and its use
of scheduling element nodes.
Simplify group scheduling element creation:
- net/mlx5: Refactor QoS group scheduling element creation
Refactor to support generalized operations for QoS:
- net/mlx5: Introduce node type to rate group structure
- net/mlx5: Add parent group support in rate group structure
- net/mlx5: Restrict domain list insertion to root TSAR ancestors
- net/mlx5: Rename vport QoS group reference to parent
- net/mlx5: Introduce node struct and rename group terminology to node
- net/mlx5: Refactor vport scheduling element creation function
- net/mlx5: Refactor vport QoS to use scheduling node structure
- net/mlx5: Remove vport QoS enabled flag
Support generalized operations for QoS elements:
- net/mlx5: Simplify QoS scheduling element configuration
- net/mlx5: Generalize QoS operations for nodes and vports
On top, patch 12 by Moshe handles FW request to move to drop mode.
In patch 13, Benjamin Poirier removes an empty eswitch flow table when
not used, which improves packet processing performance.
Patches 14 and 15 by Moshe are small field renamings as preparation for
future fields addition to these structures.
Series generated against:
commit c531f2269a53 ("net: bcmasp: enable SW timestamping")
====================
Link: https://patch.msgid.link/20241016173617.217736-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
As preparation for HW Steering support, rename modify header struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code line shorter and
more readable.
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
As preparation for HW Steering support, rename packet reformat struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code line shorter and
more readable.
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, when VFs are created, two flow tables are added for the eswitch:
the "fdb" table, which contains rules for each VF and the "vepa_fdb" table.
In the default VEB mode, the vepa_fdb table is empty. When switching to
VEPA mode, flow steering rules are added to vepa_fdb. Even though the
vepa_fdb table is empty in VEB mode, its presence adds some cost to packet
processing. In some workloads, this leads to drops which are reported by
the rx_discards_phy ethtool counter.
In order to improve performance, only create vepa_fdb when in VEPA mode.
Tests were done on a ConnectX-6 Lx adapter forwarding 64B packets between
both ports using dpdk-testpmd. Numbers are Rx-pps for each port, as
reported by testpmd.
Without changes:
traffic to unknown mac
testpmd on PF
numvfs=0,0
35257998,35264499
numvfs=1,1
24590124,24590888
testpmd on VF with numvfs=1,1
20434338,20434887
traffic to VF mac
testpmd on VF with numvfs=1,1
30341014,30340749
With changes:
traffic to unknown mac
testpmd on PF
numvfs=0,0
35404361,35383378
numvfs=1,1
29801247,29790757
testpmd on VF with numvfs=1,1
24310435,24309084
traffic to VF mac
testpmd on VF with numvfs=1,1
34811436,34781706
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
On sync reset flow, firmware may request a PF, which already
acknowledged the unload event, to move to drop mode. Drop mode means
that this PF will reduce polling frequency, as this PF is not going to
have another active part in the reset, but only reload back after the
reset.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Refactor QoS normalization and rate calculation functions to operate
on mlx5_esw_sched_node, allowing for generalized handling of both
vports and nodes.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Simplify the configuration of QoS scheduling elements by removing the
separate functions `esw_qos_node_config` and `esw_qos_vport_config`.
Instead, directly use the existing `esw_qos_sched_elem_config` function
for both nodes and vports.
This unification helps in generalizing operations on scheduling
elements nodes.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Remove the `enabled` flag from the `vport->qos` struct, as QoS now
relies solely on the `sched_node` pointer to determine whether QoS
features are in use.
Currently, the vport `qos` struct consists only of the `sched_node`,
introducing an unnecessary two-level reference. However, the qos struct
is retained as it will be extended in future patches to support new QoS
features.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Refactor the vport QoS structure by moving group membership and
scheduling details into the `mlx5_esw_sched_node` structure.
This change consolidates the vport into the rate hierarchy by unifying
the handling of different types of scheduling element nodes.
In addition, add a direct reference to the mlx5_vport within the
mlx5_esw_sched_node structure, to ensure that the vport is easily
accessible when a scheduling node is associated with a vport.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Modify the vport scheduling element creation function to get the parent
node directly, aligning it with the group creation function.
This ensures a consistent flow for scheduling elements creation, as the
parent nodes already contain the device and parent element index.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Introduce the `mlx5_esw_sched_node` struct, consolidating all rate
hierarchy related details, including membership and scheduling
parameters.
Since the group concept aligns with the `mlx5_esw_sched_node`, replace
the `mlx5_esw_rate_group` struct with it and rename the "group"
terminology to "node" throughout the rate hierarchy.
All relevant code paths and structures have been updated to use the
"node" terminology accordingly, laying the groundwork for future
patches that will unify the handling of different types of members
within the rate hierarchy.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Rename the `group` field in the `mlx5_vport` structure to `parent` to
clarify the vport's role as a member of a parent group and distinguish
it from the concept of a general group.
Additionally, rename `group_entry` to `parent_entry` to reflect this
update.
This distinction will be important for handling more complex group
structures and scheduling elements.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Update the logic for adding rate groups to the E-Switch domain list,
ensuring only groups with the root Transmit Scheduling Arbiter as their
parent are included.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Introduce a `parent` field in the `mlx5_esw_rate_group` structure to
support hierarchical group relationships.
The `parent` can reference another group or be set to `NULL`,
indicating the group is connected to the root TSAR.
This change enables the ability to manage groups in a hierarchical
structure for future enhancements.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Introduce the `sched_node_type` enum to represent both the group and
its members as scheduling nodes in the rate hierarchy.
Add the `type` field to the rate group structure to specify the type of
the node membership in the rate hierarchy.
Generalize comments to reflect this flexibility within the rate group
structure.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Introduce `esw_qos_create_group_sched_elem` to handle the creation of
group scheduling elements for E-Switch QoS, Transmit Scheduling
Arbiter (TSAR).
This reduces duplication and simplifies code for TSAR setup.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Static analysis on linux-next has detected the following issue
in function virtnet_stats_ctx_init, in drivers/net/virtio_net.c :
if (vi->device_stats_cap & VIRTIO_NET_STATS_TYPE_CVQ) {
queue_type = VIRTNET_Q_TYPE_CQ;
ctx->bitmap[queue_type] |= VIRTIO_NET_STATS_TYPE_CVQ;
ctx->desc_num[queue_type] += ARRAY_SIZE(virtnet_stats_cvq_desc);
ctx->size[queue_type] += sizeof(struct virtio_net_stats_cvq);
}
ctx->bitmap is declared as a u32 however it is being bit-wise or'd with
VIRTIO_NET_STATS_TYPE_CVQ and this is defined as 1 << 32:
include/uapi/linux/virtio_net.h:#define VIRTIO_NET_STATS_TYPE_CVQ (1ULL << 32)
..and hence the bit-wise or operation won't set any bits in ctx->bitmap
because 1ULL < 32 is too wide for a u32.
In fact, the field is read into a u64:
u64 offset, bitmap;
....
bitmap = ctx->bitmap[queue_type];
so to fix, it is enough to make bitmap an array of u64.
Fixes: 941168f8b40e5 ("virtio_net: support device stats")
Reported-by: "Colin King (gmail)" <colin.i.king@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/53e2bd6728136d5916e384a7840e5dc7eebff832.1729099611.git.mst@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Some workloads hit the infamous dev_watchdog() message:
"NETDEV WATCHDOG: eth0 (xxxx): transmit queue XX timed out"
It seems possible to hit this even for perfectly normal
BQL enabled drivers:
1) Assume a TX queue was idle for more than dev->watchdog_timeo
(5 seconds unless changed by the driver)
2) Assume a big packet is sent, exceeding current BQL limit.
3) Driver ndo_start_xmit() puts the packet in TX ring,
and netdev_tx_sent_queue() is called.
4) QUEUE_STATE_STACK_XOFF could be set from netdev_tx_sent_queue()
before txq->trans_start has been written.
5) txq->trans_start is written later, from netdev_start_xmit()
if (rc == NETDEV_TX_OK)
txq_trans_update(txq)
dev_watchdog() running on another cpu could read the old
txq->trans_start, and then see QUEUE_STATE_STACK_XOFF, because 5)
did not happen yet.
To solve the issue, write txq->trans_start right before one XOFF bit
is set :
- _QUEUE_STATE_DRV_XOFF from netif_tx_stop_queue()
- __QUEUE_STATE_STACK_XOFF from netdev_tx_sent_queue()
From dev_watchdog(), we have to read txq->state before txq->trans_start.
Add memory barriers to enforce correct ordering.
In the future, we could avoid writing over txq->trans_start for normal
operations, and rename this field to txq->xoff_start_time.
Fixes: bec251bc8b6a ("net: no longer stop all TX queues in dev_watchdog()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241015194118.3951657-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The variable wwan_rtnl_link_ops assign a *bigger* maxtype which leads to
a global out-of-bounds read when parsing the netlink attributes. Exactly
same bug cause as the oob fixed in commit b33fb5b801c6 ("net: qualcomm:
rmnet: fix global oob in rmnet_policy").
==================================================================
BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:388 [inline]
BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x19d7/0x29a0 lib/nlattr.c:603
Read of size 1 at addr ffffffff8b09cb60 by task syz.1.66276/323862
CPU: 0 PID: 323862 Comm: syz.1.66276 Not tainted 6.1.70 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:284 [inline]
print_report+0x14f/0x750 mm/kasan/report.c:395
kasan_report+0x139/0x170 mm/kasan/report.c:495
validate_nla lib/nlattr.c:388 [inline]
__nla_validate_parse+0x19d7/0x29a0 lib/nlattr.c:603
__nla_parse+0x3c/0x50 lib/nlattr.c:700
nla_parse_nested_deprecated include/net/netlink.h:1269 [inline]
__rtnl_newlink net/core/rtnetlink.c:3514 [inline]
rtnl_newlink+0x7bc/0x1fd0 net/core/rtnetlink.c:3623
rtnetlink_rcv_msg+0x794/0xef0 net/core/rtnetlink.c:6122
netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508
netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline]
netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352
netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874
sock_sendmsg_nosec net/socket.c:716 [inline]
__sock_sendmsg net/socket.c:728 [inline]
____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499
___sys_sendmsg+0x21c/0x290 net/socket.c:2553
__sys_sendmsg net/socket.c:2582 [inline]
__do_sys_sendmsg net/socket.c:2591 [inline]
__se_sys_sendmsg+0x19e/0x270 net/socket.c:2589
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f67b19a24ad
RSP: 002b:00007f67b17febb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f67b1b45f80 RCX: 00007f67b19a24ad
RDX: 0000000000000000 RSI: 0000000020005e40 RDI: 0000000000000004
RBP: 00007f67b1a1e01d R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd2513764f R14: 00007ffd251376e0 R15: 00007f67b17fed40
</TASK>
The buggy address belongs to the variable:
wwan_rtnl_policy+0x20/0x40
The buggy address belongs to the physical page:
page:ffffea00002c2700 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb09c
flags: 0xfff00000001000(reserved|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000001000 ffffea00002c2708 ffffea00002c2708 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner info is not present (never set?)
Memory state around the buggy address:
ffffffff8b09ca00: 05 f9 f9 f9 05 f9 f9 f9 00 01 f9 f9 00 01 f9 f9
ffffffff8b09ca80: 00 00 00 05 f9 f9 f9 f9 00 00 03 f9 f9 f9 f9 f9
>ffffffff8b09cb00: 00 00 00 00 05 f9 f9 f9 00 00 00 00 f9 f9 f9 f9
^
ffffffff8b09cb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
According to the comment of `nla_parse_nested_deprecated`, use correct size
`IFLA_WWAN_MAX` here to fix this issue.
Fixes: 88b710532e53 ("wwan: add interface creation support")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241015131621.47503-1-linma@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
- There is no NFPROTO_IPV6 family for mark and NFLOG.
- TRACE is also missing module autoload with NFPROTO_IPV6.
This results in ip6tables failing to restore a ruleset. This issue has been
reported by several users providing incomplete patches.
Very similar to Ilya Katsnelson's patch including a missing chunk in the
TRACE extension.
Fixes: 0bfcb7b71e73 ("netfilter: xtables: avoid NFPROTO_UNSPEC where needed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Reported-by: Ilya Katsnelson <me@0upti.me>
Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Jijie Shao says:
====================
Add support of HIBMCGE Ethernet Driver
This patch set adds the support of Hisilicon BMC Gigabit Ethernet Driver.
This patch set includes basic Rx/Tx functionality. It also includes
the registration and interrupt codes.
This work provides the initial support to the HIBMCGE and
would incrementally add features or enhancements.
====================
Link: https://patch.msgid.link/20241015123516.4035035-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add myself as the maintainer for the hibmcge ethernet driver.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add a Makefile and update Kconfig to build hibmcge driver.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Implement the .get_drvinfo .get_link .get_link_ksettings to get
the basic information and working status of the driver.
Implement the .set_link_ksettings to modify the rate, duplex,
and auto-negotiation status.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Implement rx_poll function to read the rx descriptor after
receiving the rx interrupt. Adjust the skb based on the
descriptor to complete the reception of the packet.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Implement .ndo_start_xmit function to fill the information of the packet
to be transmitted into the tx descriptor, and then the hardware will
transmit the packet using the information in the tx descriptor.
In addition, we also implemented the tx_handler function to enable the
tx descriptor to be reused, and .ndo_tx_timeout function to print some
information when the hardware is busy.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Implement the .ndo_open() .ndo_stop() .ndo_set_mac_address()
and .ndo_change_mtu functions().
And .ndo_validate_addr calls the eth_validate_addr function directly
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The driver supports four interrupts: TX interrupt, RX interrupt,
mdio interrupt, and error interrupt.
Actually, the driver does not use the mdio interrupt.
Therefore, the driver does not request the mdio interrupt.
The error interrupt distinguishes different error information
by using different masks. To distinguish different errors,
the statistics count is added for each error.
To ensure the consistency of the code process, masks are added for the
TX interrupt and RX interrupt.
This patch implements interrupt request, and provides a
unified entry for the interrupt handler function. However,
the specific interrupt handler function of each interrupt
is not implemented currently.
Because of pcim_enable_device(), the interrupt vector
is already device managed and does not need to be free actively.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Implements the C22 read and write PHY registers interfaces.
Some hardware interfaces related to the PHY are also implemented
in this patch.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add support for to read and write registers through the pic bar space.
Some driver parameters, such as mac_id, are determined by the
board form. Therefore, these parameters are initialized
from the register as device specifications.
the device specifications register are initialized and written by bmc.
driver will read these registers when loading.
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add pci table supported in this module, and implement pci_driver function
to initialize this driver.
hibmcge is a passthrough network device. Its software runs
on the host side, and the MAC hardware runs on the BMC side
to reduce the host CPU area. The software interacts with the
MAC hardware through the PCIe.
┌─────────────────────────┐
│ HOST CPU network device │
│ ┌──────────────┐ │
│ │hibmcge driver│ │
│ └─────┬─┬──────┘ │
│ │ │ │
│HOST ┌───┴─┴───┐ │
│ │ PCIE RC │ │
└──────┴───┬─┬───┴────────┘
│ │
PCIE
│ │
┌──────┬───┴─┴───┬────────┐
│ │ PCIE EP │ │
│BMC └───┬─┬───┘ │
│ │ │ │
│ ┌────────┴─┴──────────┐ │
│ │ GE │ │
│ │ ┌─────┐ ┌─────┐ │ │
│ │ │ MAC │ │ MAC │ │ │
└─┴─┼─────┼────┼─────┼──┴─┘
│ PHY │ │ PHY │
└─────┘ └─────┘
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Aleksandr Mishin says:
====================
fsl/fman: Fix refcount handling of fman-related devices
The series is intended to fix refcount handling for fman-related "struct
device" objects - the devices are not released upon driver removal or in
the error paths during probe. This leads to device reference leaks.
The device pointers are now saved to struct mac_device and properly handled
in the driver's probe and removal functions.
Originally reported by Simon Horman <horms@kernel.org>
(https://lore.kernel.org/all/20240702133651.GK598357@kernel.org/)
Compile tested only.
====================
Link: https://patch.msgid.link/20241015060122.25709-1-amishin@t-argos.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In mac_probe() there are multiple calls to of_find_device_by_node(),
fman_bind() and fman_port_bind() which takes references to of_dev->dev.
Not all references taken by these calls are released later on error path
in mac_probe() and in mac_remove() which lead to reference leaks.
Add references release.
Fixes: 3933961682a3 ("fsl/fman: Add FMan MAC driver")
Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In mac_probe() there are calls to of_find_device_by_node() which takes
references to of_dev->dev. These references are not saved and not released
later on error path in mac_probe() and in mac_remove().
Add new fields into mac_device structure to save references taken for
future use in mac_probe() and mac_remove().
This is a preparation for further reference leaks fix.
Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Seems Alcatel Lucent G-010S-P also have the same problem that it uses
TX_FAULT pin for SOC uart. So apply sfp_fixup_ignore_tx_fault to it.
Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
Link: https://patch.msgid.link/TYCPR01MB84373677E45A7BFA5A28232C98792@TYCPR01MB8437.jpnprd01.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|