|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Unbreak br_netfilter physdev match support, from Florian Westphal.
2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian.
3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal.
4) Fix validation of catch-all set element.
5) Tighten requirements for catch-all set elements.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
netfilter: nf_tables: validate catch-all set elements
netfilter: nf_tables: fix ifdef to also consider nf_tables=m
netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT
netfilter: br_netfilter: fix recent physdev match breakage
====================
Link: https://lore.kernel.org/r/20230418145048.67270-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range()
must also properly handle allocation/mapping failures. In case of
failure, it must clean up the already-created metadata mappings and
return an error code, so that the error can be propagated to
ioremap_page_range().
Without doing so, KMSAN may silently fail to bring the metadata for the
page range into a consistent state, which will result in user-visible
crashes when trying to access them.
Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com
Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
Reviewed-by: Marco Elver <elver@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As reported by Dipanjan Das, when KMSAN is used together with kernel fault
injection (or, generally, even without the latter), calls to kcalloc() or
__vmap_pages_range_noflush() may fail, leaving the metadata mappings for
the virtual mapping in an inconsistent state. When these metadata
mappings are accessed later, the kernel crashes.
To address the problem, we return a non-zero error code from
kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
failure inside it, and make vmap_pages_range_noflush() return an error if
KMSAN fails to allocate the metadata.
This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
as these allocation failures are not fatal anymore.
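The resulting call flow, roughly (a simplified sketch of the propagation
described above, not necessarily the exact hunk):
```
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                             pgprot_t prot, struct page **pages,
                             unsigned int page_shift)
{
        int ret = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
                                                 page_shift);

        /* Propagate KMSAN metadata allocation/mapping failures. */
        if (ret)
                return ret;
        return __vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
}
```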
Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com
Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
Reviewed-by: Marco Elver <elver@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is unused and should not be used. In order to avoid limitations in
4-address mode, the driver should always use ieee80211_tx_status_ext for
802.3 frames with a valid sta pointer.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20230417133751.79160-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add netif_subqueue_completed_wake, complementing the subqueue versions
netif_subqueue_try_stop and netif_subqueue_maybe_stop.
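A minimal usage sketch in a TX completion path (hedged: the ring
accessors and the threshold name are illustrative, not part of the
patch):
```
static void my_tx_complete(struct net_device *ndev, struct my_ring *ring,
                           unsigned int pkts_compl, unsigned int bytes_compl)
{
        /* Report completed work and wake the subqueue once enough
         * descriptors are free again. */
        netif_subqueue_completed_wake(ndev, ring->index, pkts_compl,
                                      bytes_compl, my_tx_free_descs(ring),
                                      MY_WAKE_THRESHOLD);
}
```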
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
A catch-all set element might jump/goto to a chain that uses expressions
that require validation.
Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In order to not transmit (preemptible) frames which will be received by
the link partner as corrupted (because it doesn't support FP), the
hardware requires the driver to program the QSYS_PREEMPTION_CFG_P_QUEUES
register only after the MAC Merge layer becomes active (verification
succeeds, or was disabled).
There are some cases when FP is known (through experimentation) to be
broken. Give priority to FP over cut-through switching, and disable FP
for known broken link modes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This doesn't apply anything to hardware and in general doesn't do
anything that the software variant doesn't do, except for checking that
there isn't more than 1 TXQ per TC (TXQs for a DSA switch are a dubious
concept anyway). The reason we add this is to be able to parse one more
field added to struct tc_mqprio_qopt_offload, namely preemptible_tcs.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The MAC Merge IRQ of all ports is shared with the PTP TX timestamp IRQ
of all ports, which means that currently, when a PTP TX timestamp is
generated, felix_irq_handler() also polls for the MAC Merge layer status
of all ports, looking for changes. This makes the kernel do more work,
and under certain circumstances may make ptp4l require a
tx_timestamp_timeout argument higher than before.
Changes to the MAC Merge layer status are only to be expected under
certain conditions - its TX direction needs to be enabled - so we can
check early if that is the case, and omit register access otherwise.
Make ocelot_mm_update_port_status() skip register access if
mm->tx_enabled is unset, and also call it once more, outside IRQ
context, from ocelot_port_set_mm(), when mm->tx_enabled transitions from
true to false, because an IRQ is also expected in that case.
Also, a port may have its MAC Merge layer enabled but it may not have
generated the interrupt. In that case, there's no point in writing to
DEV_MM_STATUS to acknowledge that IRQ. We can reduce the number of
register writes per port with MM enabled by keeping an "ack" variable
which writes the "write-one-to-clear" bits. Those are 3 in number:
PRMPT_ACTIVE_STICKY, UNEXP_RX_PFRM_STICKY and UNEXP_TX_PFRM_STICKY.
The other fields in DEV_MM_STATUS are read-only and it doesn't matter
what is written to them, so writing zero is just fine.
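A sketch of the ack logic described above (the full macro names for the
three sticky bits are assumed here):
```
u32 val = ocelot_port_readl(ocelot_port, DEV_MM_STATUS);
u32 ack = val & (DEV_MM_STAT_MM_STATUS_PRMPT_ACTIVE_STICKY |
                 DEV_MM_STAT_MM_STATUS_UNEXP_RX_PFRM_STICKY |
                 DEV_MM_STAT_MM_STATUS_UNEXP_TX_PFRM_STICKY);

/* Only ports whose IRQ actually fired see a register write. */
if (ack)
        ocelot_port_writel(ocelot_port, ack, DEV_MM_STATUS);
```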
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Unfortunately, the workarounds for the hardware bugs make it pointless
to keep fine-grained locking for the MAC Merge state of each port.
Our vsc9959_cut_through_fwd() implementation requires
ocelot->fwd_domain_lock to be held, in order to serialize with changes
to the bridging domains and to port speed changes (which affect which
ports can be cut-through). Simultaneously, the traffic classes which can
be cut-through cannot be preemptible at the same time, and this will
depend on the MAC Merge layer state (which changes from threaded
interrupt context).
Since vsc9959_cut_through_fwd() would have to hold the mm->lock of all
ports for a correct and race-free implementation with respect to
ocelot_mm_irq(), in practice it means that any time a port's mm->lock is
held, it would potentially block holders of ocelot->fwd_domain_lock.
In the interest of simple locking rules, make all MAC Merge layer state
changes (and preemptible traffic class changes) be serialized by the
ocelot->fwd_domain_lock.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the switch emits an IRQ, we don't know what caused it, and we
iterate through all ports to check the MAC Merge status.
Move that iteration inside the ocelot lib; we will change the locking in
a future patch and it would be good to encapsulate that lock completely
within the ocelot lib.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend packet reformat types and flow table capabilities with
IPsec packet offload tunnel bits.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There are some use-cases where it is desirable to use bpf_redirect()
in combination with ifb device, which currently is not supported, for
example, around filtering inbound traffic with BPF to then push it to
ifb which holds the qdisc for shaping in contrast to doing that on the
egress device.
Toke mentions the following case related to OpenWrt:
Because there's not always a single egress on the other side. These are
mainly home routers, which tend to have one or more WiFi devices bridged
to one or more ethernet ports on the LAN side, and a single upstream WAN
port. And the objective is to control the total amount of traffic going
over the WAN link (in both directions), to deal with bufferbloat in the
ISP network (which is sadly still all too prevalent).
In this setup, the traffic can be split arbitrarily between the links
on the LAN side, and the only "single bottleneck" is the WAN link. So we
install both egress and ingress shapers on this, configured to something
like 95-98% of the true link bandwidth, thus moving the queues into the
qdisc layer in the router. It's usually necessary to set the ingress
bandwidth shaper a bit lower than the egress due to being "downstream"
of the bottleneck link, but it does work surprisingly well.
We usually use something like a matchall filter to put all ingress
traffic on the ifb, so doing the redirect from BPF has not been an
immediate requirement thus far. However, it does seem a bit odd that
this is not possible, and we do have a BPF-based filter that layers on
top of this kind of setup, which currently uses u32 as the ingress
filter and so it could presumably be improved to use BPF instead if
that was available.
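For illustration, a minimal tc BPF program for the setup described above
(hedged sketch: the ifb ifindex is an assumption; attach on the WAN
device's ingress):
```
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define IFB_IFINDEX 4 /* illustrative; resolve ifb0's real ifindex */

SEC("tc")
int ingress_to_ifb(struct __sk_buff *skb)
{
        /* Push ingress traffic to the ifb device holding the shaping
         * qdisc; flags == 0 redirects to the target's egress path. */
        return bpf_redirect(IFB_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";
```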
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
nftables can be built as a module, so fix the preprocessor conditional
accordingly.
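The pattern in question, roughly (simplified sketch, not the exact
hunk):
```
/* IS_ENABLED() is true for both CONFIG_NF_TABLES=y and =m, while a
 * plain #ifdef CONFIG_NF_TABLES misses the =m case. */
static inline void nf_reset_trace(struct sk_buff *skb)
{
#if IS_ENABLED(CONFIG_NF_TABLES)
        skb->nf_trace = 0;
#endif
}
```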
Fixes: 478b360a47b7 ("netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n")
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In the latest RFC9260, the Host Name Address param has been deprecated.
For INIT chunk:
Note 3: An INIT chunk MUST NOT contain the Host Name Address
parameter. The receiver of an INIT chunk containing a Host Name
Address parameter MUST send an ABORT chunk and MAY include an
"Unresolvable Address" error cause.
For Supported Address Types:
The value indicating the Host Name Address parameter MUST NOT be
used when sending this parameter and MUST be ignored when receiving
this parameter.
Currently Linux SCTP doesn't really support the Host Name Address param,
but only saves some flag and prints debug info, which actually won't
even be triggered due to the verification in sctp_verify_param().
This patch deletes that dead code.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We've managed to improve the UX for kptrs significantly over the last 9
months. All of the existing use cases which previously had KF_KPTR_GET
kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
have all been updated to be synchronized using RCU. In other words,
their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
maps in an RCU read region thanks to the types being RCU safe.
While KF_KPTR_GET was a logical starting point for kptrs, it's become
clear that they're not the correct abstraction. KF_KPTR_GET is a flag
that essentially does nothing other than enforcing that the argument to
a function is a pointer to a referenced kptr map value. At first glance,
that's a useful thing to guarantee to a kfunc. It gives kfuncs the
ability to try and acquire a reference on that kptr without requiring
the BPF prog to do something like this:
struct kptr_type *in_map, *new = NULL;

in_map = bpf_kptr_xchg(&map->value, NULL);
if (in_map) {
        new = bpf_kptr_type_acquire(in_map);
        in_map = bpf_kptr_xchg(&map->value, in_map);
        if (in_map)
                bpf_kptr_type_release(in_map);
}
That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
the only alternative, it's better than nothing. However, the problem
with any KF_KPTR_GET kfunc lies in the fact that it always requires some
kind of synchronization in order to safely do an opportunistic acquire
of the kptr in the map. This is because a BPF program running on another
CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
after it's been read by the KF_KPTR_GET kfunc. For example, the
now-removed bpf_task_kptr_get() kfunc did the following:
struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
{
        struct task_struct *p;

        rcu_read_lock();
        p = READ_ONCE(*pp);

        /* If p is non-NULL, it could still be freed by another CPU,
         * so we have to do an opportunistic refcount_inc_not_zero()
         * and return NULL if the task will be freed after the
         * current RCU read region.
         */
        if (p && !refcount_inc_not_zero(&p->rcu_users))
                p = NULL;
        rcu_read_unlock();

        return p;
}
In other words, the kfunc uses RCU to ensure that the task remains valid
after it's been peeked from the map. However, this is completely
redundant with just defining a KF_RCU kfunc that itself does a
refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
does.
So, the question of whether KF_KPTR_GET is useful is actually, "Are
there any synchronization mechanisms / safety flags that are required by
certain kptrs, but which are not provided by the verifier to kfuncs?"
The answer to that question today is "No", because every kptr we
currently care about is RCU protected.
Even if the answer ever became "yes", the proper way to support that
referenced kptr type would be to add support for whatever
synchronization mechanism it requires in the verifier, rather than
giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
in a map, do whatever you need to do."
With all that said -- so as to allow us to consolidate the kfunc API,
and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all
relevant logic from the verifier.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
All btf_fields in an object are 0-initialized by memset in
bpf_obj_init. This might not be a valid initial state for some field
types, in which case kfuncs that use the type will properly initialize
their input if it's been 0-initialized. Some BPF graph collection types
and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.
An earlier patch in this series added the bpf_refcount field, for which
the 0 state indicates that the refcounted object should be free'd.
bpf_obj_init treats this field specially, setting refcount to 1 instead
of relying on scattered "refcount is 0? Must have just been initialized,
let's set to 1" logic in kfuncs.
This patch extends this treatment to list and rbtree field types,
allowing most scattered initialization logic in kfuncs to be removed.
Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
they'll be 0-initialized without passing through the newly-added logic,
so scattered initialization logic must remain for these collection root
types.
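A paraphrased sketch of the centralized initialization (not the exact
kernel code; the helper name is illustrative):
```
static void obj_init_field_sketch(const struct btf_field *field, void *addr)
{
        /* addr was already zeroed by the memset in bpf_obj_init */
        switch (field->type) {
        case BPF_REFCOUNT:
                refcount_set((refcount_t *)addr, 1);
                break;
        case BPF_LIST_HEAD:
        case BPF_LIST_NODE:
                INIT_LIST_HEAD((struct list_head *)addr);
                break;
        case BPF_RB_NODE:
                RB_CLEAR_NODE((struct rb_node *)addr);
                break;
        default:
                break; /* the all-zero state is already valid */
        }
}
```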
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Consider this code snippet:
struct node {
        long key;
        bpf_list_node l;
        bpf_rb_node r;
        bpf_refcount ref;
};

int some_bpf_prog(void *ctx)
{
        struct node *n = bpf_obj_new(/*...*/), *m;

        bpf_spin_lock(&glock);

        bpf_rbtree_add(&some_tree, &n->r, /* ... */);
        m = bpf_refcount_acquire(n);
        bpf_rbtree_add(&other_tree, &m->r, /* ... */);

        bpf_spin_unlock(&glock);
        /* ... */
}
After bpf_refcount_acquire, n and m point to the same underlying memory,
and that node's bpf_rb_node field is being used by the some_tree insert,
so overwriting it as a result of the second insert is an error. In order
to properly support refcounted nodes, the rbtree and list insert
functions must be allowed to fail. This patch adds such support.
The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
return an int indicating success/failure, with 0 -> success, nonzero ->
failure.
bpf_obj_drop on failure
=======================
Currently the only reason an insert can fail is the example above: the
bpf_{list,rb}_node is already in use. When such a failure occurs, the
insert kfuncs will bpf_obj_drop the input node. This allows the insert
operations to logically fail without changing their verifier owning ref
behavior, namely the unconditional release_reference of the input
owning ref.
With insert that always succeeds, ownership of the node is always passed
to the collection, since the node always ends up in the collection.
With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
is always passed either to the collection (success), or to bpf_obj_drop
(failure). Regardless, it's correct to continue unconditionally
releasing the input owning ref, as something is always taking ownership
from the calling program on insert.
Keeping owning ref behavior unchanged results in a nice default UX for
insert functions that can fail. If the program's reaction to a failed
insert is "fine, just get rid of this owning ref for me and let me go
on with my business", then there's no reason to check for failure since
that's default behavior. e.g.:
long important_failures = 0;

int some_bpf_prog(void *ctx)
{
        struct node *n, *m, *o; /* all bpf_obj_new'd */

        bpf_spin_lock(&glock);
        bpf_rbtree_add(&some_tree, &n->node, /* ... */);
        bpf_rbtree_add(&some_tree, &m->node, /* ... */);

        if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
                important_failures++;
        }

        bpf_spin_unlock(&glock);
}
If we instead chose to pass ownership back to the program on failed
insert - by returning NULL on success or an owning ref on failure -
programs would always have to do something with the returned ref on
failure. The most likely action is probably "I'll just get rid of this
owning ref and go about my business", which ideally would look like:
if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
        bpf_obj_drop(n);
But bpf_obj_drop isn't allowed in a critical section and inserts must
occur within one, so in reality error handling would become a
hard-to-parse mess.
For refcounted nodes, we can replicate the "pass ownership back to
program on failure" logic with this patch's semantics, albeit in an ugly
way:
struct node *n = bpf_obj_new(/* ... */), *m;

bpf_spin_lock(&glock);

m = bpf_refcount_acquire(n);
if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
        /* Do something with m */
}

bpf_spin_unlock(&glock);
bpf_obj_drop(m);
bpf_refcount_acquire is used to simulate "return owning ref on failure".
This should be an uncommon occurrence, though.
Addition of two verifier-fixup'd args to collection inserts
===========================================================
The actual bpf_obj_drop kfunc is
bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
macro populating the second arg with 0 and the verifier later filling in
the arg during insn fixup.
Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
passed to bpf_obj_drop_impl.
Similarly, because the 'node' param to those insert functions is the
bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
pointer to the beginning of the node, the insert functions need to be
able to find the beginning of the node struct. A second
verifier-populated param is necessary: the offset of {list,rb}_node within the
node type.
These two new params allow the insert kfuncs to correctly call
__bpf_obj_drop_impl:
beginning_of_node = bpf_rb_node_ptr - offset
if (already_inserted)
        __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);
Similarly to other kfuncs with "hidden" verifier-populated params, the
insert functions are renamed with _impl prefix and a macro is provided
for common usage. For example, bpf_rbtree_add kfunc is now
bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
"hidden" args to 0.
Due to the two new args BPF progs will need to be recompiled to work
with the new _impl kfuncs.
This patch also rewrites the "hidden argument" explanation to more
directly say why the BPF program writer doesn't need to populate the
arguments with anything meaningful.
How does this new logic affect non-owning references?
=====================================================
Currently, non-owning refs are valid until the end of the critical
section in which they're created. We can make this guarantee because, if
a non-owning ref exists, the referent was added to some collection. The
collection will drop() its nodes when it goes away, but it can't go away
while our program is accessing it, so that's not a problem. If the
referent is removed from the collection in the same CS that it was added
in, it can't be bpf_obj_drop'd until after CS end. Those are the only
two ways to free the referent's memory and neither can happen until
after the non-owning ref's lifetime ends.
On first glance, having these collection insert functions potentially
bpf_obj_drop their input seems like it breaks the "can't be
bpf_obj_drop'd until after CS end" line of reasoning. But we care about
the memory not being _freed_ until the end of the CS, and a previous patch
in the series modified bpf_obj_drop such that it doesn't free refcounted
nodes until refcount == 0. So the statement can be more accurately
rewritten as "can't be free'd until after CS end".
We can prove that this rewritten statement holds for any non-owning
reference produced by collection insert functions:
* If the input to the insert function is _not_ refcounted
  * We have an owning reference to the input, and can conclude it isn't
    in any collection
    * Inserting a node in a collection turns owning refs into
      non-owning, and since our input type isn't refcounted, there's no
      way to obtain additional owning refs to the same underlying
      memory
  * Because our node isn't in any collection, the insert operation
    cannot fail, so bpf_obj_drop will not execute
  * If bpf_obj_drop is guaranteed not to execute, there's no risk of
    memory being free'd
* Otherwise, the input to the insert function is refcounted
  * If the insert operation fails due to the node's list_head or rb_root
    already being in some collection, there was some previous successful
    insert which passed refcount to the collection
  * We have an owning reference to the input, it must have been
    acquired via bpf_refcount_acquire, which bumped the refcount
  * refcount must be >= 2 since there's a valid owning reference and the
    node is already in a collection
  * Insert triggering bpf_obj_drop will decr refcount to >= 1, never
    resulting in a free
So although we may do bpf_obj_drop during the critical section, this
will never result in memory being free'd, and no changes to non-owning
ref logic are needed in this patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A local kptr is considered 'refcounted' when it is of a type that has a
bpf_refcount field. When such a kptr is created, its refcount should be
initialized to 1; when destroyed, the object should be free'd only if a
refcount decr results in 0 refcount.
Existing logic always frees the underlying memory when destroying a
local kptr, and 0-initializes all btf_record fields. This patch adds
checks for "is local kptr refcounted?" and new logic for that case in
the appropriate places.
This patch focuses on changing existing semantics and thus conspicuously
does _not_ provide a way for BPF programs to increment the refcount. That
follows later in the series.
__bpf_obj_drop_impl is modified to do the right thing when it sees a
refcounted type. Container types for graph nodes (list, tree, stashed in
map) are migrated to use __bpf_obj_drop_impl as a destructor for their
nodes instead of each having custom destruction code in their _free
paths. Now that "drop" isn't a synonym for "free" when the type is
refcounted it makes sense to centralize this logic.
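The resulting semantics, paraphrased as a sketch (not the exact kernel
code; refcount_off is the cached offset mentioned in the bpf_refcount
patch below, and the helper names should be treated as illustrative):
```
void obj_drop_sketch(void *p, const struct btf_record *rec)
{
        if (rec && btf_record_has_field(rec, BPF_REFCOUNT)) {
                refcount_t *ref = (refcount_t *)(p + rec->refcount_off);

                /* "drop" only frees on the last reference */
                if (!refcount_dec_and_test(ref))
                        return;
        }
        bpf_obj_free_fields(rec, p); /* run field destructors */
        bpf_mem_free(&bpf_global_ma, p);
}
```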
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-4-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
meant for use in BPF programs. Similarly to other opaque types like
bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in
user-defined struct types a bpf_refcount can be located, so necessary
btf_record plumbing is added to enable this. bpf_refcount is sized to
hold a refcount_t.
Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
btf_record as refcount_off in addition to being in the field array.
Caching refcount_off makes sense for this field because further patches
in the series will modify functions that take local kptrs (e.g.
bpf_obj_drop) to change their behavior if the type they're operating on
is refcounted. So enabling fast "is this type refcounted?" checks is
desirable.
No such verifier behavior changes are introduced in this patch, just
logic to recognize 'struct bpf_refcount' in btf_record.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The btf_field_offs struct contains (offset, size) for btf_record fields,
sorted by offset. btf_field_offs is always used in conjunction with
btf_record, which has btf_field 'fields' array with (offset, type), the
latter of which btf_field_offs' size is derived from via
btf_field_type_size.
This patch adds a size field to struct btf_field and sorts btf_record's
fields by offset, making it possible to get rid of btf_field_offs. Less
data duplication and less code complexity results.
Since btf_field_offs' lifetime closely followed the btf_record used to
populate it, most complexity wins are from removal of initialization
code like:
if (btf_record_successfully_initialized) {
        foffs = btf_parse_field_offs(rec);
        if (IS_ERR_OR_NULL(foffs))
                // free the btf_record and return err
}
Other changes in this patch are pretty mechanical:
* foffs->field_off[i] -> rec->fields[i].offset
* foffs->field_sz[i] -> rec->fields[i].size
* Sort rec->fields in btf_parse_fields before returning (see the sketch
after this list)
* It's possible that this is necessary independently of other
changes in this patch. btf_record_find in syscall.c expects
btf_record's fields to be sorted by offset, yet there's no
explicit sorting of them before this patch, record's fields are
populated in the order they're read from BTF struct definition.
BTF docs don't say anything about the sortedness of struct fields.
* All functions taking struct btf_field_offs * input now instead take
struct btf_record *. All callsites of these functions already have
access to the correct btf_record.
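For the sorting step flagged above, a sketch using the kernel's sort()
helper (call-site details are illustrative):
```
static int btf_field_cmp(const void *a, const void *b)
{
        const struct btf_field *f1 = a, *f2 = b;

        if (f1->offset < f2->offset)
                return -1;
        return f1->offset > f2->offset;
}

/* at the end of btf_parse_fields(): */
sort(rec->fields, rec->cnt, sizeof(rec->fields[0]), btf_field_cmp, NULL);
```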
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.
Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).
If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.
Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.
With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).
The CPU use of relevant functions decreases as expected:
page_pool_refill_alloc_cache 1.17% -> 0%
_raw_spin_lock 2.41% -> 0.98%
We only consider the lockless path to be safe when NAPI is scheduled
- in practice this should cover the majority, if not all, of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.
The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
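From a driver's point of view, opting in is a one-field change, roughly
(hedged sketch; everything except the .napi field is illustrative):
```
struct page_pool_params pp_params = {
        .order          = 0,
        .pool_size      = 1024,
        .nid            = NUMA_NO_NODE,
        .dev            = &pdev->dev,
        .dma_dir        = DMA_FROM_DEVICE,
        .napi           = &ring->napi, /* enables the lockless direct path */
};

ring->page_pool = page_pool_create(&pp_params);
```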
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add SDIO ids for use with the muRata 1YN (Cypress CYW43439).
The odd thing about this is that the previous 1YN, populated
on an M.2 card for evaluation purposes, had a BRCM SDIO vendor ID,
while the chip populated on real hardware has a Cypress one.
The device ID also differs between the two devices. But they
are both 43439 otherwise, so add the IDs for both.
On-device 1YN (43439), the new one, chip label reads "1YN":
```
/sys/.../mmc_host/mmc2/mmc2:0001 # cat vendor device
0x04b4
0xbd3d
```
EA M.2 evaluation board 1YN (43439), the old one, chip label reads "1YN ES1.4":
```
/sys/.../mmc_host/mmc0/mmc0:0001/# cat vendor device
0x02d0
0xa9a6
```
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Marek Vasut <marex@denx.de>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230407203752.128539-1-marex@denx.de
|
|
During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when MTU is set to a too large value, XDP is not allowed
to run.
Also, to prevent an MTU change from failing and leaving the NIC in a bad
state, pre-allocate all buffers before starting the change. So in a low
memory condition, it will return an error without affecting the NIC.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-04-11
1) Vlad adds support for linux bridge multicast offload
Patches #1 through #9
Synopsis
Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.
Steering architecture
Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:
+--------+--+
+---------------------------------------> Port 1 | |
| +-^------+--+
| |
| |
+-----------------------------------------+ | +---------------------------+ |
| EGRESS table | | +--> PORT 1 multicast table | |
+----------------------------------+ +-----------------------------------------+ | | +---------------------------+ |
| INGRESS table | | | | | | | |
+----------------------------------+ | dst_mac=P1,vlan=X -> pop vlan, goto P1 +--+ | | FG0: | |
| | | dst_mac=P1,vlan=Y -> pop vlan, goto P1 | | | src_port=dst_port -> drop | |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2 +--+ | | FG1: | |
| ... | | dst_mac=P2,vlan=Y -> goto P2 | | | | VLAN X -> pop, goto port | |
| | | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+ | ... | |
+----------------------------------+ | | | | | VLAN Y -> pop, goto port +-------+
+-----------------------------------------+ | | | FG3: |
| | | matchall -> goto port |
| | | |
| | +---------------------------+
| |
| |
| | +--------+--+
+---------------------------------------> Port 2 | |
| +-^------+--+
| |
| |
| +---------------------------+ |
+--> PORT 2 multicast table | |
+---------------------------+ |
| | |
| FG0: | |
| src_port=dst_port -> drop | |
| FG1: | |
| VLAN X -> pop, goto port | |
| ... | |
| | |
| FG3: | |
| matchall -> goto port +-------+
| |
+---------------------------+
Patches overview:
- Patch 1 adds hardware definition bits for capabilities required to replicate
multicast packets to multiple per-port tables. These bits are used by
following patches to only attempt multicast offload if firmware and hardware
provide necessary support.
- Patches 2-4 are preparations and refactoring.
- Patch 5 implements necessary infrastructure to toggle multicast offload
via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
This also enables IGMP and MLD snooping.
- Patch 6 implements per-port multicast replication tables. It only supports
filtering of loopback packets.
- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
VLANs.
- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
creates MDB replication rules in egress table that can replicate packets to
multiple per-port multicast tables.
- Patch 9 adds tracepoints for MDB events.
==============
2) Parav creates a new allocation profile for SFs, to save on memory
3) Yevgeny provides some initial patches for upcoming software steering
support for the new pattern/arguments type of modify_header actions.
Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only a limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.
As a preparation Yevgeny provides the following 4 patches:
- Patch 1: Add required mlx5_ifc HW bits
- Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
support and adds appropriate support in dr_send.c
- Patch 4: Add ICM pool for modify-header-pattern objects and implement
patterns cache, allowing patterns reuse for different flows
* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Add modify-header-pattern ICM pool
net/mlx5: DR, Prepare sending new WQE type
net/mlx5: Add new WQE for updating flow table
net/mlx5: Add mlx5_ifc bits for modify header argument
net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
net/mlx5: Create a new profile for SFs
net/mlx5: Bridge, add tracepoints for multicast
net/mlx5: Bridge, implement mdb offload
net/mlx5: Bridge, support multicast VLAN pop
net/mlx5: Bridge, add per-port multicast replication tables
net/mlx5: Bridge, snoop igmp/mld packets
net/mlx5: Bridge, extract code to lookup parent bridge of port
net/mlx5: Bridge, move additional data structures to priv header
net/mlx5: Bridge, increase bridge tables sizes
net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================
Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.
I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.
Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.
Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).
The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:
| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.
It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.
I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).
The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.
Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.
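A hedged sketch of how a driver might consume the resulting bitmap (the
private struct and hardware hook are illustrative names):
```
static int my_setup_mqprio(struct my_priv *priv,
                           struct tc_mqprio_qopt_offload *mqprio)
{
        int tc;

        for (tc = 0; tc < mqprio->qopt.num_tc; tc++) {
                bool preemptible = mqprio->preemptible_tcs & BIT(tc);

                /* express TCs go to the eMAC, preemptible TCs to the pMAC */
                my_hw_set_tc_fp(priv, tc, preemptible);
        }
        return 0;
}
```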
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.
Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.
Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.
In fact, the taprio handling is a bit more tricky, for 2 reasons.
First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().
The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.
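Driver-side usage then looks roughly like this (hedged sketch; the limit
and names are illustrative):
```
static int my_setup_tc_mqprio(struct net_device *ndev,
                              struct tc_mqprio_qopt_offload *mqprio)
{
        if (mqprio->qopt.num_tc > 4) {
                NL_SET_ERR_MSG_MOD(mqprio->extack,
                                   "Only up to 4 traffic classes supported");
                return -EOPNOTSUPP;
        }
        /* ... */
        return 0;
}
```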
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create a wrapper over __ethtool_dev_mm_supported() which also calls
ethnl_ops_begin() and ethnl_ops_complete(). It can be used by other code
layers, such as tc, to make sure that preemptible TCs are supported
(this is true if an underlying MAC Merge layer exists).
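In essence, something like this (sketch; the wrapper name and the
assumption that ethnl_ops_begin() returns 0 on success follow the
description above):
```
bool ethtool_dev_mm_supported(struct net_device *dev)
{
        bool supported;

        if (ethnl_ops_begin(dev))
                return false;
        supported = __ethtool_dev_mm_supported(dev);
        ethnl_ops_complete(dev);

        return supported;
}
```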
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The "u32 reg" argument that is passed to these functions is not a plain
address, but rather a driver-specific encoding: an enum
ocelot_target target in the upper bits, and an index into the
u32 ocelot->map[target][] array in the lower bits. That encoded value
takes the type "enum ocelot_reg" and is what is passed to these I/O
functions, so let's actually use that to prevent type confusion.
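The encoding itself looks roughly like this (sketch; the shift and mask
values are assumptions for illustration):
```
#define TARGET_OFFSET   24
#define REG_MASK        GENMASK(TARGET_OFFSET - 1, 0)

static u32 ocelot_addr_sketch(struct ocelot *ocelot, enum ocelot_reg reg)
{
        enum ocelot_target target = reg >> TARGET_OFFSET;

        return ocelot->map[target][reg & REG_MASK]; /* per-SoC offset */
}
```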
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
test_ksyms_module fails to emit a kfunc call targeting a module on
s390x, because the verifier stores the difference between kfunc
address and __bpf_call_base in bpf_insn.imm, which is s32, and modules
are roughly (1 << 42) bytes away from the kernel on s390x.
Fix by keeping BTF id in bpf_insn.imm for BPF_PSEUDO_KFUNC_CALLs,
and storing the absolute address in bpf_kfunc_desc.
Introduce bpf_jit_supports_far_kfunc_call() in order to limit this new
behavior to the s390x JIT. Otherwise other JITs need to be modified,
which is not desired.
Introduce bpf_get_kfunc_addr() instead of exposing both
find_kfunc_desc() and struct bpf_kfunc_desc.
In addition to sorting kfuncs by imm, also sort them by offset, in
order to handle conflicting imms from different modules. Do this on
all architectures in order to simplify code.
Factor out resolving specialized kfuncs (XDP and dynptr) from
fixup_kfunc_call(). This was required in the first place, because
fixup_kfunc_call() uses find_kfunc_desc(), which returns a const
pointer, so it's not possible to modify kfunc addr without stripping
const, which is not nice. It also removes repetition of code like:
if (bpf_jit_supports_far_kfunc_call())
        desc->addr = func;
else
        insn->imm = BPF_CALL_IMM(func);
and separates kfunc_desc_tab fixups from kfunc_call fixups.
Suggested-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230412230632.885985-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
====================
pull-request: bpf-next 2023-04-13
We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).
The main changes are:
1) Rework BPF verifier log behavior and implement it as a rotating log
by default with the option to retain old-style fixed log behavior,
from Andrii Nakryiko.
2) Adds support for using {FOU,GUE} encap with an ipip device operating
in collect_md mode and add a set of BPF kfuncs for controlling encap
params, from Christian Ehrig.
3) Allow BPF programs to detect at load time whether a particular kfunc
exists or not, and also add support for this in light skeleton,
from Alexei Starovoitov.
4) Optimize hashmap lookups when key size is multiple of 4,
from Anton Protopopov.
5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps, from David Vernet.
6) Add support for stashing local BPF kptr into a map value via
bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
for new cgroups, from Dave Marchevsky.
7) Fix BTF handling of is_int_ptr to skip modifiers to work around
tracing issues where a program cannot be attached, from Feng Zhou.
8) Migrate a big portion of test_verifier unit tests over to
test_progs -a verifier_* via inline asm to ease {read,debug}ability,
from Eduard Zingerman.
9) Several updates to the instruction-set.rst documentation
which is subject to future IETF standardization
(https://lwn.net/Articles/926882/), from Dave Thaler.
10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
known bits information propagation, from Daniel Borkmann.
11) Add skb bitfield compaction work related to BPF with the overall goal
to make more of the sk_buff bits optional, from Jakub Kicinski.
12) BPF selftest cleanups for build id extraction which stand on their
own, separate from the upcoming integration work of build id into struct
file object,
from Jiri Olsa.
13) Add fixes and optimizations for xsk descriptor validation and several
selftest improvements for xsk sockets, from Kal Conley.
14) Add BPF links for struct_ops and enable switching implementations
of BPF TCP cong-ctls under a given name by replacing backing
struct_ops map, from Kui-Feng Lee.
15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
offset stack read as earlier Spectre checks cover this,
from Luis Gerhorst.
16) Fix issues in copy_from_user_nofault() for BPF and other tracers
to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
and Alexei Starovoitov.
17) Add --json-summary option to test_progs in order for CI tooling to
ease parsing of test results, from Manu Bretelle.
18) Batch of improvements and refactoring to prep for upcoming
bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
from Martin KaFai Lau.
19) Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations,
from Quentin Monnet.
20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
the module name from BTF of the target and searching kallsyms of
the correct module, from Viktor Malik.
21) Improve BPF verifier handling of '<const> <cond> <non_const>'
to better detect whether in particular jmp32 branches are taken,
from Yonghong Song.
22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
A built-in cc or one from a kernel module is already able to write
to app_limited, from Yixin Shen.
Conflicts:
Documentation/bpf/bpf_devel_QA.rst
b7abcd9c656b ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
0f10f647f455 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/
include/net/ip_tunnels.h
bc9d003dc48c3 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
ac931d4cdec3d ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/
net/bpf/test_run.c
e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
294635a8165a ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================
Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Conflicts:
tools/testing/selftests/net/config
62199e3f1658 ("selftests: net: Add VXLAN MDB test")
3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, and bluetooth.
Not all that quiet given spring celebrations, but "current" fixes are
thinning out, which is encouraging. One outstanding regression in the
mlx5 driver when using old FW, not blocking but we're pushing for a
fix.
Current release - new code bugs:
- eth: enetc: workaround for unresponsive pMAC after receiving
express traffic
Previous releases - regressions:
- rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the
pid/seq fields 0 for backward compatibility
Previous releases - always broken:
- sctp: fix a potential overflow in sctp_ifwdtsn_skip
- mptcp:
- use mptcp_schedule_work instead of open-coding it and make the
worker check stricter, to avoid scheduling work on closed
sockets
- fix NULL pointer dereference on fastopen early fallback
- skbuff: fix memory corruption due to a race between skb coalescing
and releasing clones confusing page_pool reference counting
- bonding: fix neighbor solicitation validation on backup slaves
- bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp
- bpf: arm64: fixed a BTI error on returning to patched function
- openvswitch: fix race on port output leading to inf loop
- sfp: initialize sfp->i2c_block_size at sfp allocation to avoid
returning a different errno than expected
- phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove
- Bluetooth: fix printing errors if LE Connection times out
- Bluetooth: assorted UaF, deadlock and data race fixes
- eth: macb: fix memory corruption in extended buffer descriptor mode
Misc:
- adjust the XDP Rx flow hash API to also include the protocol layers
over which the hash was computed"
* tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg
mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type
veth: bpf_xdp_metadata_rx_hash add xdp rss hash type
mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type
xdp: rss hash types representation
selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters
skbuff: Fix a race between coalescing and releasing SKBs
net: macb: fix a memory corruption in extended buffer descriptor mode
selftests: add the missing CONFIG_IP_SCTP in net config
udp6: fix potential access to stale information
selftests: openvswitch: adjust datapath NL message declaration
selftests: mptcp: userspace pm: uniform verify events
mptcp: fix NULL pointer dereference on fastopen early fallback
mptcp: stricter state check in mptcp_worker
mptcp: use mptcp_schedule_work instead of open-coding it
net: enetc: workaround for unresponsive pMAC after receiving express traffic
sctp: fix a potential overflow in sctp_ifwdtsn_skip
net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume()
rtnetlink: Restore RTM_NEW/DELLINK notification behavior
net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes
...
|
|
Update the API for bpf_xdp_metadata_rx_hash() with an arg for the xdp rss
hash type, determined via a mapping table.
The mlx5 hardware can also identify and RSS hash IPSEC traffic. This
indicates the hash includes the SPI (Security Parameters Index) as part of
the IPSEC hash.
Extend xdp core enum xdp_rss_hash_type with IPSEC hash type.
Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The RSS hash type specifies what portion of packet data the NIC hardware
used when calculating the RSS hash value. The RSS types are focused on
Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often
gets hash value zero and no RSS type. L3 is focused on IPv4 vs. IPv6, and
L4 is primarily TCP vs. UDP, but some hardware supports SCTP.
Hardware RSS types are encoded differently by each hardware NIC. Most
hardware represents the RSS hash type as a number. Determining L3 vs. L4
often requires a mapping table, as there often isn't a pattern or sorting
according to OSI layer.
The patch introduces an XDP RSS hash type (enum xdp_rss_hash_type) that
contains both BITs for the L3/L4 types, and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits gets exposed
to BPF via BTF, and it is up to the BPF programmer to match using these
defines.
This proposal changes the kfunc API bpf_xdp_metadata_rx_hash(), adding
a pointer argument to provide the RSS hash type.
Change the signature of all xmo_rx_hash calls in drivers to make them
compile. The RSS type implementations for each driver come as separate
patches.
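A consumer-side sketch (hedged: the local enum mirror and its bit value
are assumptions; real programs take the definitions from vmlinux.h):
```
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* minimal local mirror of the bit this program tests */
enum xdp_rss_hash_type {
        XDP_RSS_L4 = 1 << 3,
};

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
                                    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int rx_hash_prog(struct xdp_md *ctx)
{
        enum xdp_rss_hash_type rss_type;
        __u32 hash;

        /* returns 0 on success; rss_type says what the hash covers */
        if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type) &&
            (rss_type & XDP_RSS_L4))
                bpf_printk("L4-covering hash 0x%x", hash);

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```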
Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Some drivers like iwlwifi might have per-STA queues, so we
may want to flush/drop just those queues rather than all
when removing a station. Add a separate method for that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The checks of whether or not a frame is bufferable were not
taking into account that some action frames aren't, such as
FTM. Check this, which requires some changes to the function
ieee80211_is_bufferable_mmpdu() since we need the whole skb
for the checks now.
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
WLAN_PUBLIC_ACTION_FTM_RESPONSE is duplicated with
WLAN_PUB_ACTION_FTM, but that might better be called
WLAN_PUB_ACTION_FTM_RESPONSE; clean up here.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Some platforms have dwmac4 implementations that have a different
address space layout than the default, resulting in the need to define
their own DMA/MTL offsets.
Extend the functions to allow a platform driver to indicate what its
addresses are, overriding the defaults.
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Brian Masney <bmasney@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The commits referenced below allow userspace to use the NLM_F_ECHO flag
for RTM_NEW/DELLINK operations to receive unicast notifications for the
affected link. Prior to these changes, applications may have relied on
multicast notifications to learn the same information without specifying
the NLM_F_ECHO flag.
For such applications, the mentioned commits changed the behavior for
requests not using NLM_F_ECHO. Multicast notifications are still received,
but now use the portid of the requester and the sequence number of the
request instead of zero values used previously. For the application, this
message may be unexpected and likely handled as a response to the
NLM_F_ACKed request, especially if it uses the same socket to handle
requests and notifications.
To fix existing applications relying on the old notification behavior,
set the portid and sequence number in the notification only if the
request included the NLM_F_ECHO flag. This restores the old behavior
for applications not using it, but allows unicasted notifications for
others.
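The core of the restored behavior, as a sketch (not the exact hunk):
```
u32 portid = 0, seq = 0;

if (nlh->nlmsg_flags & NLM_F_ECHO) {
        /* only echo requesters get their portid/seq in notifications */
        portid = NETLINK_CB(skb).portid;
        seq = nlh->nlmsg_seq;
}
/* ...build and send the RTM_NEWLINK/RTM_DELLINK notification using
 * these portid/seq values (zero for everyone else, as before)... */
```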
Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link")
Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create")
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add two new kfuncs that allow a BPF tc-hook, installed on an ipip
device in collect-metadata mode, to control FOU encap parameters on a
per-packet level. The set of kfuncs is registered with the fou module.
The bpf_skb_set_fou_encap kfunc is supposed to be used in tandem and after
a successful call to the bpf_skb_set_tunnel_key bpf-helper. UDP source and
destination ports can be controlled by passing a struct bpf_fou_encap. A
source port of zero will auto-assign a source port. enum bpf_fou_encap_type
is used to specify if the egress path should FOU or GUE encap the packet.
On the ingress path bpf_skb_get_fou_encap can be used to read UDP source
and destination ports from the receiver's point of view and allows for
packet multiplexing across different destination ports within a single
BPF program and ipip device.
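An egress-side sketch (hedged: the type and kfunc declarations are
written out here for self-containment and the port is illustrative; real
programs would pull the definitions from vmlinux.h):
```
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct bpf_fou_encap {
        __be16 sport;
        __be16 dport;
};

enum bpf_fou_encap_type { FOU_BPF_ENCAP_FOU, FOU_BPF_ENCAP_GUE };

extern int bpf_skb_set_fou_encap(struct __sk_buff *skb,
                                 struct bpf_fou_encap *encap,
                                 int type) __ksym;

SEC("tc")
int fou_encap_set(struct __sk_buff *skb)
{
        struct bpf_fou_encap encap = {
                .sport = 0,               /* 0 = auto-assign source port */
                .dport = bpf_htons(5555), /* illustrative port */
        };

        /* normally preceded by a successful bpf_skb_set_tunnel_key() */
        if (bpf_skb_set_fou_encap(skb, &encap, FOU_BPF_ENCAP_FOU) < 0)
                return TC_ACT_SHOT;

        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```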
Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>
Link: https://lore.kernel.org/r/e17c94a646b63e78ce0dbf3f04b2c33dc948a32d.1680874078.git.cehrig@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Today ipip devices in collect-metadata mode don't allow for sending FOU
or GUE encapsulated packets. This patch lifts the restriction by adding
a struct ip_tunnel_encap to the tunnel metadata.
On the egress path, the members of this struct can be set by the
bpf_skb_set_fou_encap kfunc via a BPF tc-hook. Instead of dropping packets
wishing to use additional UDP encapsulation, ip_md_tunnel_xmit now
evaluates the contents of this struct and adds the corresponding FOU or
GUE header. Furthermore, it makes sure that additional header bytes
are taken into account for PMTU discovery.
On the ingress path, an ipip device in collect-metadata mode will fill this
struct and a BPF tc-hook can obtain the information via a call to the
bpf_skb_get_fou_encap kfunc.
The minor change to ip_tunnel_encap, which now takes a pointer to
struct ip_tunnel_encap instead of struct ip_tunnel, allows us to control
FOU encap type and parameters on a per-packet level.
Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>
Link: https://lore.kernel.org/r/cfea47de655d0f870248abf725932f851b53960a.1680874078.git.cehrig@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add the SDIO vendor ID for Realtek and some device IDs extracted from
their GPL vendor driver. This will be useful in the future when the
rtw88 driver gains support for these chips.
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Ping-Ke Shih <pkshih@realtek.com>
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/20230405200729.632435-7-martin.blumenstingl@googlemail.com
|
|
Add new WQE type: FLOW_TBL_ACCESS, which will be used for
writing modify header arguments.
This type has a specific control segment and a special data segment.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add enum value for modify-header argument object and mlx5_bits
for the related capabilities.
Signed-off-by: Muhammad Sammar <muhammads@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Create a new profile for SFs in order to disable the command cache.
Each function's command cache consumes ~500KB of memory; when using a
large number of SFs, these savings are notable on memory-constrained
systems.
Use a new profile to provide for future differences between SFs and PFs.
The mr_cache is not used for non-PF functions, so it is excluded from the
new profile.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add the required hardware definitions to mlx5_ifc: fdb_uplink_hairpin,
fdb_multi_path_any_table_limit_regc, fdb_multi_path_any_table.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|