|
Now that each nexthop stores its region boundary in the multipath hash
function's output space, we can use hash-threshold instead of modulo-N
in multipath selection.
This reduces the number of checks we need to perform during lookup, as
dead and linkdown nexthops are assigned a negative region boundary. In
addition, in contrast to modulo-N, only flows near region boundaries are
affected when a nexthop is added or removed.
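As an illustration (not the actual kernel hunk), hash-threshold selection over
precomputed boundaries could look roughly like the sketch below; the nexthop
array and field names are hypothetical:

  struct nh_example {
          int upper_bound;        /* region boundary; -1 for dead/linkdown nexthops */
  };

  static int select_nexthop(const struct nh_example *nhs, int num, unsigned int hash)
  {
          int h = hash >> 1;      /* boundaries live in [-1, 2^31 - 1]: use the upper 31 bits */
          int i;

          for (i = 0; i < num; i++)
                  if (h <= nhs[i].upper_bound)    /* a negative boundary can never match */
                          return i;

          return -1;              /* no active nexthop */
  }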
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The hash thresholds assigned to IPv6 nexthops are in the range of
[-1, 2^31 - 1], where a negative value is assigned to nexthops that
should not be considered during multipath selection.
Therefore, in a similar fashion to IPv4, we need to use the upper
31-bits of the multipath hash for multipath selection.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Before we convert IPv6 to use hash-threshold instead of modulo-N, we
first need each nexthop to store its region boundary in the hash
function's output space.
The boundary is calculated by dividing the output space equally between
the different active nexthops. That is, nexthops that are not dead or
linkdown.
The boundaries are rebalanced whenever a nexthop is added to or removed from
a multipath route and whenever a nexthop becomes active or inactive.
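A rough sketch of such a rebalancing pass (hypothetical array and field names,
not the fib semantics code itself):

  #include <stdint.h>

  struct nh_example {
          int active;             /* 0 if dead or linkdown */
          int upper_bound;        /* region boundary; -1 if inactive */
  };

  static void rebalance_bounds(struct nh_example *nhs, int num)
  {
          int total = 0, seen = 0, i;

          for (i = 0; i < num; i++)
                  total += nhs[i].active;

          for (i = 0; i < num; i++) {
                  if (!nhs[i].active) {
                          nhs[i].upper_bound = -1;        /* never selected */
                          continue;
                  }
                  seen++;
                  /* the k-th active nexthop owns the first k/total of [0, 2^31 - 1] */
                  nhs[i].upper_bound = (int)((((uint64_t)seen << 31) / total) - 1);
          }
  }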
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
gcc-8 reports
net/caif/caif_usb.c: In function 'cfusbl_device_notify':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
be truncated copying 15 bytes from a string of length 15
[-Wstringop-truncation]
The compiler requires that the 'len' parameter passed to strncpy() be
greater than the length of the source string, so that the terminating '\0'
is copied as well. We can just use strlcpy() to avoid this warning.
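As a hedged illustration (the structure and name length below are hypothetical,
not the actual caif_usb.c hunk), the change boils down to:

  #include <linux/string.h>

  struct example_layer {
          char name[16];
  };

  static void example_set_name(struct example_layer *layer, const char *src)
  {
          /* was: strncpy(layer->name, src, sizeof(layer->name)); */
          strlcpy(layer->name, src, sizeof(layer->name));  /* always NUL-terminated */
  }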
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2018-01-08
Four patches from Or that add Hairpin support to mlx5:
===========================================================
From: Or Gerlitz <ogerlitz@mellanox.com>
We refer to the ability of the NIC HW to forward a packet received on one
port to the other port (or from a port back to itself) as hairpin. The
application API is based on ingress tc/flower rules set on the NIC with the
mirred redirect action. Other actions can apply to packets during the
redirect.
Hairpin allows offloading the data-path of various SW DDoS gateways,
load-balancers, etc. to HW. Packets go through all the required
processing in HW (header re-write, encap/decap, push/pop vlan) and are
then forwarded, while the CPU stays at practically zero usage. HW flow
counters are used by the control plane for monitoring and accounting.
Hairpin is implemented by pairing a receive queue (RQ) to a send queue (SQ).
All the flows that share the same <recv NIC, mirred NIC> pair are redirected
through the same hairpin pair. Currently, only header-rewrite is supported
as a packet modification action.
I'd like to thank Elijah Shakkour <elijahs@mellanox.com> for implementing
this functionality on a HW simulator before it was available in the FW, so
the driver code could be tested early.
===========================================================
From Feras, three patches that provide very small changes allowing IPoIB
to support RX timestamping for child interfaces, simply by hooking the mlx5e
timestamping PTP ioctl into the IPoIB child interface netdev profile.
One patch from Gal to fix a spelling mistake.
Two patches from Eugenia add drop counters to VF statistics, to be reported
as part of the VF statistics in netlink (iproute2), and implement them in
the mlx5 eswitch.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some sockopt handling functions were calculating the length of the
buffer to be written to userspace and then calculating it again when
actually writing the buffer, which could lead to some writes not using
an up-to-date length.
This patch updates such places to just make use of the len variable.
Also, replace some sizeof(type) usages with sizeof(var).
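A hedged sketch of the pattern (the handler, struct and names are hypothetical,
not the actual SCTP code):

  #include <linux/kernel.h>
  #include <linux/uaccess.h>

  struct example_params {
          int a;
          int b;
  };

  static int example_getsockopt(char __user *optval, int __user *optlen,
                                const struct example_params *params)
  {
          int len;

          if (get_user(len, optlen))
                  return -EFAULT;
          if (len < 0)
                  return -EINVAL;

          /* compute the length once, using sizeof(var) rather than sizeof(type) */
          len = min_t(int, len, sizeof(*params));

          /* ... and reuse that same 'len' for both the copy and the reported size */
          if (copy_to_user(optval, params, len))
                  return -EFAULT;
          if (put_user(len, optlen))
                  return -EFAULT;

          return 0;
  }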
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Hangbin Liu reported that some sockopt calls could cause the kernel to log
a warning on memory allocation failure if the user supplied a large optlen
value. That is because some of them called memdup_user() without a ceiling
on optlen, allowing it to try to allocate really large buffers.
This patch adds a ceiling by limiting optlen to the maximum allowed that
would still make sense for these sockopts.
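A hedged sketch of such a ceiling (the handler name and the limit are
hypothetical):

  #include <linux/err.h>
  #include <linux/kernel.h>
  #include <linux/slab.h>
  #include <linux/string.h>
  #include <linux/uaccess.h>

  #define EXAMPLE_MAX_OPTLEN      4096    /* hypothetical upper bound for this option */

  static int example_setsockopt(char __user *optval, unsigned int optlen)
  {
          void *buf;

          /* cap what we are willing to allocate, whatever optlen the user passed */
          optlen = min_t(unsigned int, optlen, EXAMPLE_MAX_OPTLEN);

          buf = memdup_user(optval, optlen);
          if (IS_ERR(buf))
                  return PTR_ERR(buf);

          /* ... parse 'buf' here ... */

          kfree(buf);
          return 0;
  }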
Reported-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
So replace it with GFP_USER and also add __GFP_NOWARN.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The newly added NF_FLOW_TABLE options cause some build failures in
randconfig kernels:
- when CONFIG_NF_CONNTRACK is disabled, or is a loadable module but
NF_FLOW_TABLE is built-in:
In file included from net/netfilter/nf_flow_table.c:8:0:
include/net/netfilter/nf_conntrack.h:59:22: error: field 'ct_general' has incomplete type
struct nf_conntrack ct_general;
include/net/netfilter/nf_conntrack.h: In function 'nf_ct_get':
include/net/netfilter/nf_conntrack.h:148:15: error: 'const struct sk_buff' has no member named '_nfct'
include/net/netfilter/nf_conntrack.h: In function 'nf_ct_put':
include/net/netfilter/nf_conntrack.h:157:2: error: implicit declaration of function 'nf_conntrack_put'; did you mean 'nf_ct_put'? [-Werror=implicit-function-declaration]
net/netfilter/nf_flow_table.o: In function `nf_flow_offload_work_gc':
(.text+0x1540): undefined reference to `nf_ct_delete'
- when CONFIG_NF_TABLES is disabled:
In file included from net/ipv6/netfilter/nf_flow_table_ipv6.c:13:0:
include/net/netfilter/nf_tables.h: In function 'nft_gencursor_next':
include/net/netfilter/nf_tables.h:1189:14: error: 'const struct net' has no member named 'nft'; did you mean 'nf'?
- when CONFIG_NF_FLOW_TABLE_INET is enabled, but NF_FLOW_TABLE_IPV4
or NF_FLOW_TABLE_IPV6 are not, or are loadable modules
net/netfilter/nf_flow_table_inet.o: In function `nf_flow_offload_inet_hook':
nf_flow_table_inet.c:(.text+0x94): undefined reference to `nf_flow_offload_ipv6_hook'
nf_flow_table_inet.c:(.text+0x40): undefined reference to `nf_flow_offload_ip_hook'
- when CONFIG_NF_FLOW_TABLES is disabled, but the other options are
enabled:
net/netfilter/nf_flow_table_inet.o: In function `nf_flow_offload_inet_hook':
nf_flow_table_inet.c:(.text+0x6c): undefined reference to `nf_flow_offload_ipv6_hook'
net/netfilter/nf_flow_table_inet.o: In function `nf_flow_inet_module_exit':
nf_flow_table_inet.c:(.exit.text+0x8): undefined reference to `nft_unregister_flowtable_type'
net/netfilter/nf_flow_table_inet.o: In function `nf_flow_inet_module_init':
nf_flow_table_inet.c:(.init.text+0x8): undefined reference to `nft_register_flowtable_type'
net/ipv4/netfilter/nf_flow_table_ipv4.o: In function `nf_flow_ipv4_module_exit':
nf_flow_table_ipv4.c:(.exit.text+0x8): undefined reference to `nft_unregister_flowtable_type'
net/ipv4/netfilter/nf_flow_table_ipv4.o: In function `nf_flow_ipv4_module_init':
nf_flow_table_ipv4.c:(.init.text+0x8): undefined reference to `nft_register_flowtable_type'
This adds additional Kconfig dependencies to ensure that NF_CONNTRACK and NF_TABLES
are always visible from NF_FLOW_TABLE, and that the internal dependencies between
the four new modules are met.
Fixes: 7c23b629a808 ("netfilter: flow table support for the mixed IPv4/IPv6 family")
Fixes: 0995210753a2 ("netfilter: flow table support for IPv6")
Fixes: 97add9f0d66d ("netfilter: flow table support for IPv4")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2018-01-09
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Prevent out-of-bounds speculation in BPF maps by masking the
index after bounds checks in order to fix spectre v1, and
add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
removing the BPF interpreter from the kernel in favor of
JIT-only mode to make spectre v2 harder, from Alexei.
2) Remove false sharing of map refcount with max_entries which
was used in spectre v1, from Daniel.
3) Add a missing NULL psock check in sockmap in order to fix
a race, from John.
4) Fix test_align BPF selftest case since a recent change in
verifier rejects the bit-wise arithmetic on pointers
earlier but test_align update was missing, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It allows matching packets based on Segment Routing Header
(SRH) information.
The implementation considers revision 7 of the SRH draft.
https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-07
Currently supported match options include:
(1) Next Header
(2) Hdr Ext Len
(3) Segments Left
(4) Last Entry
(5) Tag value of SRH
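A hedged sketch of what the match data could carry, one field per option above
(field names are illustrative, not a copy of the real uapi header):

  #include <linux/types.h>

  struct ip6t_srh_example {
          __u8    next_hdr;       /* (1) Next Header */
          __u8    hdr_len;        /* (2) Hdr Ext Len */
          __u8    segs_left;      /* (3) Segments Left */
          __u8    last_entry;     /* (4) Last Entry */
          __u16   tag;            /* (5) Tag value of SRH */
          __u8    mt_flags;       /* which of the above to match on */
          __u8    mt_invflags;    /* which matches to invert */
  };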
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
EEXIST is used for an object that already exists, with the same
name/handle. However, there is no such object here; instead, there is an
object that is using the single slot that is available for NAT hooks
since patch f92b40a8b264 ("netfilter: core: only allow one nat hook per
hook point"). Let's change this return value before this behaviour gets
exposed in the first -rc.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fixes the following sparse warning:
net/netfilter/core.c:380:6: warning:
symbol '__nf_unregister_net_hook' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fix a typo, we should check 'flowtable' instead of 'table'.
Fixes: 3b49e2e94e6e ("netfilter: nf_tables: add flow table netlink frontend")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
A typo causes module auto-load support to never be compiled in.
Fixes: 03d13b6868a2 ("netfilter: xtables: add and use xt_request_find_table_lock")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Remove the infrastructure to register/unregister nft_af_info structure,
this structure stores no useful information anymore.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Now that we have a single table list for each netns, we can get rid of
one pointer per family and the global afinfo list, thus shrinking the
netns structure for nftables, which now becomes 64 bytes smaller.
Also call __nft_release_afinfo() from the __net_exit path accordingly, to
release the netnamespace objects on removal.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Place all existing user defined tables in struct net *, instead of
having one list per family. This saves us from one level of indentation
in netlink dump functions.
Place a pointer to struct nft_af_info in struct nft_table temporarily, as
we still need it to put back the module reference counter on
table removal.
This patch comes in preparation for the removal of struct nft_af_info.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In nf_tables_chain_type_lookup(), pass the family number instead; this
comes in preparation for the removal of
struct nft_af_info.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nf_tables_table_enable() and nf_tables_table_disable() take a pointer to
struct nft_af_info that is never used; remove it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Replace it by a direct check for the netdev protocol family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
We already validate the hook through the bitmask, so this check is
superfluous. By removing it, this patch also fixes a bug in the
new flowtable codebase, since ctx->afi points to the table family
instead of the netdev family, which is where the flowtable is really
hooked in.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
We need to run xfrm_resolve_and_create_bundle() with
bottom halves off. Otherwise we may reuse an already
released dst_entry when the xfrm lookup functions are
called from process context.
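A hedged sketch of the idea (the local variables and the exact call below come
from a hypothetical surrounding lookup function, not the applied hunk):

  /* resolve the bundle with BHs off so a softirq on this CPU cannot free
   * or replace the per-cpu cached xdst underneath us */
  local_bh_disable();
  xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family, dst_orig);
  local_bh_enable();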
Fixes: c30d78c14a813db39a647b6a348b428 ("xfrm: add xdst pcpu cache")
Reported-by: Darius Ski <darius.ski@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Looks like commit e817f85652c1 ("xdp: generic XDP handling of
xdp_rxq_info") replaced kvfree(dev->_rx) in free_netdev() with
a call to netif_free_rx_queues() which doesn't actually free
the rings?
While at it remove the unnecessary temporary variable.
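A hedged sketch of the distinction (illustrative, not the exact free_netdev()
hunk):

  #include <linux/mm.h>
  #include <linux/netdevice.h>

  static void example_free_rx_queues(struct net_device *dev)
  {
          /* per-queue teardown (e.g. xdp_rxq_info unregistration) goes here */

          kvfree(dev->_rx);       /* the array itself was kvzalloc'ed and still needs kvfree() */
  }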
Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
kvzalloc'ed memory should be kvfree'd.
Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
We leak the allocated out_skb in case
pfkey_xfrm_policy2msg() fails. Fix this
by freeing it on error.
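A hedged sketch of the error path (the surrounding variables are illustrative
context, not the exact af_key hunk):

  err = pfkey_xfrm_policy2msg(out_skb, xp, dir);
  if (err < 0) {
          kfree_skb(out_skb);     /* previously leaked on this error path */
          return err;
  }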
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The BPF interpreter has been used as part of the Spectre variant 2 attack (CVE-2017-5715).
A quote from the Google Project Zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make the attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config
option that removes the interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of the JITed program is randomized and its code pages are marked read-only.
In addition, "constant blinding" can be turned on with net.core.bpf_jit_harden.
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
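A hedged sketch of the jit_init() idea mentioned above (initcall level and
naming are illustrative, not the landed hunk):

  #include <linux/init.h>

  extern int bpf_jit_enable;      /* still defined per-arch at this point */

  #ifdef CONFIG_BPF_JIT_ALWAYS_ON
  static int __init bpf_jit_always_on_init(void)
  {
          /* with the interpreter compiled out, the JIT must always be enabled */
          bpf_jit_enable = 1;
          return 0;
  }
  late_initcall(bpf_jit_always_on_init);
  #endif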
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
The current criteria for returning POLLOUT from a group member socket is
too simplistic. It basically returns POLLOUT as soon as the group has
external destinations, something obviously leading to a lot of spinning
during destination congestion situations. At the same time, the internal
congestion handling is unnecessarily complex.
We now change this as follows.
- We introduce an 'open' flag in struct tipc_group. This flag is used
only to help poll() get the setting of POLLOUT right, and *not* for
congestion handling as such. This means that a user can choose to
ignore an EAGAIN for a destination and go on sending messages to
other destinations in the group if he wants to.
- The flag is set to false every time we return EAGAIN on a send call.
- The flag is set to true every time any member, i.e., not necessarily
the member that caused EAGAIN, is removed from the small_win list.
- We remove the group member 'usr_pending' flag. The size of the send
window and presence in the 'small_win' list is sufficient criteria
for recognizing congestion.
This solution seems to be a reasonable compromise between 'anycast',
which is normally not waiting for POLLOUT for a specific destination,
and the other three send modes, which are.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a member joins a group, it also indicates a binding scope. This
makes it possible to create both node local groups, invisible to other
nodes, as well as cluster global groups, visible everywhere.
In order to avoid that different members end up having permanently
differing views of group size and membership, we must inhibit locally
and globally bound members from joining the same group.
We do this by using the binding scope as an additional separator between
groups. I.e., a member must ignore all membership events from sockets
using a different scope than itself, and all lookups for message
destinations must require an exact match between the message's lookup
scope and the potential target's binding scope.
Apart from making it possible to create local groups using the same
identity on different nodes, a side effect of this is that it now also
becomes possible to create a cluster global group with the same identity
across the same nodes, without interfering with the local groups.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, when a user is subscribing for binding table publications,
he will receive a PUBLISH event for all already existing matching items
in the binding table.
However, a group socket making a subscription doesn't need this initial
status update from the binding table, because it has already scanned it
during the join operation. Worse, the multiplicative effect of issuing
mutual events for dozens or hundreds of group members within a short time
frame puts a heavy load on the topology server, with the end result that
scale out operations on a big group tend to take much longer than needed.
We now add a new filter option, TIPC_SUB_NO_STATUS, for topology server
subscriptions, so that this initial avalanche of events is suppressed.
This change, along with the previous commit, significantly improves the
range and speed of group scale out operations.
We keep the new option internal for the tipc driver, at least for now.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a socket is joining a group, we look up in the binding table to
find if there are already other members of the group present. This is
used for being able to return EAGAIN instead of EHOSTUNREACH if the
user proceeds directly to a send attempt.
However, the information in the binding table can be used to directly
set the created member in state MBR_PUBLISHED and send a JOIN message
to the peer, instead of waiting for a topology PUBLISH event to do this.
When there are many members in a group, the propagation time for such
events can be significant, and we can save time during the join
operation if we use the initial lookup result fully.
In this commit, we eliminate the member state MBR_DISCOVERED, which has
been the result of the initial lookup, and instead go directly to
MBR_PUBLISHED, which initiates the setup.
After this change, the tipc_member FSM looks as follows:
+-----------+
---->| PUBLISHED |-----------------------------------------------+
PUB- +-----------+ LEAVE/WITHRAW |
LISH |JOIN |
| +-------------------------------------------+ |
| | LEAVE/WITHDRAW | |
| | +------------+ | |
| | +----------->| PENDING |---------+ | |
| | |msg/maxactv +-+---+------+ LEAVE/ | | |
| | | | | WITHDRAW | | |
| | | +----------+ | | | |
| | | |revert/maxactv| | | |
| | | V V V V V
| +----------+ msg +------------+ +-----------+
+-->| JOINED |------>| ACTIVE |------>| LEAVING |--->
| +----------+ +--- -+------+ LEAVE/+-----------+DOWN
| A A | WITHDRAW A A A EVT
| | | |RECLAIM | | |
| | |REMIT V | | |
| | |== adv +------------+ | | |
| | +---------| RECLAIMING |--------+ | |
| | +-----+------+ LEAVE/ | |
| | |REMIT WITHDRAW | |
| | |< adv | |
| |msg/ V LEAVE/ | |
| |adv==ADV_IDLE+------------+ WITHDRAW | |
| +-------------| REMITTED |------------+ |
| +------------+ |
|PUBLISH |
JOIN +-----------+ LEAVE/WITHDRAW |
---->| JOINING |-----------------------------------------------+
+-----------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After the changes in the previous commit the group LEAVE sequence
can be simplified.
We now let the arrival of a LEAVE message unconditionally issue a group
DOWN event to the user. When a topology WITHDRAW event is received, the
member, if it is still there, is set to state LEAVING, but we only issue a
group DOWN event when the link to the peer node is gone, so that no
LEAVE message is to be expected.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the current implementation, a group socket receiving topology
events about other members just converts the topology event message
into a group event message and stores it until it reaches the right
state to issue it to the user. This complicates the code unnecessarily,
and becomes impractical when, in the coming commits, we will need to
create and issue membership events independently.
In this commit, we change this so that we just notice the type and
origin of the incoming topology event, and then drop the buffer. Only
when it is time to actually send a group event to the user do we
explicitly create a new message and send it upwards.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Analysis reveals that the member state MBR_QUARANTINED is in reality
unnecessary, and can be replaced by the state MBR_JOINING at all
occurrences.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We handle a corner case in the function tipc_group_update_rcv_win().
During extreme pressure it might happen that a message receiver has all
its active senders in RECLAIMING or REMITTED mode, meaning that there
is nobody to reclaim advertisements from if an additional sender tries
to go active.
Currently we just set the new sender to ACTIVE anyway, hence at least
theoretically opening up for a receiver queue overflow by exceeding the
MAX_ACTIVE limit. The correct solution to this is to instead add the
member to the pending queue, while letting the oldest member in that
queue revert to JOINED state.
In this commit we refactor the code for handling message arrival from
a JOINED member, both to make it more comprehensible and to cover the
case described above.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
- We remove the 'reclaiming' member list in struct tipc_group, since
it doesn't serve any purpose.
- We simplify the GRP_REMIT_MSG branch of tipc_group_protocol_rcv().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0xffffffff,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problems.
In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.
syzkaller reported the following issue which is fixed by this commit:
WARNING: suspicious RCU usage
4.15.0-rc5+ #171 Not tainted
-----------------------------
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
#0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
#0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
#1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
#1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
#2: (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
#3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
#3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
__fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
expire_timers kernel/time/timer.c:1357 [inline]
__run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
__do_softirq+0x2d7/0xb85 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1cc/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:540 [inline]
smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
</IRQ>
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in
raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling
due to concurrent updates by reading this bit-field member into a local
variable and using the thus stabilized value in subsequent tests.
However, aforementioned commit also adds the (correct) comment that
/* hdrincl should be READ_ONCE(inet->hdrincl)
* but READ_ONCE() doesn't work with bit fields
*/
because as it stands, the compiler is free to shortcut or even eliminate
the local variable at its will.
Note that I have not seen anything like this happening in reality and thus,
the concern is a theoretical one.
However, in order to be on the safe side, emulate a READ_ONCE() on the
bit-field by doing it on the local 'hdrincl' variable itself:
int hdrincl = inet->hdrincl;
hdrincl = READ_ONCE(hdrincl);
This breaks the chain in the sense that the compiler is not allowed
to replace subsequent reads from hdrincl with reloads from inet->hdrincl.
Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
Signed-off-by: Nicolai Stange <nstange@suse.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a sanity check to ensure that all requested ring parameters
are within bounds, which should reduce errors in driver implementation.
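A hedged sketch of the kind of check meant here (not the actual ethtool core
hunk):

  #include <linux/ethtool.h>
  #include <linux/netdevice.h>

  static int example_check_ringparam(struct net_device *dev,
                                     const struct ethtool_ringparam *req)
  {
          struct ethtool_ringparam max = { .cmd = ETHTOOL_GRINGPARAM };

          dev->ethtool_ops->get_ringparam(dev, &max);

          if (req->rx_pending > max.rx_max_pending ||
              req->rx_mini_pending > max.rx_mini_max_pending ||
              req->rx_jumbo_pending > max.rx_jumbo_max_pending ||
              req->tx_pending > max.tx_max_pending)
                  return -EINVAL;

          return 0;
  }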
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
gcc-8 reports
net/caif/caif_dev.c: In function 'caif_enroll_dev':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
be truncated copying 15 bytes from a string of length 15
[-Wstringop-truncation]
net/caif/cfctrl.c: In function 'cfctrl_linkup_request':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
be truncated copying 15 bytes from a string of length 15
[-Wstringop-truncation]
net/caif/cfcnfg.c: In function 'caif_connect_client':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may
be truncated copying 15 bytes from a string of length 15
[-Wstringop-truncation]
The compiler requires that the 'len' parameter passed to strncpy() be
greater than the length of the source string, so that the terminating '\0'
is copied as well. We can just use strlcpy() to avoid this warning.
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the ARRAY_SIZE macro on the array seg6_action_table to determine the size of
the array. Improvement suggested by coccinelle.
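A small sketch of the pattern (hypothetical table, not the real
seg6_action_table):

  #include <linux/kernel.h>

  static const int example_table[] = { 10, 20, 30 };

  static int example_lookup(int val)
  {
          int i;

          /* ARRAY_SIZE() keeps the loop bound in sync with the array definition */
          for (i = 0; i < ARRAY_SIZE(example_table); i++)
                  if (example_table[i] == val)
                          return i;

          return -1;
  }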
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use AF_INET6 instead of AF_INET in IPv6-related code path
Signed-off-by: Andrii Vladyka <tulup@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
The GRO layer does not necessarily pull the complete headers
into the linear part of the skb, a part may remain on the
first page fragment. This can lead to a crash if we try to
pull the headers, so make sure we have them on the linear
part before pulling.
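A hedged sketch of the check (naming and the header length handling are
illustrative, not the actual esp GRO hunk):

  #include <linux/skbuff.h>

  static int example_pull_headers(struct sk_buff *skb, unsigned int hdrlen)
  {
          /* GRO may have left part of the headers in the first page fragment */
          if (!pskb_may_pull(skb, hdrlen))
                  return -EINVAL;

          __skb_pull(skb, hdrlen);        /* now safe: the headers are linear */
          return 0;
  }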
Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
Reported-by: syzbot+82bbd65569c49c6c0c4d@syzkaller.appspotmail.com
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Modern hardware can decide to drop packets going to/from a VF.
Add receive and transmit drop counters to be displayed at the hypervisor
layer in the iproute2 per-VF statistics.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Pull networking fixes from David Miller:
1) Frag and UDP handling fixes in i40e driver, from Amritha Nambiar and
Alexander Duyck.
2) Undo unintentional UAPI change in netfilter conntrack, from Florian
Westphal.
3) Revert a change to how error codes are returned from
dev_get_valid_name(), it broke some apps.
4) Cannot cache routes for ipv6 tunnels when the tunnel is ipv4/ipv6
dual-stack. From Eli Cooper.
5) Fix missed PMTU updates in geneve, from Xin Long.
6) Cure double free in macvlan, from Gao Feng.
7) Fix heap out-of-bounds write in rds_message_alloc_sgs(), from
Mohamed Ghannam.
8) FEC bug fixes from Fugang Duan (mis-accounting of dev_id, missed
deferral of probe when the regulator is not ready yet).
9) Missing DMA mapping error checks in 3c59x, from Neil Horman.
10) Turn off Broadcom tags for some b53 switches, from Florian Fainelli.
11) Fix OOPS when get_target_net() is passed an SKB whose NETLINK_CB()
isn't initialized. From Andrei Vagin.
12) Fix crashes in fib6_add(), from Wei Wang.
13) PMTU bug fixes in SCTP from Marcelo Ricardo Leitner.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
sh_eth: fix TXALCR1 offsets
mdio-sun4i: Fix a memory leak
phylink: mark expected switch fall-throughs in phylink_mii_ioctl
sctp: fix the handling of ICMP Frag Needed for too small MTUs
sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled
xen-netfront: enable device after manual module load
bnxt_en: Fix the 'Invalid VF' id check in bnxt_vf_ndo_prep routine.
bnxt_en: Fix population of flow_type in bnxt_hwrm_cfa_flow_alloc()
sh_eth: fix SH7757 GEther initialization
net: fec: free/restore resource in related probe error pathes
uapi/if_ether.h: prevent redefinition of struct ethhdr
ipv6: fix general protection fault in fib6_add()
RDS: null pointer dereference in rds_atomic_free_op
sh_eth: fix TSU resource handling
net: stmmac: enable EEE in MII, GMII or RGMII only
rtnetlink: give a user socket to get_target_net()
MAINTAINERS: Update my email address.
can: ems_usb: improve error reporting for error warning and error passive
can: flex_can: Correct the checking for frame length in flexcan_start_xmit()
can: gs_usb: fix return value of the "set_bittiming" callback
...
|
|
Preempt counter APIs have been split out; currently, hardirq.h just
includes the irq_enter/exit APIs, which are not used by TIPC at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Preempt counter APIs have been split out; currently, hardirq.h just
includes the irq_enter/exit APIs, which are not used by openvswitch at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: dev@openvswitch.org
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Preempt counter APIs have been split out; currently, hardirq.h just
includes the irq_enter/exit APIs, which are not used by caif at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|