Age | Commit message (Collapse) | Author |
|
This adds a small BPF helper similar to bpf_skb_load_bytes() that
is able to load relative to mac/net header offset from the skb's
linear data. Compared to bpf_skb_load_bytes(), it takes a fifth
argument namely start_header, which is either BPF_HDR_START_MAC
or BPF_HDR_START_NET. This allows for a more flexible alternative
compared to LD_ABS/LD_IND with negative offset. It's enabled for
tc BPF programs as well as sock filter program types where it's
mainly useful in reuseport programs to ease access to lower header
data.
Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2017-March/000698.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The main part of this work is to finally allow removal of LD_ABS
and LD_IND from the BPF core by reimplementing them through native
eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
keeping them around in native eBPF caused way more trouble than
actually worth it. To just list some of the security issues in
the past:
* fdfaf64e7539 ("x86: bpf_jit: support negative offsets")
* 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets")
* e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
* 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
* 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context")
* 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context")
For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
these days due to their limitations and more efficient/flexible
alternatives that have been developed over time such as direct
packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
register, the load happens in host endianness and its exception
handling can yield unexpected behavior. The latter is explained
in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by
div/mod by 0 exception") with similar cases of exceptions we had.
In native eBPF more recent program types will disable LD_ABS/LD_IND
altogether through may_access_skb() in verifier, and given the
limitations in terms of exception handling, it's also disabled
in programs that use BPF to BPF calls.
In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
to access packet data. It is not used in seccomp-BPF but programs
that use it for socket filtering or reuseport for demuxing with
cBPF. This is mostly relevant for applications that have not yet
migrated to native eBPF.
The main complexity and source of bugs in LD_ABS/LD_IND is coming
from their implementation in the various JITs. Most of them keep
the model around from cBPF times by implementing a fastpath written
in asm. They use typically two from the BPF program hidden CPU
registers for caching the skb's headlen (skb->len - skb->data_len)
and skb->data. Throughout the JIT phase this requires to keep track
whether LD_ABS/LD_IND are used and if so, the two registers need
to be recached each time a BPF helper would change the underlying
packet data in native eBPF case. At least in eBPF case, available
CPU registers are rare and the additional exit path out of the
asm written JIT helper makes it also inflexible since not all
parts of the JITer are in control from plain C. A LD_ABS/LD_IND
implementation in eBPF therefore allows to significantly reduce
the complexity in JITs with comparable performance results for
them, e.g.:
test_bpf tcpdump port 22 tcpdump complex
x64 - before 15 21 10 14 19 18
- after 7 10 10 7 10 15
arm64 - before 40 91 92 40 91 151
- after 51 64 73 51 62 113
For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
and cache the skb's headlen and data in the cBPF prologue. The
BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
used as a local temporary variable. This allows to shrink the
image on x86_64 also for seccomp programs slightly since mapping
to %rsi is not an ereg. In callee-saved R8 and R9 we now track
skb data and headlen, respectively. For normal prologue emission
in the JITs this does not add any extra instructions since R8, R9
are pushed to stack in any case from eBPF side. cBPF uses the
convert_bpf_ld_abs() emitter which probes the fast path inline
already and falls back to bpf_skb_load_helper_{8,16,32}() helper
relying on the cached skb data and headlen as well. R8 and R9
never need to be reloaded due to bpf_helper_changes_pkt_data()
since all skb access in cBPF is read-only. Then, for the case
of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
does neither cache skb data and headlen nor has an inlined fast
path. The reason for the latter is that native eBPF does not have
any extra registers available anyway, but even if there were, it
avoids any reload of skb data and headlen in the first place.
Additionally, for the negative offsets, we provide an alternative
bpf_skb_load_bytes_relative() helper in eBPF which operates
similarly as bpf_skb_load_bytes() and allows for more flexibility.
Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
is that the eBPF tests from test_bpf module do not go via BPF verifier
and therefore any instruction rewrites from verifier cannot take place.
Therefore, move them into test_verifier which runs out of user space,
so that verfier can rewrite LD_ABS/LD_IND internally in upcoming patches.
It will have the same effect since runtime tests are also performed from
there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto
and keep it internal to core kernel.
Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
test_bpf.ko suite.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
No change in functionality, just remove the '__' prefix and replace it
with a 'bpf_' prefix instead. We later on add a couple of more helpers
for cBPF and keeping the scheme with '__' is suboptimal there.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Björn Töpel says:
====================
This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this patch set, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.
An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.
This new dedicated packet buffer area is call a UMEM. It consists of a
number of equally size frames and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.
The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user-space.
A new feature in this patch set is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.
How is then packets distributed between these two XSK? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.
AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed that uses SKBs
together with the generic XDP support and copies out the data to user
space. A fallback mode that works for any network device. On the other
hand, if the driver has support for XDP, it will be used by the AF_XDP
code to provide better performance, but there is still a copy of the
data into user space.
There is a xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
for this:
ethtool -N p3p2 rx-flow-hash udp4 fn
ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
action 16
Running the rxdrop benchmark in XDP_DRV mode can then be done
using:
samples/bpf/xdpsock -i p3p2 -q 16 -r -N
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.
We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for TR/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.
Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputing packets at full 40 Gbit/s line rate. The results are without
retpoline so that we can compare against previous numbers. With
retpoline, the AF_XDP numbers drop with between 10 - 15 percent.
AF_XDP performance 64 byte packets. Results from V2 in parenthesis.
Benchmark XDP_SKB XDP_DRV
rxdrop 2.9(3.0) 9.6(9.5)
txpush 2.6(2.5) NA*
l2fwd 1.9(1.9) 2.5(2.5) (TX using XDP_SKB in both cases)
AF_XDP performance 1500 byte packets:
Benchmark XDP_SKB XDP_DRV
rxdrop 2.1(2.2) 3.3(3.3)
l2fwd 1.4(1.4) 1.8(1.8) (TX using XDP_SKB in both cases)
* NA since we have no support for TX using the XDP_DRV infrastructure
in this patch set. This is for a future patch set since it involves
changes to the XDP NDOs. Some of this has been upstreamed by Jesper
Dangaard Brouer.
XDP performance on our system as a base line:
64 byte packets:
XDP stats CPU pps issue-pps
XDP-RX CPU 16 32.3(32.9)M 0
1500 byte packets:
XDP stats CPU pps issue-pps
XDP-RX CPU 16 3.3(3.3)M 0
Changes from V2:
* Fixed a race in XSKMAP map found by Will. The code has been
completely rearchitected and is now simpler, faster, and hopefully
also not racy. Please review and check if it holds.
If you would like to diff V2 against V3, you can find them here:
https://github.com/bjoto/linux/tree/af-xdp-v2-on-bpf-next
https://github.com/bjoto/linux/tree/af-xdp-v3-on-bpf-next
The structure of the patch set is as follows:
Patches 1-3: Basic socket and umem plumbing
Patches 4-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Statistics support with getsockopt()
Patch 15: Sample application
We based this patch set on bpf-next commit a3fe1f6f2ada ("tools:
bpftool: change time format for program 'loaded at:' information")
To do for this patch set:
* Syzkaller torture session being worked on
Post-series plan:
* Optimize performance
* Kernel selftest
* Kernel load module support of AF_XDP would be nice. Unclear how to
achieve this though since our XDP code depends on net/core.
* Support for AF_XDP sockets without an XPD program loaded. In this
case all the traffic on a queue should go up to the user space socket.
* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
XDP_PASS" for a tcpdump-like functionality.
* And of course getting to zero-copy support in small increments,
starting with TX then adding RX.
Thanks: Björn and Magnus
====================
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
It's possible for userspace to control attr->map_type. Sanitize it when
using it as an array index to prevent an out-of-bounds value being used
under speculation.
Found by smatch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.
To show-case a simple round-robin load-balancing between a set of
sockets in an xskmap, set the RR_LB compile time define option to 1 in
"xdpsock.h".
v2: The entries variable was calculated twice in {umem,xq}_nb_avail.
Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In this commit, a new getsockopt is added: XDP_STATISTICS. This is
used to obtain stats from the sockets.
v2: getsockopt now returns size of stats structure.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Here, Tx support is added. The user fills the Tx queue with frames to
be sent by the kernel, and let's the kernel know using the sendmsg
syscall.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The new dev_direct_xmit will be used by AF_XDP in later commits.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
a queue, where the user process can pass frames to be transmitted by
the kernel.
The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
process can ask the kernel to allocate a queue (ring buffer) and also
mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.
The queue is used to explicitly pass ownership of umem frames from the
kernel to user process. This will be used by the TX path to tell user
space that a certain frame has been transmitted and user space can use
it for something else, if it wishes.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This commit wires up the xskmap to XDP_SKB layer.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This commit wires up the xskmap to XDP_DRV layer.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.
Note that a socket that is bound to certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.
A socket can reside in multiple maps.
v3: Fixed race and simplified code.
v2: Removed one indirection in map lookup.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Here the actual receive functions of AF_XDP are implemented, that in a
later commit, will be called from the XDP layers.
There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).
A new XDP API, xdp_return_buff, is also introduced.
Adding xdp_return_buff, which is analogous to xdp_return_frame, but
acts upon an struct xdp_buff. The API will be used by AF_XDP in future
commits.
Support for the poll syscall is also implemented.
v2: xskq_validate_id did not update cons_tail.
The entries variable was calculated twice in xskq_nb_avail.
Squashed xdp_return_buff commit.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Here, the bind syscall is added. Binding an AF_XDP socket, means
associating the socket to an umem, a netdev and a queue index. This
can be done in two ways.
The first way, creating a "socket from scratch". Create the umem using
the XDP_UMEM_REG setsockopt and an associated fill queue with
XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
setsockopt. Call bind passing ifindex and queue index ("channel" in
ethtool speak).
The second way to bind a socket, is simply skipping the
umem/netdev/queue index, and passing another already setup AF_XDP
socket. The new socket will then have the same umem/netdev/queue index
as the parent so it will share the same umem. You must also set the
flags field in the socket address to XDP_SHARED_UMEM.
v2: Use PTR_ERR instead of passing error variable explicitly.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
a queue, where the kernel can pass completed Rx frames from the kernel
to user process.
The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
ask the kernel to allocate a queue (ring buffer) and also mmap it
(XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
The queue is used to explicitly pass ownership of umem frames from the
user process to the kernel. These frames will in a later patch be
filled in with Rx packet data by the kernel.
v2: Fixed potential crash in xsk_mmap.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In this commit the base structure of the AF_XDP address family is set
up. Further, we introduce the abilty register a window of user memory
to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
window is viewed by an AF_XDP socket as a set of equally large
frames. After a user memory registration all frames are "owned" by the
user application, and not the kernel.
v2: More robust checks on umem creation and unaccount on error.
Call set_page_dirty_lock on cleanup.
Simplified xdp_umem_reg.
Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Syzbot has reported that it can hit a NULL pointer dereference in
wb_workfn() due to wb->bdi->dev being NULL. This indicates that
wb_workfn() was called for an already unregistered bdi which should not
happen as wb_shutdown() called from bdi_unregister() should make sure
all pending writeback works are completed before bdi is unregistered.
Except that wb_workfn() itself can requeue the work with:
mod_delayed_work(bdi_wq, &wb->dwork, 0);
and if this happens while wb_shutdown() is waiting in:
flush_delayed_work(&wb->dwork);
the dwork can get executed after wb_shutdown() has finished and
bdi_unregister() has cleared wb->bdi->dev.
Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
the necessary precautions against racing with bdi unregistration.
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
CC: Tejun Heo <tj@kernel.org>
Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
Reported-by: syzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When commit [1] was added, SGID was queried to derive the SMAC address.
Then, later on during a refactor [2], SMAC was no longer needed. However,
the now useless GID query remained. Then during additional code changes
later on, the GID query was being done in such a way that it caused iWARP
queries to start breaking. Remove the useless GID query and resolve the
iWARP breakage at the same time.
This is discussed in [3].
[1] commit dd5f03beb4f7 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
[2] commit 5c266b2304fb ("IB/cm: Remove the usage of smac and vid of qp_attr and cm_av")
[3] https://www.spinics.net/lists/linux-rdma/msg63951.html
Suggested-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
When the kernel was compiled using the UBSAN option,
we saw the following stack trace:
[ 1184.827917] UBSAN: Undefined behaviour in drivers/infiniband/hw/mlx4/mr.c:349:27
[ 1184.828114] signed integer overflow:
[ 1184.828247] -2147483648 - 1 cannot be represented in type 'int'
The problem was caused by calling round_up in procedure
mlx4_ib_umem_calc_optimal_mtt_size (on line 349, as noted in the stack
trace) with the second parameter (1 << block_shift) (which is an int).
The second parameter should have been (1ULL << block_shift) (which
is an unsigned long long).
(1 << block_shift) is treated by the compiler as an int (because 1 is
an integer).
Now, local variable block_shift is initialized to 31.
If block_shift is 31, 1 << block_shift is 1 << 31 = 0x80000000=-214748368.
This is the most negative int value.
Inside the round_up macro, there is a cast applied to ((1 << 31) - 1).
However, this cast is applied AFTER ((1 << 31) - 1) is calculated.
Since (1 << 31) is treated as an int, we get the negative overflow
identified by UBSAN in the process of calculating ((1 << 31) - 1).
The fix is to change (1 << block_shift) to (1ULL << block_shift) on
line 349.
Fixes: 9901abf58368 ("IB/mlx4: Use optimal numbers of MTT entries")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
When IRQ affinity is set and the interrupt type is unknown, a cpu
mask allocated within the function is never freed. Fix this memory
leak by allocating memory within the scope where it is used.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
When allocating device data, if there's an allocation failure, the
already allocated memory won't be freed such as per-cpu counters.
Fix memory leaks in exception path by creating a common reentrant
clean up function hfi1_clean_devdata() to be used at driver unload
time and device data allocation failure.
To accomplish this, free_platform_config() and clean_up_i2c() are
changed to be reentrant to remove dependencies when they are called
in different order. This helps avoid NULL pointer dereferences
introduced by this patch if those two functions weren't reentrant.
In addition, set dd->int_counter, dd->rcv_limit,
dd->send_schedule and dd->tx_opstats to NULL after they're freed in
hfi1_clean_devdata(), so that hfi1_clean_devdata() is fully reentrant.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
When an invalid num_vls is used as a module parameter, the code
execution follows an exception path where the macro dd_dev_err()
expects dd->pcidev->dev not to be NULL in hfi1_init_dd(). This
causes a NULL pointer dereference.
Fix hfi1_init_dd() by initializing dd->pcidev and dd->pcidev->dev
earlier in the code. If a dd exists, then dd->pcidev and
dd->pcidev->dev always exists.
BUG: unable to handle kernel NULL pointer dereference
at 00000000000000f0
IP: __dev_printk+0x15/0x90
Workqueue: events work_for_cpu_fn
RIP: 0010:__dev_printk+0x15/0x90
Call Trace:
dev_err+0x6c/0x90
? hfi1_init_pportdata+0x38d/0x3f0 [hfi1]
hfi1_init_dd+0xdd/0x2530 [hfi1]
? pci_conf1_read+0xb2/0xf0
? pci_read_config_word.part.9+0x64/0x80
? pci_conf1_write+0xb0/0xf0
? pcie_capability_clear_and_set_word+0x57/0x80
init_one+0x141/0x490 [hfi1]
local_pci_probe+0x3f/0xa0
work_for_cpu_fn+0x10/0x20
process_one_work+0x152/0x350
worker_thread+0x1cf/0x3e0
kthread+0xf5/0x130
? max_active_store+0x80/0x80
? kthread_bind+0x10/0x10
? do_syscall_64+0x6e/0x1a0
? SyS_exit_group+0x10/0x10
ret_from_fork+0x35/0x40
Cc: <stable@vger.kernel.org> # 4.9.x
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
AHG may be armed to use the stored header, which by design is limited
to edits in the PSN/A 32 bit word (bth2).
When the code is trying to send a BECN, the use of the stored header
will lose the BECN bit.
Fix by avoiding AHG when getting ready to send a BECN. This is
accomplished by always claiming the packet is not a middle packet which
is an AHG precursor. BECNs are not a normal case and this should not
hurt AHG optimizations.
Cc: <stable@vger.kernel.org> # 4.14.x
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The module parameter num_user_context is defined as 'int' and
defaults to -1. The module_param_named() says that it is uint.
Correct module_param_named() type information and update the modinfo
text to reflect the default value.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The code for handling a marked UD packet unconditionally returns the
dlid in the header of the FECN marked packet. This is not correct
for multicast packets where the DLID is in the multicast range.
The subsequent attempt to send the CNP with the multicast lid will
cause the chip to halt the ack send context because the source
lid doesn't match the chip programming. The send context will
be halted and flush any other pending packets in the pio ring causing
the CNP to not be sent.
A part of investigating the fix, it was determined that the 16B work
broke the FECN routine badly with inconsistent use of 16 bit and 32 bits
types for lids and pkeys. Since the port's source lid was correctly 32
bits the type mixmatches need to be dealt with at the same time as
fixing the CNP header issue.
Fix these issues by:
- Using the ports lid for as the SLID for responding to FECN marked UD
packets
- Insure pkey is always 16 bit in this and subordinate routines
- Insure lids are 32 bits in this and subordinate routines
Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 88733e3b8450 ("IB/hfi1: Add 16B UD support")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
syzbot reported a crash in tasklet_action_common() caused by dccp.
dccp needs to make sure socket wont disappear before tasklet handler
has completed.
This patch takes a reference on the socket when arming the tasklet,
and moves the sock_put() from dccp_write_xmit_timer() to dccp_write_xmitlet()
kernel BUG at kernel/softirq.c:514!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.17.0-rc3+ #30
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tasklet_action_common.isra.19+0x6db/0x700 kernel/softirq.c:515
RSP: 0018:ffff8801d9b3faf8 EFLAGS: 00010246
dccp_close: ABORT with 65423 bytes unread
RAX: 1ffff1003b367f6b RBX: ffff8801daf1f3f0 RCX: 0000000000000000
RDX: ffff8801cf895498 RSI: 0000000000000004 RDI: 0000000000000000
RBP: ffff8801d9b3fc40 R08: ffffed0039f12a95 R09: ffffed0039f12a94
dccp_close: ABORT with 65423 bytes unread
R10: ffffed0039f12a94 R11: ffff8801cf8954a3 R12: 0000000000000000
R13: ffff8801d9b3fc18 R14: dffffc0000000000 R15: ffff8801cf895490
FS: 0000000000000000(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2bc28000 CR3: 00000001a08a9000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tasklet_action+0x1d/0x20 kernel/softirq.c:533
__do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
dccp_close: ABORT with 65423 bytes unread
run_ksoftirqd+0x86/0x100 kernel/softirq.c:646
smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
kthread+0x345/0x410 kernel/kthread.c:238
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Code: 48 8b 85 e8 fe ff ff 48 8b 95 f0 fe ff ff e9 94 fb ff ff 48 89 95 f0 fe ff ff e8 81 53 6e 00 48 8b 95 f0 fe ff ff e9 62 fb ff ff <0f> 0b 48 89 cf 48 89 8d e8 fe ff ff e8 64 53 6e 00 48 8b 8d e8
RIP: tasklet_action_common.isra.19+0x6db/0x700 kernel/softirq.c:515 RSP: ffff8801d9b3faf8
Fixes: dc841e30eaea ("dccp: Extend CCID packet dequeueing interface")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Ursula Braun says:
====================
net/smc: fixes 2018/05/03
here are smc fixes for 2 problems:
* receive buffers in SMC must be registered. If registration fails
these buffers must not be kept within the link group for reuse.
Patch 1 is a preparational patch; patch 2 contains the fix.
* sendpage: do not hold the sock lock when calling kernel_sendpage()
or sock_no_sendpage()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The sendpage() call grabs the sock lock before calling the default
implementation - which tries to grab it once again.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com><
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When smc_wr_reg_send() fails then tag (regerr) the affected buffer and
free it in smc_buf_unuse().
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Consolidate the call to smc_wr_reg_send() in a new function.
No functional changes.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Trivial fix to spelling mistake in DP_NOTICE message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the I2C adapter that the PD controller is attached to
does not support SMBus protocol, the driver needs to handle
block reads separately. The first byte returned in block
read protocol will show the total number of bytes. It needs
to be stripped away.
This is handled separately in the driver only because right
now we have no way of requesting the used protocol with
regmap-i2c. This is in practice a workaround for what is
really a problem in regmap-i2c. The other option would have
been to register custom regmap, or not use regmap at all,
however, since the solution is very simple, I choose to use
it in this case for convenience. It is easy to remove once
we figure out how to handle this kind of cases in
regmap-i2c.
Fixes: 0a4c005bd171 ("usb: typec: driver for TI TPS6598x USB Power Delivery controllers")
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The ref count for the USB role switch device must be
released after we are done using the switch.
Fixes: c6962c29729c ("usb: typec: tcpm: Set USB role switch to device mode when configured as such")
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Petr Machata says:
====================
bridge: FDB: Notify about removal of non-user-added entries
Device drivers may generally need to keep in sync with bridge's FDB. In
particular, for its offload of tc mirror action where the mirrored-to
device is a gretap device, mlxsw needs to listen to a number of events,
FDB events among the others. SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE would be
a natural notification in that case.
However, for removal of FDB entries added due to device activity (as
opposed to explicit addition through "bridge fdb add" or similar), there
are no notifications.
Thus in patch #1, add the "added_by_user" field to switchdev
notifications sent for FDB activity. Adapt drivers to ignore activity on
non-user-added entries, to maintain the current behavior. Specifically
in case of mlxsw, allow mlxsw_sp_span_respin() call for any and all FDB
updates.
In patch #2, change the bridge driver to actually emit notifications for
these FDB entries. Take care not to send notification for bridge
updates that itself originate in SWITCHDEV_FDB_*_TO_BRIDGE events.
Changes from v1 to v2:
- Instead of introducing a new variant of fdb_delete(), add a new
parameter to the existing function.
- Name the parameter swdev_notify, not notify.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Do not automatically bail out on sending notifications about activity on
non-user-added FDB entries. Instead, notify about this activity except
for cases where the activity itself originates in a notification, to
avoid sending duplicate notifications.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The following patch enables sending notifications also for events on FDB
entries that weren't added by the user. Give the drivers the information
necessary to distinguish between the two origins of FDB entries.
To maintain the current behavior, have switchdev-implementing drivers
bail out on notifications about non-user-added FDB entries. In case of
mlxsw driver, allow a call to mlxsw_sp_span_respin() so that SPAN over
bridge catches up with the changed FDB.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Ido Schimmel says:
====================
mlxsw: Introduce support for CQEv1/2
Jiri says:
Current SwitchX2 and Spectrum FWs support CQEv0 and that is what we
implement in mlxsw. Spectrum FW also supports CQE v1 and v2.
However, Spectrum-2 won't support CQEv0. Prepare for it and setup the
CQE versions to use according to what is queried from FW.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Check number of CQEs for CQE version 2 reported by QUERY_AQ_CAP command.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use previously added resources to query FW support for multiple versions
of CQEs. Use the biggest version supported. For SDQs, it has no sense to
use version 2 as it does not introduce any new features, but it is
twice the size of CQE version 1.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce definitions of fields in CQE version 1 and 2. Also, introduce
common helpers that would call appropriate version-specific helpers
according to the version enum passed.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add resources that FW uses to report supported CQE versions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While handling netdevice events, br_device_event() sometimes uses
br_stp_(disable|enable)_port which unconditionally send a notification,
but then a second notification for the same event is sent at the end of
the br_device_event() function. To avoid sending duplicate notifications
in such cases, check if one has already been sent (i.e.
br_stp_enable/disable_port have been called).
The patch is based on a change by Satish Ashok.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Petr Machata says:
====================
selftests: forwarding: Updates to sysctl handling
Some selftests need to adjust sysctl settings. In order to be neutral to
the system that the test is run on, it is a good practice to change back
to the original setting after the test ends. That involves some
boilerplate that can be abstracted away.
In patch #1, introduce two functions, sysctl_set() and sysctl_restore().
The former stores the current value of a given setting, and sets a new
value. The latter restores the setting to the previously-stored value.
In patch #2, use these wrappers in a number of tests.
Additionally in patch #3, fix a problem in mirror_gre_nh.sh, which
neglected to set a sysctl that's crucial for the test to work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The test fails to work if reverse-path filtering is in effect on the
mirrored-to host interface, or for all interfaces.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of hand-managing the sysctl set and restore, use the wrappers
sysctl_set() and sysctl_restore() to do the bookkeeping automatically.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|