Age | Commit message (Collapse) | Author |
|
Some of the code paths calculating flow hash for IPv6 use flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between path. It's also incosistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.
Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().
Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018da06 ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"Misc bits and pieces not fitting into anything more specific"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: delete unnecessary assignment in vfs_listxattr
Documentation: filesystems: update filesystem locking documentation
vfs: namei: use path_equal() in follow_dotdot()
fs.h: fix outdated comment about file flags
__inode_security_revalidate() never gets NULL opt_dentry
make xattr_getsecurity() static
vfat: simplify checks in vfat_lookup()
get rid of dead code in d_find_alias()
it's SB_BORN, not MS_BORN...
msdos_rmdir(): kill BS comment
remove rpc_rmdir()
fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
"Christoph's proc_create_... cleanups series"
* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
xfs, proc: hide unused xfs procfs helpers
isdn/gigaset: add back gigaset_procinfo assignment
proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
tty: replace ->proc_fops with ->proc_show
ide: replace ->proc_fops with ->proc_show
ide: remove ide_driver_proc_write
isdn: replace ->proc_fops with ->proc_show
atm: switch to proc_create_seq_private
atm: simplify procfs code
bluetooth: switch to proc_create_seq_data
netfilter/x_tables: switch to proc_create_seq_private
netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
neigh: switch to proc_create_seq_data
hostap: switch to proc_create_{seq,single}_data
bonding: switch to proc_create_seq_data
rtc/proc: switch to proc_create_single_data
drbd: switch to proc_create_single
resource: switch to proc_create_seq_data
staging/rtl8192u: simplify procfs code
jfs: simplify procfs code
...
|
|
On arm64, ebt_entry_{match,watcher,target} structs are 40 bytes long
while on 32-bit arm these structs have a size of 36 bytes.
COMPAT_XT_ALIGN() macro cannot be used here to determine the necessary
padding for the CONFIG_COMPAT because it imposes an 8-byte boundary
alignment, condition that is not found in 32-bit ebtables application.
Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
There is mistake in the rt_mode_allow_non_local assignment.
It should be used to check if sending to non-local addresses is
allowed, now it checks if local addresses are allowed.
As local addresses are allowed for most of the cases, the only
places that are affected are for traffic to transparent cache
servers:
- bypass connections when cache server is not available
- related ICMP in FORWARD hook when sent to cache server
Fixes: 4a4739d56b00 ("ipvs: Pull out crosses_local_route_boundary logic")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nft_reject_br_send_v6_unreach
In order to allocate icmpv6 skb, sizeof(struct ipv6hdr) should be used.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
syzbot was able to trick af_packet again [1]
Various commits tried to address the problem in the past,
but failed to take into account V3 header size.
[1]
tpacket_rcv: packet too big, clamped from 72 to 4294967224. macoff=96
BUG: KASAN: use-after-free in prb_run_all_ft_ops net/packet/af_packet.c:1016 [inline]
BUG: KASAN: use-after-free in prb_fill_curr_block.isra.59+0x4e5/0x5c0 net/packet/af_packet.c:1039
Write of size 2 at addr ffff8801cb62000e by task kworker/1:2/2106
CPU: 1 PID: 2106 Comm: kworker/1:2 Not tainted 4.17.0-rc7+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
__asan_report_store2_noabort+0x17/0x20 mm/kasan/report.c:436
prb_run_all_ft_ops net/packet/af_packet.c:1016 [inline]
prb_fill_curr_block.isra.59+0x4e5/0x5c0 net/packet/af_packet.c:1039
__packet_lookup_frame_in_block net/packet/af_packet.c:1094 [inline]
packet_current_rx_frame net/packet/af_packet.c:1117 [inline]
tpacket_rcv+0x1866/0x3340 net/packet/af_packet.c:2282
dev_queue_xmit_nit+0x891/0xb90 net/core/dev.c:2018
xmit_one net/core/dev.c:3049 [inline]
dev_hard_start_xmit+0x16b/0xc10 net/core/dev.c:3069
__dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
neigh_resolve_output+0x679/0xad0 net/core/neighbour.c:1358
neigh_output include/net/neighbour.h:482 [inline]
ip6_finish_output2+0xc9c/0x2810 net/ipv6/ip6_output.c:120
ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
NF_HOOK_COND include/linux/netfilter.h:277 [inline]
ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
dst_output include/net/dst.h:444 [inline]
NF_HOOK include/linux/netfilter.h:288 [inline]
ndisc_send_skb+0x100d/0x1570 net/ipv6/ndisc.c:491
ndisc_send_ns+0x3c1/0x8d0 net/ipv6/ndisc.c:633
addrconf_dad_work+0xbef/0x1340 net/ipv6/addrconf.c:4033
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
The buggy address belongs to the page:
page:ffffea00072d8800 count:0 mapcount:-127 mapping:0000000000000000 index:0xffff8801cb620e80
flags: 0x2fffc0000000000()
raw: 02fffc0000000000 0000000000000000 ffff8801cb620e80 00000000ffffff80
raw: ffffea00072e3820 ffffea0007132d20 0000000000000002 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801cb61ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8801cb61ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8801cb620000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff8801cb620080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff8801cb620100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
Fixes: 2b6867c2ce76 ("net/packet: fix overflow in check for priv area size")
Fixes: dc808110bb62 ("packet: handle too big packets for PACKET_V3")
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, AF_XDP only supports a fixed frame-size memory scheme where
each frame is referenced via an index (idx). A user passes the frame
index to the kernel, and the kernel acts upon the data. Some NICs,
however, do not have a fixed frame-size model, instead they have a
model where a memory window is passed to the hardware and multiple
frames are filled into that window (referred to as the "type-writer"
model).
By changing the descriptor format from the current frame index
addressing scheme, AF_XDP can in the future be extended to support
these kinds of NICs.
In the index-based model, an idx refers to a frame of size
frame_size. Addressing a frame in the UMEM is done by offseting the
UMEM starting address by a global offset, idx * frame_size + offset.
Communicating via the fill- and completion-rings are done by means of
idx.
In this commit, the idx is removed in favor of an address (addr),
which is a relative address ranging over the UMEM. To convert an
idx-based address to the new addr is simply: addr = idx * frame_size +
offset.
We also stop referring to the UMEM "frame" as a frame. Instead it is
simply called a chunk.
To transfer ownership of a chunk to the kernel, the addr of the chunk
is passed in the fill-ring. Note, that the kernel will mask addr to
make it chunk aligned, so there is no need for userspace to do
that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
3000 to the fill-ring will refer to the same chunk.
On the completion-ring, the addr will match that of the Tx descriptor,
passed to the kernel.
Changing the descriptor format to use chunks/addr will allow for
future changes to move to a type-writer based model, where multiple
frames can reside in one chunk. In this model passing one single chunk
into the fill-ring, would potentially result in multiple Rx
descriptors.
This commit changes the uapi of AF_XDP sockets, and updates the
documentation.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Previously, rx_dropped could be updated incorrectly, e.g. if the XDP
program redirected the frame to a socket bound to a different queue
than where the XDP program was executing.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Previously the fill queue descriptor was not copied to kernel space
prior validating it, making it possible for userland to change the
descriptor post-kernel-validation.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Use the right device to determine if redirect should be sent especially
when using vrf. Same as well as when sending the redirect.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As Michal noted the flow struct takes both the flow label and priority.
Update the bpf_fib_lookup API to note that it is flowinfo and not just
the flow label.
Cc: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This is the first real user of the XDP_XMIT_FLUSH flag.
As pointed out many times, XDP_REDIRECT without using BPF maps is
significant slower than the map variant. This is primary due to the
lack of bulking, as the ndo_xdp_flush operation is required after each
frame (to avoid frames hanging on the egress device).
It is still possible to optimize this case. Instead of invoking two
NDO indirect calls, which are very expensive with CONFIG_RETPOLINE,
instead instruct ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.
The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Since the remaining bits are not filled in struct bpf_tunnel_key
resp. struct bpf_xfrm_state and originate from uninitialized stack
space, we should make sure to clear them before handing control
back to the program.
Also add a padding element to struct bpf_xfrm_state for future use
similar as we have in struct bpf_tunnel_key and clear it as well.
struct bpf_xfrm_state {
__u32 reqid; /* 0 4 */
__u32 spi; /* 4 4 */
__u16 family; /* 8 2 */
/* XXX 2 bytes hole, try to pack */
union {
__u32 remote_ipv4; /* 4 */
__u32 remote_ipv6[4]; /* 16 */
}; /* 12 16 */
/* size: 28, cachelines: 1, members: 4 */
/* sum members: 26, holes: 1, sum holes: 2 */
/* last cacheline: 28 bytes */
};
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a new bpf_skb_cgroup_id() helper that allows to retrieve the
cgroup id from the skb's socket. This is useful in particular to
enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in
cgroup v2 by allowing ID based matching on egress. This can in
particular be used in combination with applying policy e.g. from
map lookups, and also complements the older bpf_skb_under_cgroup()
interface. In user space the cgroup id for a given path can be
retrieved through the f_handle as demonstrated in [0] recently.
[0] https://lkml.org/lkml/2018/5/22/1190
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
ncsi_rsp_handler_gc() allocates the filter arrays using GFP_KERNEL in
softirq context, causing the below backtrace. This allocation is only a
few dozen bytes during probing so allocate with GFP_ATOMIC instead.
[ 42.813372] BUG: sleeping function called from invalid context at mm/slab.h:416
[ 42.820900] in_atomic(): 1, irqs_disabled(): 0, pid: 213, name: kworker/0:1
[ 42.827893] INFO: lockdep is turned off.
[ 42.832023] CPU: 0 PID: 213 Comm: kworker/0:1 Tainted: G W 4.13.16-01441-gad99b38 #65
[ 42.841007] Hardware name: Generic DT based system
[ 42.845966] Workqueue: events ncsi_dev_work
[ 42.850251] [<8010a494>] (unwind_backtrace) from [<80107510>] (show_stack+0x20/0x24)
[ 42.858046] [<80107510>] (show_stack) from [<80612770>] (dump_stack+0x20/0x28)
[ 42.865309] [<80612770>] (dump_stack) from [<80148248>] (___might_sleep+0x230/0x2b0)
[ 42.873241] [<80148248>] (___might_sleep) from [<80148334>] (__might_sleep+0x6c/0xac)
[ 42.881129] [<80148334>] (__might_sleep) from [<80240d6c>] (__kmalloc+0x210/0x2fc)
[ 42.888737] [<80240d6c>] (__kmalloc) from [<8060ad54>] (ncsi_rsp_handler_gc+0xd0/0x170)
[ 42.896770] [<8060ad54>] (ncsi_rsp_handler_gc) from [<8060b454>] (ncsi_rcv_rsp+0x16c/0x1d4)
[ 42.905314] [<8060b454>] (ncsi_rcv_rsp) from [<804d86c8>] (__netif_receive_skb_core+0x3c8/0xb50)
[ 42.914158] [<804d86c8>] (__netif_receive_skb_core) from [<804d96cc>] (__netif_receive_skb+0x20/0x7c)
[ 42.923420] [<804d96cc>] (__netif_receive_skb) from [<804de4b0>] (netif_receive_skb_internal+0x78/0x6a4)
[ 42.932931] [<804de4b0>] (netif_receive_skb_internal) from [<804df980>] (netif_receive_skb+0x78/0x158)
[ 42.942292] [<804df980>] (netif_receive_skb) from [<8042f204>] (ftgmac100_poll+0x43c/0x4e8)
[ 42.950855] [<8042f204>] (ftgmac100_poll) from [<804e094c>] (net_rx_action+0x278/0x4c4)
[ 42.958918] [<804e094c>] (net_rx_action) from [<801016a8>] (__do_softirq+0xe0/0x4c4)
[ 42.966716] [<801016a8>] (__do_softirq) from [<8011cd9c>] (do_softirq.part.4+0x50/0x78)
[ 42.974756] [<8011cd9c>] (do_softirq.part.4) from [<8011cebc>] (__local_bh_enable_ip+0xf8/0x11c)
[ 42.983579] [<8011cebc>] (__local_bh_enable_ip) from [<804dde08>] (__dev_queue_xmit+0x260/0x890)
[ 42.992392] [<804dde08>] (__dev_queue_xmit) from [<804df1f0>] (dev_queue_xmit+0x1c/0x20)
[ 43.000689] [<804df1f0>] (dev_queue_xmit) from [<806099c0>] (ncsi_xmit_cmd+0x1c0/0x244)
[ 43.008763] [<806099c0>] (ncsi_xmit_cmd) from [<8060dc14>] (ncsi_dev_work+0x2e0/0x4c8)
[ 43.016725] [<8060dc14>] (ncsi_dev_work) from [<80133dfc>] (process_one_work+0x214/0x6f8)
[ 43.024940] [<80133dfc>] (process_one_work) from [<80134328>] (worker_thread+0x48/0x558)
[ 43.033070] [<80134328>] (worker_thread) from [<8013ba80>] (kthread+0x130/0x174)
[ 43.040506] [<8013ba80>] (kthread) from [<80102950>] (ret_from_fork+0x14/0x24)
Fixes: 062b3e1b6d4f ("net/ncsi: Refactor MAC, VLAN filters")
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix to return error code -EINVAL instead of 0 if optlen is invalid.
Fixes: 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking fixes from David Miller:
1) Infinite loop in _decode_session6(), from Eric Dumazet.
2) Pass correct argument to nla_strlcpy() in netfilter, also from Eric
Dumazet.
3) Out of bounds memory access in ipv6 srh code, from Mathieu Xhonneux.
4) NULL deref in XDP_REDIRECT handling of tun driver, from Toshiaki
Makita.
5) Incorrect idr release in cls_flower, from Paul Blakey.
6) Probe error handling fix in davinci_emac, from Dan Carpenter.
7) Memory leak in XPS configuration, from Alexander Duyck.
8) Use after free with cloned sockets in kcm, from Kirill Tkhai.
9) MTU handling fixes fo ip_tunnel and ip6_tunnel, from Nicolas
Dichtel.
10) Fix UAPI hole in bpf data structure for 32-bit compat applications,
from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
bpf: fix uapi hole for 32 bit compat applications
net: usb: cdc_mbim: add flag FLAG_SEND_ZLP
ip6_tunnel: remove magic mtu value 0xFFF8
ip_tunnel: restore binding to ifaces with a large mtu
net: dsa: b53: Add BCM5389 support
kcm: Fix use-after-free caused by clonned sockets
net-sysfs: Fix memory leak in XPS configuration
ixgbe: fix parsing of TC actions for HW offload
net: ethernet: davinci_emac: fix error handling in probe()
net/ncsi: Fix array size in dumpit handler
cls_flower: Fix incorrect idr release when failing to modify rule
net/sonic: Use dma_mapping_error()
xfrm Fix potential error pointer dereference in xfrm_bundle_create.
vhost_net: flush batched heads before trying to busy polling
tun: Fix NULL pointer dereference in XDP redirect
be2net: Fix error detection logic for BE3
net: qmi_wwan: Add Netgear Aircard 779S
mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG
atm: zatm: fix memcmp casting
iwlwifi: pcie: compare with number of IRQs requested for, not number of CPUs
...
|
|
If there is a significant amount of chains list search is too slow, so
add an rhlist table for this.
This speeds up ruleset loading: for every new rule we have to check if
the name already exists in current generation.
We need to be able to cope with duplicate chain names in case a transaction
drops the nfnl mutex (for request_module) and the abort of this old
transaction is still pending.
The list is kept -- we need a way to iterate chains even if hash resize is
in progress without missing an entry.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This features which allows you to limit the maximum number of
connections per arbitrary key. The connlimit expression is stateful,
therefore it can be used from meters to dynamically populate a set, this
provides a mapping to the iptables' connlimit match. This patch also
comes that allows you define static connlimit policies.
This extension depends on the nf_conncount infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Before this patch, cloned expressions are released via ->destroy. This
is a problem for the new connlimit expression since the ->destroy path
drop a reference on the conntrack modules and it unregisters hooks. The
new ->destroy_clone provides context that this expression is being
released from the packet path, so it is mirroring ->clone(), where
neither module reference is dropped nor hooks need to be unregistered -
because this done from the control plane path from the ->init() path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Use garbage collector to schedule removal of elements based of feedback
from expression that this element comes with. Therefore, the garbage
collector is not guided by timeout expirations in this new mode.
The new connlimit expression sets on the NFT_EXPR_GC flag to enable this
behaviour, the dynset expression needs to explicitly enable the garbage
collector via set->ops->gc_init call.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nft_set_elem_destroy() can be called from call_rcu context. Annotate
netns and table in set object so we can populate the context object.
Moreover, pass context object to nf_tables_set_elem_destroy() from the
commit phase, since it is already available from there.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch provides an interface to maintain the list of connections and
the lookup function to obtain the number of connections in the list.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The new connlimit object needs this to properly deal with conntrack
dependencies.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The extracted functions will likely be usefull to implement tproxy
support in nf_tables.
Extrancted functions:
- nf_tproxy_sk_is_transparent
- nf_tproxy_laddr4
- nf_tproxy_handle_time_wait4
- nf_tproxy_get_sock_v4
- nf_tproxy_laddr6
- nf_tproxy_handle_time_wait6
- nf_tproxy_get_sock_v6
(nf_)tproxy_handle_time_wait6 also needed some refactor as its current
implementation was xtables-specific.
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
There is a function in include/net/netfilter/nf_socket.h to decide if a
socket has IP(V6)_TRANSPARENT socket option set or not. However this
does the same as inet_sk_transparent() in include/net/tcp.h
include/net/tcp.h:1733
/* This helper checks if socket has IP_TRANSPARENT set */
static inline bool inet_sk_transparent(const struct sock *sk)
{
switch (sk->sk_state) {
case TCP_TIME_WAIT:
return inet_twsk(sk)->tw_transparent;
case TCP_NEW_SYN_RECV:
return inet_rsk(inet_reqsk(sk))->no_srccheck;
}
return inet_sk(sk)->transparent;
}
tproxy_sk_is_transparent has also been refactored to use this function
instead of reimplementing it.
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Pull rdma fixes from Jason Gunthorpe:
"Just three small last minute regressions that were found in the last
week. The Broadcom fix is a bit big for rc7, but since it is fixing
driver crash regressions that were merged via netdev into rc1, I am
sending it.
- bnxt netdev changes merged this cycle caused the bnxt RDMA driver
to crash under certain situations
- Arnd found (several, unfortunately) kconfig problems with the
patches adding INFINIBAND_ADDR_TRANS. Reverting this last part,
will fix it more fully outside -rc.
- Subtle change in error code for a uapi function caused breakage in
userspace. This was bug was subtly introduced cycle"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/core: Fix error code for invalid GID entry
IB: Revert "remove redundant INFINIBAND kconfig dependencies"
RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes
|
|
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree, the most relevant things in this batch are:
1) Compile masquerade infrastructure into NAT module, from Florian Westphal.
Same thing with the redirection support.
2) Abort transaction if early initialization of the commit phase fails.
Also from Florian.
3) Get rid of synchronize_rcu() by using rule array in nf_tables, from
Florian.
4) Abort nf_tables batch if fatal signal is pending, from Florian.
5) Use .call_rcu nfnetlink from nf_tables to make dumps fully lockless.
From Florian Westphal.
6) Support to match transparent sockets from nf_tables, from Máté Eckl.
7) Audit support for nf_tables, from Phil Sutter.
8) Validate chain dependencies from commit phase, fall back to fine grain
validation only in case of errors.
9) Attach dst to skbuff from netfilter flowtable packet path, from
Jason A. Donenfeld.
10) Use artificial maximum attribute cap to remove VLA from nfnetlink.
Patch from Kees Cook.
11) Add extension to allow to forward packets through neighbour layer.
12) Add IPv6 conntrack helper support to IPVS, from Julian Anastasov.
13) Add IPv6 FTP conntrack support to IPVS, from Julian Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ip_vs_ftp requires conntrack modules for mangling
of FTP command responses in passive mode.
Make sure the conntrack hooks are registered when
real servers use NAT method in FTP virtual service.
The hooks will be registered while the service is
present.
Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Clean up: This array was used in a dprintk that was replaced by a
trace point in commit ab03eff58eb5 ("xprtrdma: Add trace points in
RPC Call transmit paths").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
I don't know where this value comes from (probably a copy and paste and
paste and paste ...).
Let's use standard values which are a bit greater.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/netdev-vger-cvs.git/commit/?id=e5afd356a411a
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Matches trace_xprtrdma_dma_unmap(mr).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Currently, when the sendctx queue is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.
With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more sendctxs become
available. This typically takes less than a millisecond, and the
write_space waking mechanism is less deadlock-prone.
Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from sendctx queue exhaustion is desirable.
Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
when there are no free MRs").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up: The logic to wait for write space is common to a bunch of
the encoding helper functions. Lift it out and put it in the tail
of rpcrdma_marshal_req().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The use of -EAGAIN in rpcrdma_convert_iovs() is a latent bug: the
transport never calls xprt_write_space() when more pages become
available. -ENOBUFS will trigger the correct "delay briefly and call
again" logic.
Fixes: 7a89f9c626e3 ("xprtrdma: Honor ->send_request API contract")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
After commit f6cc9c054e77, the following conf is broken (note that the
default loopback mtu is 65536, ie IP_MAX_MTU + 1):
$ ip tunnel add gre1 mode gre local 10.125.0.1 remote 10.125.0.2 dev lo
add tunnel "gre0" failed: Invalid argument
$ ip l a type dummy
$ ip l s dummy1 up
$ ip l s dummy1 mtu 65535
$ ip tunnel add gre1 mode gre local 10.125.0.1 remote 10.125.0.2 dev dummy1
add tunnel "gre0" failed: Invalid argument
dev_set_mtu() doesn't allow to set a mtu which is too large.
First, let's cap the mtu returned by ip_tunnel_bind_dev(). Second, remove
the magic value 0xFFF8 and use IP_MAX_MTU instead.
0xFFF8 seems to be there for ages, I don't know why this value was used.
With a recent kernel, it's also possible to set a mtu > IP_MAX_MTU:
$ ip l s dummy1 mtu 66000
After that patch, it's also possible to bind an ip tunnel on that kind of
interface.
CC: Petr Machata <petrm@mellanox.com>
CC: Ido Schimmel <idosch@mellanox.com>
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/netdev-vger-cvs.git/commit/?id=e5afd356a411a
Fixes: f6cc9c054e77 ("ip_tunnel: Emit events for post-register MTU changes")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2018-05-31
1) Avoid possible overflow of the offset variable
in _decode_session6(), this fixes an infinite
lookp there. From Eric Dumazet.
2) We may use an error pointer in the error path of
xfrm_bundle_create(). Fix this by returning this
pointer directly to the caller.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tc_ctl_tfilter handles three netlink message types: RTM_NEWTFILTER,
RTM_DELTFILTER, RTM_GETTFILTER. However, implementation of this function
involves a lot of branching on specific message type because most of the
code is message-specific. This significantly complicates adding new
functionality and doesn't provide much benefit of code reuse.
Split tc_ctl_tfilter to three standalone functions that handle filter new,
delete and get requests.
The only truly protocol independent part of tc_ctl_tfilter is code that
looks up queue, class, and block. Refactor this code to standalone
tcf_block_find function that is used by all three new handlers.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In rtnl_newlink(), NULL check is performed on m_ops however member of
ops is accessed. Fixed by accessing member of m_ops instead of ops.
[ 345.432629] BUG: KASAN: null-ptr-deref in rtnl_newlink+0x400/0x1110
[ 345.432629] Read of size 4 at addr 0000000000000088 by task ip/986
[ 345.432629]
[ 345.432629] CPU: 1 PID: 986 Comm: ip Not tainted 4.17.0-rc6+ #9
[ 345.432629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 345.432629] Call Trace:
[ 345.432629] dump_stack+0xc6/0x150
[ 345.432629] ? dump_stack_print_info.cold.0+0x1b/0x1b
[ 345.432629] ? kasan_report+0xb4/0x410
[ 345.432629] kasan_report.cold.4+0x8f/0x91
[ 345.432629] ? rtnl_newlink+0x400/0x1110
[ 345.432629] rtnl_newlink+0x400/0x1110
[...]
Fixes: ccf8dbcd062a ("rtnetlink: Remove VLA usage")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
(resend for properly queueing in patchwork)
kcm_clone() creates kernel socket, which does not take net counter.
Thus, the net may die before the socket is completely destructed,
i.e. kcm_exit_net() is executed before kcm_done().
Reported-by: syzbot+5f1a04e374a635efc426@syzkaller.appspotmail.com
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add support for FTP commands with extended format (RFC 2428):
- FTP EPRT: IPv4 and IPv6, active mode, similar to PORT
- FTP EPSV: IPv4 and IPv6, passive mode, similar to PASV.
EPSV response usually contains only port but we allow real
server to provide different address
We restrict control and data connection to be from same
address family.
Allow the "(" and ")" to be optional in PASV response.
Also, add ipvsh argument to the pkt_in/pkt_out handlers to better
access the payload after transport header.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Prepare NFCT to support IPv6 for FTP:
- Do not restrict the expectation callback to PF_INET
- Split the debug messages, so that the 160-byte limitation
in IP_VS_DBG_BUF is not exceeded when printing many IPv6
addresses. This means no more than 3 addresses in one message,
i.e. 1 tuple with 2 addresses or 1 connection with 3 addresses.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This allows us to forward packets from the netdev family via neighbour
layer, so you don't need an explicit link-layer destination when using
this expression from rules. The ttl/hop_limit field is decremented.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The patch moves the "trans->msg_type == NFT_MSG_NEWSET" check before
using nft_trans_set(trans). Otherwise we can get out of bounds read.
For example, KASAN reported the one when running 0001_cache_handling_0 nft
test. In this case "trans->msg_type" was NFT_MSG_NEWTABLE:
[75517.177808] BUG: KASAN: slab-out-of-bounds in nft_set_lookup_global+0x22f/0x270 [nf_tables]
[75517.279094] Read of size 8 at addr ffff881bdb643fc8 by task nft/7356
...
[75517.375605] CPU: 26 PID: 7356 Comm: nft Tainted: G E 4.17.0-rc7.1.x86_64 #1
[75517.489587] Hardware name: Oracle Corporation SUN SERVER X4-2
[75517.618129] Call Trace:
[75517.648821] dump_stack+0xd1/0x13b
[75517.691040] ? show_regs_print_info+0x5/0x5
[75517.742519] ? kmsg_dump_rewind_nolock+0xf5/0xf5
[75517.799300] ? lock_acquire+0x143/0x310
[75517.846738] print_address_description+0x85/0x3a0
[75517.904547] kasan_report+0x18d/0x4b0
[75517.949892] ? nft_set_lookup_global+0x22f/0x270 [nf_tables]
[75518.019153] ? nft_set_lookup_global+0x22f/0x270 [nf_tables]
[75518.088420] ? nft_set_lookup_global+0x22f/0x270 [nf_tables]
[75518.157689] nft_set_lookup_global+0x22f/0x270 [nf_tables]
[75518.224869] nf_tables_newsetelem+0x1a5/0x5d0 [nf_tables]
[75518.291024] ? nft_add_set_elem+0x2280/0x2280 [nf_tables]
[75518.357154] ? nla_parse+0x1a5/0x300
[75518.401455] ? kasan_kmalloc+0xa6/0xd0
[75518.447842] nfnetlink_rcv+0xc43/0x1bdf [nfnetlink]
[75518.507743] ? nfnetlink_rcv+0x7a5/0x1bdf [nfnetlink]
[75518.569745] ? nfnl_err_reset+0x3c0/0x3c0 [nfnetlink]
[75518.631711] ? lock_acquire+0x143/0x310
[75518.679133] ? netlink_deliver_tap+0x9b/0x1070
[75518.733840] ? kasan_unpoison_shadow+0x31/0x40
[75518.788542] netlink_unicast+0x45d/0x680
[75518.837111] ? __isolate_free_page+0x890/0x890
[75518.891913] ? netlink_attachskb+0x6b0/0x6b0
[75518.944542] netlink_sendmsg+0x6fa/0xd30
[75518.993107] ? netlink_unicast+0x680/0x680
[75519.043758] ? netlink_unicast+0x680/0x680
[75519.094402] sock_sendmsg+0xd9/0x160
[75519.138810] ___sys_sendmsg+0x64d/0x980
[75519.186234] ? copy_msghdr_from_user+0x350/0x350
[75519.243118] ? lock_downgrade+0x650/0x650
[75519.292738] ? do_raw_spin_unlock+0x5d/0x250
[75519.345456] ? _raw_spin_unlock+0x24/0x30
[75519.395065] ? __handle_mm_fault+0xbde/0x3410
[75519.448830] ? sock_setsockopt+0x3d2/0x1940
[75519.500516] ? __lock_acquire.isra.25+0xdc/0x19d0
[75519.558448] ? lock_downgrade+0x650/0x650
[75519.608057] ? __audit_syscall_entry+0x317/0x720
[75519.664960] ? __fget_light+0x58/0x250
[75519.711325] ? __sys_sendmsg+0xde/0x170
[75519.758850] __sys_sendmsg+0xde/0x170
[75519.804193] ? __ia32_sys_shutdown+0x90/0x90
[75519.856725] ? syscall_trace_enter+0x897/0x10e0
[75519.912354] ? trace_event_raw_event_sys_enter+0x920/0x920
[75519.979432] ? __audit_syscall_entry+0x720/0x720
[75520.036118] do_syscall_64+0xa3/0x3d0
[75520.081248] ? prepare_exit_to_usermode+0x47/0x1d0
[75520.139904] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[75520.201680] RIP: 0033:0x7fc153320ba0
[75520.245772] RSP: 002b:00007ffe294c3638 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[75520.337708] RAX: ffffffffffffffda RBX: 00007ffe294c4820 RCX: 00007fc153320ba0
[75520.424547] RDX: 0000000000000000 RSI: 00007ffe294c46b0 RDI: 0000000000000003
[75520.511386] RBP: 00007ffe294c47b0 R08: 0000000000000004 R09: 0000000002114090
[75520.598225] R10: 00007ffe294c30a0 R11: 0000000000000246 R12: 00007ffe294c3660
[75520.684961] R13: 0000000000000001 R14: 00007ffe294c3650 R15: 0000000000000001
[75520.790946] Allocated by task 7356:
[75520.833994] kasan_kmalloc+0xa6/0xd0
[75520.878088] __kmalloc+0x189/0x450
[75520.920107] nft_trans_alloc_gfp+0x20/0x190 [nf_tables]
[75520.983961] nf_tables_newtable+0xcd0/0x1bd0 [nf_tables]
[75521.048857] nfnetlink_rcv+0xc43/0x1bdf [nfnetlink]
[75521.108655] netlink_unicast+0x45d/0x680
[75521.157013] netlink_sendmsg+0x6fa/0xd30
[75521.205271] sock_sendmsg+0xd9/0x160
[75521.249365] ___sys_sendmsg+0x64d/0x980
[75521.296686] __sys_sendmsg+0xde/0x170
[75521.341822] do_syscall_64+0xa3/0x3d0
[75521.386957] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[75521.467867] Freed by task 23454:
[75521.507804] __kasan_slab_free+0x132/0x180
[75521.558137] kfree+0x14d/0x4d0
[75521.596005] free_rt_sched_group+0x153/0x280
[75521.648410] sched_autogroup_create_attach+0x19a/0x520
[75521.711330] ksys_setsid+0x2ba/0x400
[75521.755529] __ia32_sys_setsid+0xa/0x10
[75521.802850] do_syscall_64+0xa3/0x3d0
[75521.848090] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[75521.929000] The buggy address belongs to the object at ffff881bdb643f80
which belongs to the cache kmalloc-96 of size 96
[75522.079797] The buggy address is located 72 bytes inside of
96-byte region [ffff881bdb643f80, ffff881bdb643fe0)
[75522.221234] The buggy address belongs to the page:
[75522.280100] page:ffffea006f6d90c0 count:1 mapcount:0 mapping:0000000000000000 index:0x0
[75522.377443] flags: 0x2fffff80000100(slab)
[75522.426956] raw: 002fffff80000100 0000000000000000 0000000000000000 0000000180200020
[75522.521275] raw: ffffea006e6fafc0 0000000c0000000c ffff881bf180f400 0000000000000000
[75522.615601] page dumped because: kasan: bad access detected
Fixes: 37a9cc525525 ("netfilter: nf_tables: add generation mask to sets")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The helper and timeout strings are from user-space, we need to make
sure they are null terminated. If not, evil user could make kernel
read the unexpected memory, even print it when fail to find by the
following codes.
pr_info_ratelimited("No such helper \"%s\"\n", helper_name);
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the maximum size expected for all possible attrs and adds
sanity-checks at both registration and usage to make sure nothing
gets out of sync.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Some drivers, such as vxlan and wireguard, use the skb's dst in order to
determine things like PMTU. They therefore loose functionality when flow
offloading is enabled. So, we ensure the skb has it before xmit'ing it
in the offloading path.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|