Age | Commit message (Collapse) | Author |
|
The SKB_DROP_REASON_IP_INADDRERRORS drop reason is never returned from
any function, as such it cannot be returned from the ip_route_input call
tree. The 'reason != SKB_DROP_REASON_IP_INADDRERRORS' conditional is
thus always true.
Looking back at history, commit 50038bf38e65 ("net: ip: make
ip_route_input() return drop reasons") changed the ip_route_input
returned value check in br_nf_pre_routing_finish from -EHOSTUNREACH to
SKB_DROP_REASON_IP_INADDRERRORS. It turns out -EHOSTUNREACH could not be
returned either from the ip_route_input call tree and this since commit
251da4130115 ("ipv4: Cache ip_error() routes even when not
forwarding.").
Not a fix as this won't change the behavior. While at it use
kfree_skb_reason.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
In this commit, we make ip_route_input() return skb drop reasons that come
from ip_route_input_noref().
Meanwhile, adjust all the call to it.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Convert the "tos" parameter of ip_route_output() to dscp_t. This way
we'll have a dscp_t value directly available when .flowi4_tos will
eventually be converted to dscp_t.
All ip_route_output() callers but one set this "tos" parameter to 0 and
therefore don't need to be adapted to the new prototype.
Only br_nf_pre_routing_finish() needs conversion. It can just use
ip4h_dscp() to get the DSCP field from the IPv4 header.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0f10d031dd44c70aae9bc6e19391cb30d5c2fe71.1730928699.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.12-rc3).
No conflicts and no adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix a kernel panic in the br_netfilter module when sending untagged
traffic via a VxLAN device.
This happens during the check for fragmentation in br_nf_dev_queue_xmit.
It is dependent on:
1) the br_netfilter module being loaded;
2) net.bridge.bridge-nf-call-iptables set to 1;
3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port;
4) untagged frames with size higher than the VxLAN MTU forwarded/flooded
When forwarding the untagged packet to the VxLAN bridge port, before
the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and
changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type
of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL.
Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check
for frames that needs to be fragmented: frames with higher MTU than the
VxLAN device end up calling br_nf_ip_fragment, which in turns call
ip_skb_dst_mtu.
The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst
with valid dst->dev, thus the crash.
This case was never supported in the first place, so drop the packet
instead.
PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[ 176.291791] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000110
[ 176.292101] Mem abort info:
[ 176.292184] ESR = 0x0000000096000004
[ 176.292322] EC = 0x25: DABT (current EL), IL = 32 bits
[ 176.292530] SET = 0, FnV = 0
[ 176.292709] EA = 0, S1PTW = 0
[ 176.292862] FSC = 0x04: level 0 translation fault
[ 176.293013] Data abort info:
[ 176.293104] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 176.293488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 176.293787] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000
[ 176.294166] [0000000000000110] pgd=0000000000000000,
p4d=0000000000000000
[ 176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth
br_netfilter bridge stp llc ipv6 crct10dif_ce
[ 176.295923] CPU: 0 PID: 188 Comm: ping Not tainted
6.8.0-rc3-g5b3fbd61b9d1 #2
[ 176.296314] Hardware name: linux,dummy-virt (DT)
[ 176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[ 176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[ 176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter]
[ 176.297636] sp : ffff800080003630
[ 176.297743] x29: ffff800080003630 x28: 0000000000000008 x27:
ffff6828c49ad9f8
[ 176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24:
00000000000003e8
[ 176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21:
ffff6828c3b16d28
[ 176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18:
0000000000000014
[ 176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15:
0000000095744632
[ 176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12:
ffffb7e137926a70
[ 176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 :
0000000000000000
[ 176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 :
f20e0100bebafeca
[ 176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 :
0000000000000000
[ 176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 :
ffff6828c7f918f0
[ 176.300889] Call trace:
[ 176.301123] br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[ 176.301411] br_nf_post_routing+0x2a8/0x3e4 [br_netfilter]
[ 176.301703] nf_hook_slow+0x48/0x124
[ 176.302060] br_forward_finish+0xc8/0xe8 [bridge]
[ 176.302371] br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[ 176.302605] br_nf_forward_finish+0x118/0x22c [br_netfilter]
[ 176.302824] br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter]
[ 176.303136] br_nf_forward+0x2b8/0x4e0 [br_netfilter]
[ 176.303359] nf_hook_slow+0x48/0x124
[ 176.303803] __br_forward+0xc4/0x194 [bridge]
[ 176.304013] br_flood+0xd4/0x168 [bridge]
[ 176.304300] br_handle_frame_finish+0x1d4/0x5c4 [bridge]
[ 176.304536] br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[ 176.304978] br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter]
[ 176.305188] br_nf_pre_routing+0x250/0x524 [br_netfilter]
[ 176.305428] br_handle_frame+0x244/0x3cc [bridge]
[ 176.305695] __netif_receive_skb_core.constprop.0+0x33c/0xecc
[ 176.306080] __netif_receive_skb_one_core+0x40/0x8c
[ 176.306197] __netif_receive_skb+0x18/0x64
[ 176.306369] process_backlog+0x80/0x124
[ 176.306540] __napi_poll+0x38/0x17c
[ 176.306636] net_rx_action+0x124/0x26c
[ 176.306758] __do_softirq+0x100/0x26c
[ 176.307051] ____do_softirq+0x10/0x1c
[ 176.307162] call_on_irq_stack+0x24/0x4c
[ 176.307289] do_softirq_own_stack+0x1c/0x2c
[ 176.307396] do_softirq+0x54/0x6c
[ 176.307485] __local_bh_enable_ip+0x8c/0x98
[ 176.307637] __dev_queue_xmit+0x22c/0xd28
[ 176.307775] neigh_resolve_output+0xf4/0x1a0
[ 176.308018] ip_finish_output2+0x1c8/0x628
[ 176.308137] ip_do_fragment+0x5b4/0x658
[ 176.308279] ip_fragment.constprop.0+0x48/0xec
[ 176.308420] __ip_finish_output+0xa4/0x254
[ 176.308593] ip_finish_output+0x34/0x130
[ 176.308814] ip_output+0x6c/0x108
[ 176.308929] ip_send_skb+0x50/0xf0
[ 176.309095] ip_push_pending_frames+0x30/0x54
[ 176.309254] raw_sendmsg+0x758/0xaec
[ 176.309568] inet_sendmsg+0x44/0x70
[ 176.309667] __sys_sendto+0x110/0x178
[ 176.309758] __arm64_sys_sendto+0x28/0x38
[ 176.309918] invoke_syscall+0x48/0x110
[ 176.310211] el0_svc_common.constprop.0+0x40/0xe0
[ 176.310353] do_el0_svc+0x1c/0x28
[ 176.310434] el0_svc+0x34/0xb4
[ 176.310551] el0t_64_sync_handler+0x120/0x12c
[ 176.310690] el0t_64_sync+0x190/0x194
[ 176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[ 176.315743] ---[ end trace 0000000000000000 ]---
[ 176.316060] Kernel panic - not syncing: Oops: Fatal exception in
interrupt
[ 176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[ 176.316564] PHYS_OFFSET: 0xffff97d780000000
[ 176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[ 176.317210] Memory Limit: none
[ 176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal
Exception in interrupt ]---\
Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Andy Roulin <aroulin@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to
prevent accidental setting of ECN bits in ->flowi4_tos.
Callers of ip_route_input() to consider are:
* input_action_end_dx4_finish() and input_action_end_dt4() in
net/ipv6/seg6_local.c. These functions set the tos parameter to 0,
which is already a valid dscp_t value, so they don't need to be
adjusted for the new prototype.
* icmp_route_lookup(), which already has a dscp_t variable to pass as
parameter. We just need to remove the inet_dscp_to_dsfield()
conversion.
* br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(),
which get the DSCP directly from IPv4 headers. Define a helper to
read the .tos field of struct iphdr as dscp_t, so that these
function don't have to do the conversion manually.
While there, declare *iph as const in br_nf_pre_routing_finish(),
declare its local variables in reverse-christmas-tree order and move
the "err = ip_route_input()" assignment out of the conditional to avoid
checkpatch warning.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Unmask upper DSCP bits when calling ip_route_output() so that in the
future it could perform the FIB lookup according to the full DSCP value.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conntrack assumes an unconfirmed entry (not yet committed to global hash
table) has a refcount of 1 and is not visible to other cores.
With multicast forwarding this assumption breaks down because such
skbs get cloned after being picked up, i.e. ct->use refcount is > 1.
Likewise, bridge netfilter will clone broad/mutlicast frames and
all frames in case they need to be flood-forwarded during learning
phase.
For ip multicast forwarding or plain bridge flood-forward this will
"work" because packets don't leave softirq and are implicitly
serialized.
With nfqueue this no longer holds true, the packets get queued
and can be reinjected in arbitrary ways.
Disable this feature, I see no other solution.
After this patch, nfqueue cannot queue packets except the last
multicast/broadcast packet.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.
This patch has been generated by the following coccinelle script:
```
virtual patch
@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos);
@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }
@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int , void *, size_t *, loff_t *);
@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
,int , void *, size_t *, loff_t *);
@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code
conventions. The xfs_stats_clear_proc_handler,
xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified.
This is called from a proc_handler itself and is calling back into
another proc_handler, making it necessary to change it as part of the
proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
|
|
brnf_frag_data_storage is a per-CPU variable and relies on disabled BH
for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
Add a local_lock_t to the data structure and use local_lock_nested_bh()
for locking. This change adds only lockdep coverage and does not alter
the functional behaviour for !PREEMPT_RT.
Cc: Florian Westphal <fw@strlen.de>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: bridge@lists.linux.dev
Cc: coreteam@netfilter.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20240620132727.660738-8-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
* Remove sentinel elements from ctl_table structs
* Remove instances where an array element is zeroed out to make it look
like a sentinel. This is not longer needed and is safe after commit
c899710fe7f9 ("networking: Update to register_net_sysctl_sz") added
the array size to the ctl_table registration
* Remove the need for having __NF_SYSCTL_CT_LAST_SYSCTL as the
sysctl array size is now in NF_SYSCTL_CT_LAST_SYSCTL
* Remove extra element in ctl_table arrays declarations
Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.
Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
include/trace/events/rpcgss.h
386f4a737964 ("trace: events: cleanup deprecated strncpy uses")
a4833e3abae1 ("SUNRPC: Fix rpcgss_context trace event acceptor field")
Adjacent changes:
drivers/net/ethernet/intel/ice/ice_tc_lib.c
2cca35f5dd78 ("ice: Fix checking for unsupported keys on non-tunnel device")
784feaa65dfd ("ice: Add support for PFCP hardware offload in switchdev")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For historical reasons, when bridge device is in promisc mode, packets
that are directed to the taps follow bridge input hook path. This patch
adds a workaround to reset conntrack for these packets.
Jianbo Liu reports warning splats in their test infrastructure where
cloned packets reach the br_netfilter input hook to confirm the
conntrack object.
Scratch one bit from BR_INPUT_SKB_CB to annotate that this packet has
reached the input hook because it is passed up to the bridge device to
reach the taps.
[ 57.571874] WARNING: CPU: 1 PID: 0 at net/bridge/br_netfilter_hooks.c:616 br_nf_local_in+0x157/0x180 [br_netfilter]
[ 57.572749] Modules linked in: xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_isc si ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5ctl mlx5_core
[ 57.575158] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0+ #19
[ 57.575700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 57.576662] RIP: 0010:br_nf_local_in+0x157/0x180 [br_netfilter]
[ 57.577195] Code: fe ff ff 41 bd 04 00 00 00 be 04 00 00 00 e9 4a ff ff ff be 04 00 00 00 48 89 ef e8 f3 a9 3c e1 66 83 ad b4 00 00 00 04 eb 91 <0f> 0b e9 f1 fe ff ff 0f 0b e9 df fe ff ff 48 89 df e8 b3 53 47 e1
[ 57.578722] RSP: 0018:ffff88885f845a08 EFLAGS: 00010202
[ 57.579207] RAX: 0000000000000002 RBX: ffff88812dfe8000 RCX: 0000000000000000
[ 57.579830] RDX: ffff88885f845a60 RSI: ffff8881022dc300 RDI: 0000000000000000
[ 57.580454] RBP: ffff88885f845a60 R08: 0000000000000001 R09: 0000000000000003
[ 57.581076] R10: 00000000ffff1300 R11: 0000000000000002 R12: 0000000000000000
[ 57.581695] R13: ffff8881047ffe00 R14: ffff888108dbee00 R15: ffff88814519b800
[ 57.582313] FS: 0000000000000000(0000) GS:ffff88885f840000(0000) knlGS:0000000000000000
[ 57.583040] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 57.583564] CR2: 000000c4206aa000 CR3: 0000000103847001 CR4: 0000000000370eb0
[ 57.584194] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 57.584820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 57.585440] Call Trace:
[ 57.585721] <IRQ>
[ 57.585976] ? __warn+0x7d/0x130
[ 57.586323] ? br_nf_local_in+0x157/0x180 [br_netfilter]
[ 57.586811] ? report_bug+0xf1/0x1c0
[ 57.587177] ? handle_bug+0x3f/0x70
[ 57.587539] ? exc_invalid_op+0x13/0x60
[ 57.587929] ? asm_exc_invalid_op+0x16/0x20
[ 57.588336] ? br_nf_local_in+0x157/0x180 [br_netfilter]
[ 57.588825] nf_hook_slow+0x3d/0xd0
[ 57.589188] ? br_handle_vlan+0x4b/0x110
[ 57.589579] br_pass_frame_up+0xfc/0x150
[ 57.589970] ? br_port_flags_change+0x40/0x40
[ 57.590396] br_handle_frame_finish+0x346/0x5e0
[ 57.590837] ? ipt_do_table+0x32e/0x430
[ 57.591221] ? br_handle_local_finish+0x20/0x20
[ 57.591656] br_nf_hook_thresh+0x4b/0xf0 [br_netfilter]
[ 57.592286] ? br_handle_local_finish+0x20/0x20
[ 57.592802] br_nf_pre_routing_finish+0x178/0x480 [br_netfilter]
[ 57.593348] ? br_handle_local_finish+0x20/0x20
[ 57.593782] ? nf_nat_ipv4_pre_routing+0x25/0x60 [nf_nat]
[ 57.594279] br_nf_pre_routing+0x24c/0x550 [br_netfilter]
[ 57.594780] ? br_nf_hook_thresh+0xf0/0xf0 [br_netfilter]
[ 57.595280] br_handle_frame+0x1f3/0x3d0
[ 57.595676] ? br_handle_local_finish+0x20/0x20
[ 57.596118] ? br_handle_frame_finish+0x5e0/0x5e0
[ 57.596566] __netif_receive_skb_core+0x25b/0xfc0
[ 57.597017] ? __napi_build_skb+0x37/0x40
[ 57.597418] __netif_receive_skb_list_core+0xfb/0x220
Fixes: 62e7151ae3eb ("netfilter: bridge: confirm multicast packets before passing them up the stack")
Reported-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add a "scope" parameter to ip_route_output() so that callers don't have
to override the tos parameter with the RTO_ONLINK flag if they want a
local scope.
This will allow converting flowi4_tos to dscp_t in the future, thus
allowing static analysers to flag invalid interactions between
"tos" (the DSCP bits) and ECN.
Only three users ask for local scope (bonding, arp and atm). The others
continue to use RT_SCOPE_UNIVERSE. While there, add a comment to warn
users about the limitations of ip_route_output().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com> # infiniband
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
conntrack nf_confirm logic cannot handle cloned skbs referencing
the same nf_conn entry, which will happen for multicast (broadcast)
frames on bridges.
Example:
macvlan0
|
br0
/ \
ethX ethY
ethX (or Y) receives a L2 multicast or broadcast packet containing
an IP packet, flow is not yet in conntrack table.
1. skb passes through bridge and fake-ip (br_netfilter)Prerouting.
-> skb->_nfct now references a unconfirmed entry
2. skb is broad/mcast packet. bridge now passes clones out on each bridge
interface.
3. skb gets passed up the stack.
4. In macvlan case, macvlan driver retains clone(s) of the mcast skb
and schedules a work queue to send them out on the lower devices.
The clone skb->_nfct is not a copy, it is the same entry as the
original skb. The macvlan rx handler then returns RX_HANDLER_PASS.
5. Normal conntrack hooks (in NF_INET_LOCAL_IN) confirm the orig skb.
The Macvlan broadcast worker and normal confirm path will race.
This race will not happen if step 2 already confirmed a clone. In that
case later steps perform skb_clone() with skb->_nfct already confirmed (in
hash table). This works fine.
But such confirmation won't happen when eb/ip/nftables rules dropped the
packets before they reached the nf_confirm step in postrouting.
Pablo points out that nf_conntrack_bridge doesn't allow use of stateful
nat, so we can safely discard the nf_conn entry and let inet call
conntrack again.
This doesn't work for bridge netfilter: skb could have a nat
transformation. Also bridge nf prevents re-invocation of inet prerouting
via 'sabotage_in' hook.
Work around this problem by explicit confirmation of the entry at LOCAL_IN
time, before upper layer has a chance to clone the unconfirmed entry.
The downside is that this disables NAT and conntrack helpers.
Alternative fix would be to add locking to all code parts that deal with
unconfirmed packets, but even if that could be done in a sane way this
opens up other problems, for example:
-m physdev --physdev-out eth0 -j SNAT --snat-to 1.2.3.4
-m physdev --physdev-out eth1 -j SNAT --snat-to 1.2.3.5
For multicast case, only one of such conflicting mappings will be
created, conntrack only handles 1:1 NAT mappings.
Users should set create a setup that explicitly marks such traffic
NOTRACK (conntrack bypass) to avoid this, but we cannot auto-bypass
them, ruleset might have accept rules for untracked traffic already,
so user-visible behaviour would change.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217777
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
An skb can be added to a neigh->arp_queue while waiting for an arp
reply. Where original skb's skb->dev can be different to neigh's
neigh->dev. For instance in case of bridging dnated skb from one veth to
another, the skb would be added to a neigh->arp_queue of the bridge.
As skb->dev can be reset back to nf_bridge->physindev and used, and as
there is no explicit mechanism that prevents this physindev from been
freed under us (for instance neigh_flush_dev doesn't cleanup skbs from
different device's neigh queue) we can crash on e.g. this stack:
arp_process
neigh_update
skb = __skb_dequeue(&neigh->arp_queue)
neigh_resolve_output(..., skb)
...
br_nf_dev_xmit
br_nf_pre_routing_finish_bridge_slow
skb->dev = nf_bridge->physindev
br_handle_frame_finish
Let's use plain ifindex instead of net_device link. To peek into the
original net_device we will use dev_get_by_index_rcu(). Thus either we
get device and are safe to use it or we don't get it and drop skb.
Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
br_netfilter registers two forward hooks, one for ip and one for arp.
Just use a common function for both and then call the arp/ip helper
as needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
errno is 0 because these hooks are called from prerouting and forward.
There is no socket that the errno would ever be propagated to.
Other netfilter modules (e.g. nf_nat, conntrack, ...) can be converted
in a similar way.
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
n->output field can be read locklessly, while a writer
might change the pointer concurrently.
Add missing annotations to prevent load-store tearing.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move from register_net_sysctl to register_net_sysctl_sz for all the
netfilter related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.
We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Adjacent changes:
net/mptcp/protocol.h
63740448a32e ("mptcp: fix accept vs worker race")
2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close")
ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Recent attempt to ensure PREROUTING hook is executed again when a
decrypted ipsec packet received on a bridge passes through the network
stack a second time broke the physdev match in INPUT hook.
We can't discard the nf_bridge info strct from sabotage_in hook, as
this is needed by the physdev match.
Keep the struct around and handle this with another conditional instead.
Fixes: 2b272bb558f1 ("netfilter: br_netfilter: disable sabotage_in hook after first suppression")
Reported-and-tested-by: Farid BENAMROUCHE <fariouche@yahoo.fr>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
We have many lockless accesses to n->nud_state.
Before adding another one in the following patch,
add annotations to readers and writers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net/core/gro.c
7d2c89b32587 ("skb: Do mix page pool and page referenced frags in GRO")
b1a78b9b9886 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
These 3 places in bridge netfilter are called on RX path after GRO
and IPv4 TCP GSO packets may come through, so replace iph tot_len
accessing with skb_ip_totlen() in there.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When using a xfrm interface in a bridged setup (the outgoing device is
bridged), the incoming packets in the xfrm interface are only tracked
in the outgoing direction.
$ brctl show
bridge name interfaces
br_eth1 eth1
$ conntrack -L
tcp 115 SYN_SENT src=192... dst=192... [UNREPLIED] ...
If br_netfilter is enabled, the first (encrypted) packet is received onR
eth1, conntrack hooks are called from br_netfilter emulation which
allocates nf_bridge info for this skb.
If the packet is for local machine, skb gets passed up the ip stack.
The skb passes through ip prerouting a second time. br_netfilter
ip_sabotage_in supresses the re-invocation of the hooks.
After this, skb gets decrypted in xfrm layer and appears in
network stack a second time (after decryption).
Then, ip_sabotage_in is called again and suppresses netfilter
hook invocation, even though the bridge layer never called them
for the plaintext incarnation of the packet.
Free the bridge info after the first suppression to avoid this.
I was unable to figure out where the regression comes from, as far as i
can see br_netfilter always had this problem; i did not expect that skb
is looped again with different headers.
Fixes: c4b0e771f906 ("netfilter: avoid using skb->nf_bridge directly")
Reported-and-tested-by: Wolfgang Nothdurft <wolfgang@linogate.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The IPv6 path already drops dst in the daddr changed case, but the IPv4
path does not. This change makes the two code paths consistent.
Further, it is possible that there is already a metadata_dst allocated from
ingress that might already be attached to skbuff->dst while following
the bridge path. If it is not released before setting a new
metadata_dst, it will be leaked. This is similar to what is done in
bpf_set_tunnel_key() or ip6_route_input().
It is important to note that the memory being leaked is not the dst
being set in the bridge code, but rather memory allocated from some
other code path that is not being freed correctly before the skb dst is
overwritten.
An example of the leakage fixed by this commit found using kmemleak:
unreferenced object 0xffff888010112b00 (size 256):
comm "softirq", pid 0, jiffies 4294762496 (age 32.012s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 80 16 f1 83 ff ff ff ff ................
e1 4e f6 82 ff ff ff ff 00 00 00 00 00 00 00 00 .N..............
backtrace:
[<00000000d79567ea>] metadata_dst_alloc+0x1b/0xe0
[<00000000be113e13>] udp_tun_rx_dst+0x174/0x1f0
[<00000000a36848f4>] geneve_udp_encap_recv+0x350/0x7b0
[<00000000d4afb476>] udp_queue_rcv_one_skb+0x380/0x560
[<00000000ac064aea>] udp_unicast_rcv_skb+0x75/0x90
[<000000009a8ee8c5>] ip_protocol_deliver_rcu+0xd8/0x230
[<00000000ef4980bb>] ip_local_deliver_finish+0x7a/0xa0
[<00000000d7533c8c>] __netif_receive_skb_one_core+0x89/0xa0
[<00000000a879497d>] process_backlog+0x93/0x190
[<00000000e41ade9f>] __napi_poll+0x28/0x170
[<00000000b4c0906b>] net_rx_action+0x14f/0x2a0
[<00000000b20dd5d4>] __do_softirq+0xf4/0x305
[<000000003a7d7e15>] __irq_exit_rcu+0xc3/0x140
[<00000000968d39a2>] sysvec_apic_timer_interrupt+0x9e/0xc0
[<000000009e920794>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[<000000008942add0>] native_safe_halt+0x13/0x20
Florian Westphal says: "Original code was likely fine because nothing
ever did set a skb->dst entry earlier than bridge in those days."
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Harsh Modi <harshmodi@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
When br_netfilter module is loaded, skbs may be diverted to the
ipv4/ipv6 hooks, just like as if we were routing.
Unfortunately, bridge filter hooks with priority 0 may be skipped
in this case.
Example:
1. an nftables bridge ruleset is loaded, with a prerouting
hook that has priority 0.
2. interface is added to the bridge.
3. no tcp packet is ever seen by the bridge prerouting hook.
4. flush the ruleset
5. load the bridge ruleset again.
6. tcp packets are processed as expected.
After 1) the only registered hook is the bridge prerouting hook, but its
not called yet because the bridge hasn't been brought up yet.
After 2), hook order is:
0 br_nf_pre_routing // br_netfilter internal hook
0 chain bridge f prerouting // nftables bridge ruleset
The packet is diverted to br_nf_pre_routing.
If call-iptables is off, the nftables bridge ruleset is called as expected.
But if its enabled, br_nf_hook_thresh() will skip it because it assumes
that all 0-priority hooks had been called previously in bridge context.
To avoid this, check for the br_nf_pre_routing hook itself, we need to
resume directly after it, even if this hook has a priority of 0.
Unfortunately, this still results in different packet flow.
With this fix, the eval order after in 3) is:
1. br_nf_pre_routing
2. ip(6)tables (if enabled)
3. nftables bridge
but after 5 its the much saner:
1. nftables bridge
2. br_nf_pre_routing
3. ip(6)tables (if enabled)
Unfortunately I don't see a solution here:
It would be possible to move br_nf_pre_routing to a higher priority
so that it will be called later in the pipeline, but this also impacts
ebtables evaluation order, and would still result in this very ordering
problem for all nftables-bridge hooks with the same priority as the
br_nf_pre_routing one.
Searching back through the git history I don't think this has
ever behaved in any other way, hence, no fixes-tag.
Reported-by: Radim Hrazdil <rhrazdil@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This makes 'bridge-nf-filter-pppoe-tagged' sysctl work for
bridged traffic.
Looking at the original commit it doesn't appear this ever worked:
static unsigned int br_nf_post_routing(unsigned int hook, struct sk_buff **pskb,
[..]
if (skb->protocol == htons(ETH_P_8021Q)) {
skb_pull(skb, VLAN_HLEN);
skb->network_header += VLAN_HLEN;
+ } else if (skb->protocol == htons(ETH_P_PPP_SES)) {
+ skb_pull(skb, PPPOE_SES_HLEN);
+ skb->network_header += PPPOE_SES_HLEN;
}
[..]
NF_HOOK(... POST_ROUTING, ...)
... but the adjusted offsets are never restored.
The alternative would be to rip this code out for good,
but otoh we'd have to keep this anyway for the vlan handling
(which works because vlan tag info is in the skb, not the packet
payload).
Reported-and-tested-by: Amish Chana <amish@3g.co.za>
Fixes: 516299d2f5b6f97 ("[NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Make use of netdev helper functions to improve code readability.
Replace 'dev->priv_flags & IFF_EBRIDGE' with netif_is_bridge_master(dev).
Signed-off-by: Kyungrok Chung <acadx0@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the
hooks as, while it's an expected value for a bridge, routing expects
PACKET_HOST. The change is undone later on after hook traversal. This
can be seen with pairs of functions updating skb>pkt_type and then
reverting it to its original value:
For hook NF_INET_PRE_ROUTING:
setup_pre_routing / br_nf_pre_routing_finish
For hook NF_INET_FORWARD:
br_nf_forward_ip / br_nf_forward_finish
But the third case where netfilter does this, for hook
NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing
but never reverted. A comment says:
/* We assume any code from br_dev_queue_push_xmit onwards doesn't care
* about the value of skb->pkt_type. */
But when having a tunnel (say vxlan) attached to a bridge we have the
following call trace:
br_nf_pre_routing
br_nf_pre_routing_ipv6
br_nf_pre_routing_finish
br_nf_forward_ip
br_nf_forward_finish
br_nf_post_routing <- pkt_type is updated to PACKET_HOST
br_nf_dev_queue_xmit <- but not reverted to its original value
vxlan_xmit
vxlan_xmit_one
skb_tunnel_check_pmtu <- a check on pkt_type is performed
In this specific case, this creates issues such as when an ICMPv6 PTB
should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB
isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST
and returns early).
If the comment is right and no one cares about the value of
skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting
it to its original value should be safe.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
syzbot is kind enough to remind us we need to call skb_may_pull()
BUG: KMSAN: uninit-value in br_nf_forward_arp+0xe61/0x1230 net/bridge/br_netfilter_hooks.c:665
CPU: 1 PID: 11631 Comm: syz-executor.1 Not tainted 5.4.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1c9/0x220 lib/dump_stack.c:118
kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108
__msan_warning+0x64/0xc0 mm/kmsan/kmsan_instr.c:245
br_nf_forward_arp+0xe61/0x1230 net/bridge/br_netfilter_hooks.c:665
nf_hook_entry_hookfn include/linux/netfilter.h:135 [inline]
nf_hook_slow+0x18b/0x3f0 net/netfilter/core.c:512
nf_hook include/linux/netfilter.h:260 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
__br_forward+0x78f/0xe30 net/bridge/br_forward.c:109
br_flood+0xef0/0xfe0 net/bridge/br_forward.c:234
br_handle_frame_finish+0x1a77/0x1c20 net/bridge/br_input.c:162
nf_hook_bridge_pre net/bridge/br_input.c:245 [inline]
br_handle_frame+0xfb6/0x1eb0 net/bridge/br_input.c:348
__netif_receive_skb_core+0x20b9/0x51a0 net/core/dev.c:4830
__netif_receive_skb_one_core net/core/dev.c:4927 [inline]
__netif_receive_skb net/core/dev.c:5043 [inline]
process_backlog+0x610/0x13c0 net/core/dev.c:5874
napi_poll net/core/dev.c:6311 [inline]
net_rx_action+0x7a6/0x1aa0 net/core/dev.c:6379
__do_softirq+0x4a1/0x83a kernel/softirq.c:293
do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1091
</IRQ>
do_softirq kernel/softirq.c:338 [inline]
__local_bh_enable_ip+0x184/0x1d0 kernel/softirq.c:190
local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline]
__dev_queue_xmit+0x38e8/0x4200 net/core/dev.c:3819
dev_queue_xmit+0x4b/0x60 net/core/dev.c:3825
packet_snd net/packet/af_packet.c:2959 [inline]
packet_sendmsg+0x8234/0x9100 net/packet/af_packet.c:2984
sock_sendmsg_nosec net/socket.c:637 [inline]
sock_sendmsg net/socket.c:657 [inline]
__sys_sendto+0xc44/0xc70 net/socket.c:1952
__do_sys_sendto net/socket.c:1964 [inline]
__se_sys_sendto+0x107/0x130 net/socket.c:1960
__x64_sys_sendto+0x6e/0x90 net/socket.c:1960
do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45a679
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f0a3c9e5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 000000000045a679
RDX: 000000000000000e RSI: 0000000020000200 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 00000000200000c0 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0a3c9e66d4
R13: 00000000004c8ec1 R14: 00000000004dfe28 R15: 00000000ffffffff
Uninit was created at:
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:149 [inline]
kmsan_internal_poison_shadow+0x5c/0x110 mm/kmsan/kmsan.c:132
kmsan_slab_alloc+0x97/0x100 mm/kmsan/kmsan_hooks.c:86
slab_alloc_node mm/slub.c:2773 [inline]
__kmalloc_node_track_caller+0xe27/0x11a0 mm/slub.c:4381
__kmalloc_reserve net/core/skbuff.c:141 [inline]
__alloc_skb+0x306/0xa10 net/core/skbuff.c:209
alloc_skb include/linux/skbuff.h:1049 [inline]
alloc_skb_with_frags+0x18c/0xa80 net/core/skbuff.c:5662
sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2244
packet_alloc_skb net/packet/af_packet.c:2807 [inline]
packet_snd net/packet/af_packet.c:2902 [inline]
packet_sendmsg+0x63a6/0x9100 net/packet/af_packet.c:2984
sock_sendmsg_nosec net/socket.c:637 [inline]
sock_sendmsg net/socket.c:657 [inline]
__sys_sendto+0xc44/0xc70 net/socket.c:1952
__do_sys_sendto net/socket.c:1964 [inline]
__se_sys_sendto+0x107/0x130 net/socket.c:1960
__x64_sys_sendto+0x6e/0x90 net/socket.c:1960
do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
A kernel panic can happen if a host has disabled IPv6 on boot and have to
process guest packets (coming from a bridge) using it's ip6tables.
IPv6 packets need to be dropped if the IPv6 module is not loaded, and the
host ip6tables will be used.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Resolve conflict between d2912cb15bdd ("treewide: Replace GPLv2
boilerplate/reference with SPDX - rule 500") removing the GPL disclaimer
and fe03d4745675 ("Update my email address") which updates Jozsef
Kadlecsik's email.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Prevent a UAF in brnf_exit_net().
When unregister_net_sysctl_table() is called the ctl_hdr pointer will
obviously be freed and so accessing it righter after is invalid. Fix
this by stashing a pointer to the table we want to free before we
unregister the sysctl header.
Note that syzkaller falsely chased this down to the drm tree so the
Fixes tag that syzkaller requested would be wrong. This commit uses a
different but the correct Fixes tag.
/* Splat */
BUG: KASAN: use-after-free in br_netfilter_sysctl_exit_net
net/bridge/br_netfilter_hooks.c:1121 [inline]
BUG: KASAN: use-after-free in brnf_exit_net+0x38c/0x3a0
net/bridge/br_netfilter_hooks.c:1141
Read of size 8 at addr ffff8880a4078d60 by task kworker/u4:4/8749
CPU: 0 PID: 8749 Comm: kworker/u4:4 Not tainted 5.2.0-rc5-next-20190618 #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
Workqueue: netns cleanup_net
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
__kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline]
brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141
ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
process_one_work+0x989/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x354/0x420 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Allocated by task 11374:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
__do_kmalloc mm/slab.c:3645 [inline]
__kmalloc+0x15c/0x740 mm/slab.c:3654
kmalloc include/linux/slab.h:552 [inline]
kzalloc include/linux/slab.h:743 [inline]
__register_sysctl_table+0xc7/0xef0 fs/proc/proc_sysctl.c:1327
register_net_sysctl+0x29/0x30 net/sysctl_net.c:121
br_netfilter_sysctl_init_net net/bridge/br_netfilter_hooks.c:1105 [inline]
brnf_init_net+0x379/0x6a0 net/bridge/br_netfilter_hooks.c:1126
ops_init+0xb3/0x410 net/core/net_namespace.c:130
setup_net+0x2d3/0x740 net/core/net_namespace.c:316
copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:103
unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:202
ksys_unshare+0x444/0x980 kernel/fork.c:2822
__do_sys_unshare kernel/fork.c:2890 [inline]
__se_sys_unshare kernel/fork.c:2888 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:2888
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 9:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3417 [inline]
kfree+0x10a/0x2c0 mm/slab.c:3746
__rcu_reclaim kernel/rcu/rcu.h:215 [inline]
rcu_do_batch kernel/rcu/tree.c:2092 [inline]
invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
rcu_core+0xcc7/0x1500 kernel/rcu/tree.c:2291
__do_softirq+0x25c/0x94c kernel/softirq.c:292
The buggy address belongs to the object at ffff8880a4078d40
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 32 bytes inside of
512-byte region [ffff8880a4078d40, ffff8880a4078f40)
The buggy address belongs to the page:
page:ffffea0002901e00 refcount:1 mapcount:0 mapping:ffff8880aa400a80
index:0xffff8880a40785c0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea0001d636c8 ffffea0001b07308 ffff8880aa400a80
raw: ffff8880a40785c0 ffff8880a40780c0 0000000100000004 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880a4078c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880a4078c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> ffff8880a4078d00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff8880a4078d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880a4078e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Reported-by: syzbot+43a3fa52c0d9c5c94f41@syzkaller.appspotmail.com
Fixes: 22567590b2e6 ("netfilter: bridge: namespace bridge netfilter sysctls")
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Currently, the /proc/sys/net/bridge folder is only created in the initial
network namespace. This patch ensures that the /proc/sys/net/bridge folder
is available in each network namespace if the module is loaded and
disappears from all network namespaces when the module is unloaded.
In doing so the patch makes the sysctls:
bridge-nf-call-arptables
bridge-nf-call-ip6tables
bridge-nf-call-iptables
bridge-nf-filter-pppoe-tagged
bridge-nf-filter-vlan-tagged
bridge-nf-pass-vlan-input-dev
apply per network namespace. This unblocks some use-cases where users would
like to e.g. not do bridge filtering for bridges in a specific network
namespace while doing so for bridges located in another network namespace.
The netfilter rules are afaict already per network namespace so it should
be safe for users to specify whether bridge devices inside a network
namespace are supposed to go through iptables et al. or not. Also, this can
already be done per-bridge by setting an option for each individual bridge
via Netlink. It should also be possible to do this for all bridges in a
network namespace via sysctls.
Cc: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This ports the sysctls to use struct brnf_net.
With this patch we make it possible to namespace the br_netfilter module in
the following patch.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Since Commit 21d1196a35f5 ("ipv4: set transport header earlier"),
skb->transport_header has been always set before entering INET
netfilter. This patch is to set skb->transport_header for bridge
before entering INET netfilter by bridge-nf-call-iptables.
It also fixes an issue that sctp_error() couldn't compute a right
csum due to unset skb->transport_header.
Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code")
Reported-by: Li Shuang <shuali@redhat.com>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Followup to a173f066c7cf ("netfilter: bridge: Don't sabotage nf_hook
calls from an l3mdev"). Some packets (e.g., ndisc) do not have the skb
device flipped to the l3mdev (e.g., VRF) device. Update ip_sabotage_in
to not drop packets for slave devices too. Currently, neighbor
solicitation packets for 'dev -> bridge (addr) -> vrf' setups are getting
dropped. This patch enables IPv6 communications for bridges with an
address that are enslaved to a VRF.
Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Following command:
iptables -D FORWARD -m physdev ...
causes connectivity loss in some setups.
Reason is that iptables userspace will probe kernel for the module revision
of the physdev patch, and physdev has an artificial dependency on
br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
is loaded).
This causes the "phydev" module to be loaded, which in turn enables the
"call-iptables" infrastructure.
bridged packets might then get dropped by the iptables ruleset.
The better fix would be to change the "call-iptables" defaults to 0 and
enforce explicit setting to 1, but that breaks backwards compatibility.
This does the next best thing: add a request_module call to checkentry.
This was a stray '-D ... -m physdev' won't activate br_netfilter
anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
state
When handling DNAT'ed packets on a bridge device, the neighbour cache entry
from lookup was used without checking its state. It means that a cache entry
in the NUD_STALE state will be used directly instead of entering the NUD_DELAY
state to confirm the reachability of the neighbor.
This problem becomes worse after commit 2724680bceee ("neigh: Keep neighbour
cache entries if number of them is small enough."), since all neighbour cache
entries in the NUD_STALE state will be kept in the neighbour table as long as
the number of cache entries does not exceed the value specified in gc_thresh1.
This commit validates the state of a neighbour cache entry before using
the entry.
Signed-off-by: JianJhen Chen <kchen@synology.com>
Reviewed-by: JinLin Chen <jlchen@synology.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.
The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This removes assumption than vlan_tci != 0 when tag is present.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No functional change, convert of nf_call_[ip|ip6|arp]tables to bits.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For starters, the bridge netfilter code registers operations that
are invoked any time nh_hook is called. Specifically, ip_sabotage_in
watches for nested calls for NF_INET_PRE_ROUTING when a bridge is in
the stack.
Packet wise, the bridge netfilter hook runs first. br_nf_pre_routing
allocates nf_bridge, sets in_prerouting to 1 and calls NF_HOOK for
NF_INET_PRE_ROUTING. It's finish function, br_nf_pre_routing_finish,
then resets in_prerouting flag to 0 and the packet continues up the
stack. The packet eventually makes it to the VRF driver and it invokes
nf_hook for NF_INET_PRE_ROUTING in case any rules have been added against
the vrf device.
Because of the registered operations the call to nf_hook causes
ip_sabotage_in to be invoked. That function sees the nf_bridge on the
skb and that in_prerouting is not set. Thinking it is an invalid nested
call it steals (drops) the packet.
Update ip_sabotage_in to recognize that the bridge or one of its upper
devices (e.g., vlan) can be enslaved to a VRF (L3 master device) and
allow the packet to go through the nf_hook a second time.
Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
Reported-by: D'Souza, Nelson <ndsouza@ciena.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|