Age | Commit message (Collapse) | Author |
|
__ip6_make_skb() gets a cork->dst ref, hands it over to skb and shortly
after puts cork->dst. Save two atomics by stealing it without extra
referencing, ip6_cork_release() handles NULL cork->dst.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Remove leftovers from flowtable modules, from Geert Uytterhoeven.
2) Missing refcount increment of conntrack template in nft_ct,
from Florian Westphal.
3) Reduce nft_zone selftest time, also from Florian.
4) Add selftest to cover stateless NAT on fragments, from Florian Westphal.
5) Do not set net_device when for reject packets from the bridge path,
from Phil Sutter.
6) Cancel register tracking info on nft_byteorder operations.
7) Extend nft_concat_range selftest to cover set reload with no elements,
from Florian Westphal.
8) Remove useless update of pointer in chain blob builder, reported
by kbuild test robot.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nf_tables: remove assignment with no effect in chain blob builder
selftests: nft_concat_range: add test for reload with no element add/del
netfilter: nft_byteorder: track register operations
netfilter: nft_reject_bridge: Fix for missing reply from prerouting
selftests: netfilter: check stateless nat udp checksum fixup
selftests: netfilter: reduce zone stress test running time
netfilter: nft_ct: fix use after free when attaching zone template
netfilter: Remove flowtable relics
====================
Link: https://lore.kernel.org/r/20220127235235.656931-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Increment the mgmt revision due to recent changes.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
BPF verifier supports direct memory access for BPF_PROG_TYPE_TRACING type
of bpf programs, e.g., a->b. If "a" is a pointer
pointing to kernel memory, bpf verifier will allow user to write
code in C like a->b and the verifier will translate it to a kernel
load properly. If "a" is a pointer to user memory, it is expected
that bpf developer should be bpf_probe_read_user() helper to
get the value a->b. Without utilizing BTF __user tagging information,
current verifier will assume that a->b is a kernel memory access
and this may generate incorrect result.
Now BTF contains __user information, it can check whether the
pointer points to a user memory or not. If it is, the verifier
can reject the program and force users to use bpf_probe_read_user()
helper explicitly.
In the future, we can easily extend btf_add_space for other
address space tagging, for example, rcu/percpu etc.
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220127154606.654961-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and can.
Current release - new code bugs:
- tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
- tcp: add a stub for sk_defer_free_flush(), fix CONFIG_INET=n
- nf_tables: set last expression in register tracking area
- nft_connlimit: fix memleak if nf_ct_netns_get() fails
- mptcp: fix removing ids bitmap setting
- bonding: use rcu_dereference_rtnl when getting active slave
- fix three cases of sleep in atomic context in drivers: lan966x, gve
- handful of build fixes for esoteric drivers after netdev->dev_addr
was made const
Previous releases - regressions:
- revert "ipv6: Honor all IPv6 PIO Valid Lifetime values", it broke
Linux compatibility with USGv6 tests
- procfs: show net device bound packet types
- ipv4: fix ip option filtering for locally generated fragments
- phy: broadcom: hook up soft_reset for BCM54616S
Previous releases - always broken:
- ipv4: raw: lock the socket in raw_bind()
- ipv4: decrease the use of shared IPID generator to decrease the
chance of attackers guessing the values
- procfs: fix cross-netns information leakage in /proc/net/ptype
- ethtool: fix link extended state for big endian
- bridge: vlan: fix single net device option dumping
- ping: fix the sk_bound_dev_if match in ping_lookup"
* tag 'net-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits)
net: bridge: vlan: fix memory leak in __allowed_ingress
net: socket: rename SKB_DROP_REASON_SOCKET_FILTER
ipv4: remove sparse error in ip_neigh_gw4()
ipv4: avoid using shared IP generator for connected sockets
ipv4: tcp: send zero IPID in SYNACK messages
ipv4: raw: lock the socket in raw_bind()
MAINTAINERS: add missing IPv4/IPv6 header paths
MAINTAINERS: add more files to eth PHY
net: stmmac: dwmac-sun8i: use return val of readl_poll_timeout()
net: bridge: vlan: fix single net device option dumping
net: stmmac: skip only stmmac_ptp_register when resume from suspend
net: stmmac: configure PTP clock source prior to PTP initialization
Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"
connector/cn_proc: Use task_is_in_init_pid_ns()
pid: Introduce helper task_is_in_init_pid_ns()
gve: Fix GFP flags when allocing pages
net: lan966x: Fix sleep in atomic context when updating MAC table
net: lan966x: Fix sleep in atomic context when injecting frames
ethernet: seeq/ether3: don't write directly to netdev->dev_addr
ethernet: 8390/etherh: don't write directly to netdev->dev_addr
...
|
|
When using per-vlan state, if vlan snooping and stats are disabled,
untagged or priority-tagged ingress frame will go to check pvid state.
If the port state is forwarding and the pvid state is not
learning/forwarding, untagged or priority-tagged frame will be dropped
but skb memory is not freed.
Should free skb when __allowed_ingress returns false.
Fixes: a580c76d534c ("net: bridge: vlan: add per-vlan state")
Signed-off-by: Tim Yi <tim.yi@pica8.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20220127074953.12632-1-tim.yi@pica8.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cppcheck possible warnings:
>> net/netfilter/nf_tables_api.c:2014:2: warning: Assignment of function parameter has no effect outside the function. Did you forget dereferencing it? [uselessAssignmentPtrArg]
ptr += offsetof(struct nft_rule_dp, data);
^
Reported-by: kernel test robot <yujie.liu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Rename SKB_DROP_REASON_SOCKET_FILTER, which is used
as the reason of skb drop out of socket filter before
it's part of a released kernel. It will be used for
more protocols than just TCP in future series.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/all/20220127091308.91401-2-imagedong@tencent.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
hci_cmd_sync_queue shall return an error if HCI_UNREGISTER flag has
been set as that means hci_unregister_dev has been called so it will
likely cause a uaf after the timeout as the hdev will be freed.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
In commit 431280eebed9 ("ipv4: tcp: send zero IPID for RST and
ACK sent in SYN-RECV and TIME-WAIT state") we took care of some
ctl packets sent by TCP.
It turns out we need to use a similar strategy for SYNACK packets.
By default, they carry IP_DF and IPID==0, but there are ways
to ask them to use the hashed IP ident generator and thus
be used to build off-path attacks.
(Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)
One of this way is to force (before listener is started)
echo 1 >/proc/sys/net/ipv4/ip_no_pmtu_disc
Another way is using forged ICMP ICMP_FRAG_NEEDED
with a very small MTU (like 68) to force a false return from
ip_dont_fragment()
In this patch, ip_build_and_send_pkt() uses the following
heuristics.
1) Most SYNACK packets are smaller than IPV4_MIN_MTU and therefore
can use IP_DF regardless of the listener or route pmtu setting.
2) In case the SYNACK packet is bigger than IPV4_MIN_MTU,
we use prandom_u32() generator instead of the IPv4 hashed ident one.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ray Che <xijiache@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Cc: Geoff Alexander <alexandg@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as a temporary storage for descriptors
in all zero-copy drivers that use the batched interface. This will make
it easier to add batching to more drivers.
i40e is the only driver that has a batched Tx zero-copy
implementation, so no need to touch any other driver.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20220125160446.78976-6-maciej.fijalkowski@intel.com
|
|
For some reason, raw_bind() forgot to lock the socket.
BUG: KCSAN: data-race in __ip4_datagram_connect / raw_bind
write to 0xffff8881170d4308 of 4 bytes by task 5466 on cpu 0:
raw_bind+0x1b0/0x250 net/ipv4/raw.c:739
inet_bind+0x56/0xa0 net/ipv4/af_inet.c:443
__sys_bind+0x14b/0x1b0 net/socket.c:1697
__do_sys_bind net/socket.c:1708 [inline]
__se_sys_bind net/socket.c:1706 [inline]
__x64_sys_bind+0x3d/0x50 net/socket.c:1706
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff8881170d4308 of 4 bytes by task 5468 on cpu 1:
__ip4_datagram_connect+0xb7/0x7b0 net/ipv4/datagram.c:39
ip4_datagram_connect+0x2a/0x40 net/ipv4/datagram.c:89
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0x0003007f
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 5468 Comm: syz-executor.5 Not tainted 5.17.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IIUC the TIPC msg helpers are not meant to provide
and exhaustive API, so remove the unused ones.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Not used since v2.6.37.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
get_prp_lan_id() has never been used.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Nothing takes the refcount since v4.9.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When dumping vlan options for a single net device we send the same
entries infinitely because user-space expects a 0 return at the end but
we keep returning skb->len and restarting the dump on retry. Fix it by
returning the value from br_vlan_dump_dev() if it completed or there was
an error. The only case that must return skb->len is when the dump was
incomplete and needs to continue (-EMSGSIZE).
Reported-by: Benjamin Poirier <bpoirier@nvidia.com>
Fixes: 8dcea187088b ("net: bridge: vlan: add rtm definitions and dump support")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit b75326c201242de9495ff98e5d5cff41d7fc0d9d.
This commit breaks Linux compatibility with USGv6 tests. The RFC this
commit was based on is actually an expired draft: no published RFC
currently allows the new behaviour it introduced.
Without full IETF endorsement, the flash renumbering scenario this
patch was supposed to enable is never going to work, as other IPv6
equipements on the same LAN will keep the 2 hours limit.
Fixes: b75326c20124 ("ipv6: Honor all IPv6 PIO Valid Lifetime values")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Different netns has different requirement on the setting of min_adv_mss
sysctl which the advertised MSS will be never lower than.
Enable min_adv_mss to be configured per network namespace.
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Returning -1 does not indicate anything useful.
Use a standard and meaningful error code instead.
Fixes: a26c5fd7622d ("nl802154: add support for security layer")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20220125121426.848337-6-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
|
|
This reverts commit b515d2637276a3810d6595e10ab02c13bfd0b63a.
Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm: xfrm_state_mtu
should return at least 1280 for ipv6") in v5.14 breaks the TCP MSS
calculation in ipsec transport mode, resulting complete stalls of TCP
connections. This happens when the (P)MTU is 1280 or slighly larger.
The desired formula for the MSS is:
MSS = (MTU - ESP_overhead) - IP header - TCP header
However, the above commit clamps the (MTU - ESP_overhead) to a
minimum of 1280, turning the formula into
MSS = max(MTU - ESP overhead, 1280) - IP header - TCP header
With the (P)MTU near 1280, the calculated MSS is too large and the
resulting TCP packets never make it to the destination because they
are over the actual PMTU.
The above commit also causes suboptimal double fragmentation in
xfrm tunnel mode, as described in
https://lore.kernel.org/netdev/20210429202529.codhwpc7w6kbudug@dwarf.suse.cz/
The original problem the above commit was trying to fix is now fixed
by commit 6596a0229541270fb8d38d989f91b78838e5e9da ("xfrm: fix MTU
regression").
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Remove two dead stubs, sk_msg_clear_meta() was never
used, use of xskq_cons_is_full() got replaced by
xsk_tx_writeable() in v5.10.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20220126185412.2776254-1-kuba@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
I forgot tcp had per netns tracking of timewait sockets,
and their sysctl to change the limit.
After 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()"),
whole struct net can be freed before last tw socket is freed.
We need to allocate a separate struct inet_timewait_death_row
object per netns.
tw_count becomes a refcount and gains associated debugging infrastructure.
BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690
CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events pwq_unbound_release_workfn
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
expire_timers kernel/time/timer.c:1466 [inline]
__run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
__run_timers kernel/time/timer.c:1715 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558
invoke_softirq kernel/softirq.c:432 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>
Allocated by task 3635:
kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
kasan_slab_alloc include/linux/kasan.h:260 [inline]
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3230 [inline]
slab_alloc mm/slub.c:3238 [inline]
kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
kmem_cache_zalloc include/linux/slab.h:705 [inline]
net_alloc net/core/net_namespace.c:407 [inline]
copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3048
__do_sys_unshare kernel/fork.c:3119 [inline]
__se_sys_unshare kernel/fork.c:3117 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807d5f9a80
which belongs to the cache net_namespace of size 6528
The buggy address is located 1216 bytes inside of
6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
The buggy address belongs to the page:
page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
memcg:ffff888070023001
flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
prep_new_page mm/page_alloc.c:2434 [inline]
get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
alloc_slab_page mm/slub.c:1799 [inline]
allocate_slab mm/slub.c:1944 [inline]
new_slab+0x28a/0x3b0 mm/slub.c:2004
___slab_alloc+0x87c/0xe90 mm/slub.c:3018
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
slab_alloc_node mm/slub.c:3196 [inline]
slab_alloc mm/slub.c:3238 [inline]
kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
kmem_cache_zalloc include/linux/slab.h:705 [inline]
net_alloc net/core/net_namespace.c:407 [inline]
copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
ksys_unshare+0x445/0x920 kernel/fork.c:3048
__do_sys_unshare kernel/fork.c:3119 [inline]
__se_sys_unshare kernel/fork.c:3117 [inline]
__x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1352 [inline]
free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
free_unref_page_prepare mm/page_alloc.c:3325 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3404
skb_free_head net/core/skbuff.c:655 [inline]
skb_release_data+0x65d/0x790 net/core/skbuff.c:677
skb_release_all net/core/skbuff.c:742 [inline]
__kfree_skb net/core/skbuff.c:756 [inline]
consume_skb net/core/skbuff.c:914 [inline]
consume_skb+0xc2/0x160 net/core/skbuff.c:908
skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
sock_recvmsg_nosec net/socket.c:948 [inline]
sock_recvmsg net/socket.c:966 [inline]
sock_recvmsg net/socket.c:962 [inline]
____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
___sys_recvmsg+0x127/0x200 net/socket.c:2674
__sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Memory state around the buggy address:
ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cancel tracking for byteorder operation, otherwise selector + byteorder
operation is incorrectly reduced if source and destination registers are
the same.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Prior to commit fa538f7cf05aa ("netfilter: nf_reject: add reject skbuff
creation helpers"), nft_reject_bridge did not assign to nskb->dev before
passing nskb on to br_forward(). The shared skbuff creation helpers
introduced in above commit do which seems to confuse br_forward() as
reject statements in prerouting hook won't emit a packet anymore.
Fix this by simply passing NULL instead of 'dev' to the helpers - they
use the pointer for just that assignment, nothing else.
Fixes: fa538f7cf05aa ("netfilter: nf_reject: add reject skbuff creation helpers")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The conversion erroneously removed the refcount increment.
In case we can use the percpu template, we need to increment
the refcount, else it will be released when the skb gets freed.
In case the slowpath is taken, the new template already has a
refcount of 1.
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
NF_FLOW_TABLE_IPV4 and NF_FLOW_TABLE_IPV6 are invisble, selected by
nothing (so they can no longer be enabled), and their last real users
have been removed (nf_flow_table_ipv6.c is empty).
Clean up the leftovers.
Fixes: c42ba4290b2147aa ("netfilter: flowtable: remove ipv4/ipv6 modules")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
xfrm_migrate cannot handle address family change of an xfrm_state.
The symptons are the xfrm_state will be migrated to a wrong address,
and sending as well as receiving packets wil be broken.
This commit fixes it by breaking the original xfrm_state_clone
method into two steps so as to update the props.family before
running xfrm_init_state. As the result, xfrm_state's inner mode,
outer mode, type and IP header length in xfrm_state_migrate can
be updated with the new address family.
Tested with additions to Android's kernel unit test suite:
https://android-review.googlesource.com/c/kernel/tests/+/1885354
Signed-off-by: Yan Yan <evitayan@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
This patch enables distinguishing SAs and SPs based on if_id during
the xfrm_migrate flow. This ensures support for xfrm interfaces
throughout the SA/SP lifecycle.
When there are multiple existing SPs with the same direction,
the same xfrm_selector and different endpoint addresses,
xfrm_migrate might fail with ENODATA.
Specifically, the code path for performing xfrm_migrate is:
Stage 1: find policy to migrate with
xfrm_migrate_policy_find(sel, dir, type, net)
Stage 2: find and update state(s) with
xfrm_migrate_state_find(mp, net)
Stage 3: update endpoint address(es) of template(s) with
xfrm_policy_migrate(pol, m, num_migrate)
Currently "Stage 1" always returns the first xfrm_policy that
matches, and "Stage 3" looks for the xfrm_tmpl that matches the
old endpoint address. Thus if there are multiple xfrm_policy
with same selector, direction, type and net, "Stage 1" might
rertun a wrong xfrm_policy and "Stage 3" will fail with ENODATA
because it cannot find a xfrm_tmpl with the matching endpoint
address.
The fix is to allow userspace to pass an if_id and add if_id
to the matching rule in Stage 1 and Stage 2 since if_id is a
unique ID for xfrm_policy and xfrm_state. For compatibility,
if_id will only be checked if the attribute is set.
Tested with additions to Android's kernel unit test suite:
https://android-review.googlesource.com/c/kernel/tests/+/1668886
Signed-off-by: Yan Yan <evitayan@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The current implementation of HTB offload doesn't support some
parameters. Instead of ignoring them, actively return the EINVAL error
when they are set to non-defaults.
As this patch goes to stable, the driver API is not changed here. If
future drivers support more offload parameters, the checks can be moved
to the driver side.
Note that the buffer and cbuffer parameters are also not supported, but
the tc userspace tool assigns some default values derived from rate and
ceil, and identifying these defaults in sch_htb would be unreliable, so
they are still ignored.
Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220125100654.424570-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sk_gso_max_size is set based on the dst dev. Both users of it
adjust the value by the same offset - (MAX_TCP_HEADER + 1). Rather
than compute the same adjusted value on each call do the adjustment
once when set.
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220125024511.27480-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
HCI_EV_VENDOR is in fact variable length since it acts as metaevent
where a vendor can implement their own event sets.
In addition to it this makes use of bt_dev_warn_ratelimited to supress
the amount of logging in case the event has more data than expected.
Fixes: 3e54c5890c87 ("Bluetooth: hci_event: Use of a function table to handle HCI event")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
The invalid SCO handle error is normally caused by a race in the USB
transport where the data and event happen to be 2 different endpoints
so the event carrying the SCO handle is processed after its data.
Note: This can probably be resolved with use of force_poll_sync
debugfs.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
Pull NFS client updates from Anna Schumaker:
"New Features:
- Basic handling for case insensitive filesystems
- Initial support for fs_locations and server trunking
Bugfixes and Cleanups:
- Cleanups to how the "struct cred *" is handled for the
nfs_access_entry
- Ensure the server has an up to date ctimes before hardlinking or
renaming
- Update 'blocks used' after writeback, fallocate, and clone
- nfs_atomic_open() fixes
- Improvements to sunrpc tracing
- Various null check & indenting related cleanups
- Some improvements to the sunrpc sysfs code:
- Use default_groups in kobj_type
- Fix some potential races and reference leaks
- A few tracepoint cleanups in xprtrdma"
[ This should have gone in during the merge window, but didn't. The
original pull request - sent during the merge window - had gotten
marked as spam and discarded due missing DKIM headers in the email
from Anna. - Linus ]
* tag 'nfs-for-5.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (35 commits)
SUNRPC: Don't dereference xprt->snd_task if it's a cookie
xprtrdma: Remove definitions of RPCDBG_FACILITY
xprtrdma: Remove final dprintk call sites from xprtrdma
sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change()
net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change
NFSv4.1 test and add 4.1 trunking transport
SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt
NFSv4 handle port presence in fs_location server string
NFSv4 expose nfs_parse_server_name function
NFSv4.1 query for fs_location attr on a new file system
NFSv4 store server support for fs_location attribute
NFSv4 remove zero number of fs_locations entries error check
NFSv4: nfs_atomic_open() can race when looking up a non-regular file
NFSv4: Handle case where the lookup of a directory fails
NFSv42: Fallocate and clone should also request 'blocks used'
NFSv4: Allow writebacks to request 'blocks used'
SUNRPC: use default_groups in kobj_type
NFS: use default_groups in kobj_type
NFS: Fix the verifier for case sensitive filesystem in nfs_atomic_open()
NFS: Add a helper to remove case-insensitive aliases
...
|
|
IPv6 GRO considers packets to belong to different flows when their
hop_limit is different. This seems counter-intuitive, the flow is
the same. hop_limit may vary because of various bugs or hacks but
that doesn't mean it's okay for GRO to reorder packets.
Practical impact of this problem on overall TCP performance
is unclear, but TCP itself detects this reordering and bumps
TCPSACKReorder resulting in user complaints.
Eric warns that there may be performance regressions in setups
which do packet spraying across links with similar RTT but different
hop count. To be safe let's target -next and not treat this
as a fix. If the packet spraying is using flow label there should
be no difference in behavior as flow label is checked first.
Note that the code plays an easy to miss trick by upcasting next_hdr
to a u16 pointer and compares next_hdr and hop_limit in one go.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Changes to VLAN filtering are not applicable to cross-chip
notifications.
On a system like this:
.-----. .-----. .-----.
| sw1 +---+ sw2 +---+ sw3 |
'-1-2-' '-1-2-' '-1-2-'
Before this change, upon sw1p1 leaving a bridge, a call to
dsa_port_vlan_filtering would also be made to sw2p1 and sw3p1.
In this scenario:
.---------. .-----. .-----.
| sw1 +---+ sw2 +---+ sw3 |
'-1-2-3-4-' '-1-2-' '-1-2-'
When sw1p4 would leave a bridge, dsa_port_vlan_filtering would be
called for sw2 and sw3 with a non-existing port - leading to array
out-of-bounds accesses and crashes on mv88e6xxx.
Fixes: d371b7c92d19 ("net: dsa: Unset vlan_filtering when ports leave the bridge")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Most of dsa_switch_bridge_leave was, in fact, dealing with the syncing
of VLAN filtering for switches on which that is a global
setting. Separate the two phases to prepare for the cross-chip related
bugfix in the following commit.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP ipv4 uses per-cpu/per-netns ctl sockets in order to send
RST and some ACK packets (on behalf of TIMEWAIT sockets).
This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.
tcp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer
in order to be able to use IPv4 output functions.
Note that I attempted a related change in the past, that had
to be hot-fixed in commit bdbbb8527b6f ("ipv4: tcp: get rid of ugly unicast_sock")
This patch could very well surface old bugs, on layers not
taking care of sk->sk_kern_sock properly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Back in linux-2.6.25 (commit 98c6d1b261e7 "[NETNS]: Make icmpv6_sk per namespace.",
we added private per-cpu/per-netns ipv6 icmp sockets.
This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.
icmp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer.
icmpv6_xmit_lock() already makes sure to lock the chosen per-cpu
socket.
This patch has a considerable impact on the number of netns
that the worker thread in cleanup_net() can dismantle per second,
because ip6mr_sk_done() is no longer called, meaning we no longer
acquire the rtnl mutex, competing with other threads adding new netns.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Back in linux-2.6.25 (commit 4a6ad7a141cb "[NETNS]: Make icmp_sk per namespace."),
we added private per-cpu/per-netns ipv4 icmp sockets.
This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.
icmp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer.
icmp_xmit_lock() already makes sure to lock the chosen per-cpu
socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Prior patches in the series made sure tw_timer_handler()
can be fired after netns has been dismantled/freed.
We no longer have to scan a potentially big TCP ehash
table at netns dismantle.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We will soon get rid of inet_twsk_purge().
This means that tw_timer_handler() might fire after
a netns has been dismantled/freed.
Instead of adding a function (and data structure) to find a netns
from tw->tw_net_cookie, just update the SNMP counters
a bit earlier, when the netns is known to be alive.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We want to allow inet_twsk_kill() working even if netns
has been dismantled/freed, to get rid of inet_twsk_purge().
This patch adds tw->tw_bslot to cache the bind bucket slot
so that inet_twsk_kill() no longer needs to dereference twsk_net(tw)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use kvzalloc()/kvfree() instead of hand coded functions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since IPv4 routes support IPv6 gateways now, we can route IPv4 traffic in
NBMA tunnels.
Signed-off-by: Qing Deng <i@moy.cat>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When one of the three connection complete events is received multiple
times for the same handle, the device is registered multiple times which
leads to memory corruptions. Therefore, consequent events for a single
connection are ignored.
The conn->state can hold different values, therefore HCI_CONN_HANDLE_UNSET
is introduced to identify new connections. To make sure the events do not
contain this or another invalid handle HCI_CONN_HANDLE_MAX and checks
are introduced.
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=215497
Signed-off-by: Soenke Huster <soenke.huster@eknoes.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-01-24
We've added 80 non-merge commits during the last 14 day(s) which contain
a total of 128 files changed, 4990 insertions(+), 895 deletions(-).
The main changes are:
1) Add XDP multi-buffer support and implement it for the mvneta driver,
from Lorenzo Bianconi, Eelco Chaudron and Toke Høiland-Jørgensen.
2) Add unstable conntrack lookup helpers for BPF by using the BPF kfunc
infra, from Kumar Kartikeya Dwivedi.
3) Extend BPF cgroup programs to export custom ret value to userspace via
two helpers bpf_get_retval() and bpf_set_retval(), from YiFei Zhu.
4) Add support for AF_UNIX iterator batching, from Kuniyuki Iwashima.
5) Complete missing UAPI BPF helper description and change bpf_doc.py script
to enforce consistent & complete helper documentation, from Usama Arif.
6) Deprecate libbpf's legacy BPF map definitions and streamline XDP APIs to
follow tc-based APIs, from Andrii Nakryiko.
7) Support BPF_PROG_QUERY for BPF programs attached to sockmap, from Di Zhu.
8) Deprecate libbpf's bpf_map__def() API and replace users with proper getters
and setters, from Christy Lee.
9) Extend libbpf's btf__add_btf() with an additional hashmap for strings to
reduce overhead, from Kui-Feng Lee.
10) Fix bpftool and libbpf error handling related to libbpf's hashmap__new()
utility function, from Mauricio Vásquez.
11) Add support to BTF program names in bpftool's program dump, from Raman Shukhau.
12) Fix resolve_btfids build to pick up host flags, from Connor O'Brien.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (80 commits)
selftests, bpf: Do not yet switch to new libbpf XDP APIs
selftests, xsk: Fix rx_full stats test
bpf: Fix flexible_array.cocci warnings
xdp: disable XDP_REDIRECT for xdp frags
bpf: selftests: add CPUMAP/DEVMAP selftests for xdp frags
bpf: selftests: introduce bpf_xdp_{load,store}_bytes selftest
net: xdp: introduce bpf_xdp_pointer utility routine
bpf: generalise tail call map compatibility check
libbpf: Add SEC name for xdp frags programs
bpf: selftests: update xdp_adjust_tail selftest to include xdp frags
bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature
bpf: introduce frags support to bpf_prog_test_run_xdp()
bpf: move user_size out of bpf_test_init
bpf: add frags support to xdp copy helpers
bpf: add frags support to the bpf_xdp_adjust_tail() API
bpf: introduce bpf_xdp_get_buff_len helper
net: mvneta: enable jumbo frames if the loaded XDP program support frags
bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program
net: mvneta: add frags support to XDP_TX
xdp: add frags support to xdp_return_{buff/frame}
...
====================
Link: https://lore.kernel.org/r/20220124221235.18993-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Zero-length and one-element arrays are deprecated, see:
Documentation/process/deprecated.rst
Flexible-array members should be used instead.
Generated by: scripts/coccinelle/misc/flexible_array.cocci
Fixes: c1ff181ffabc ("selftests/bpf: Extend kfunc selftests")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/alpine.DEB.2.22.394.2201221206320.12220@hadrien
|
|
Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6
sendmsg crash caused by too small MTU") breaks PMTU for xfrm.
A Packet Too Big ICMPv6 message received in response to an ESP
packet will prevent all further communication through the tunnel
if the reported MTU minus the ESP overhead is smaller than 1280.
E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead
is 92 bytes. Receiving a PTB with MTU of 1371 or less will result
in all further packets in the tunnel dropped. A ping through the
tunnel fails with "ping: sendmsg: Invalid argument".
Apparently the MTU on the xfrm route is smaller than 1280 and
fails the check inside ip6_setup_cork() added by 749439bf.
We found this by debugging USGv6/ipv6ready failures. Failing
tests are: "Phase-2 Interoperability Test Scenario IPsec" /
5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation).
Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm:
xfrm_state_mtu should return at least 1280 for ipv6") attempted
to fix this but caused another regression in TCP MSS calculations
and had to be reverted.
The patch below fixes the situation by dropping the MTU
check and instead checking for the underflows described in the
749439bf commit message.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: 749439bfac6e ("ipv6: fix udpv6 sendmsg crash caused by too small MTU")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|