summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
2021-12-16netfilter: ctnetlink: remove expired entries firstFlorian Westphal
When dumping conntrack table to userspace via ctnetlink, check if the ct has already expired before doing any of the 'skip' checks. This expires dead entries faster. /proc handler also removes outdated entries first. Reported-by: Vitaly Zuevsky <vzuevsky@ns1.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-16netfilter: fix regression in looped (broad|multi)cast's MAC handlingIgnacy Gawędzki
In commit 5648b5e1169f ("netfilter: nfnetlink_queue: fix OOB when mac header was cleared"), the test for non-empty MAC header introduced in commit 2c38de4c1f8da7 ("netfilter: fix looped (broad|multi)cast's MAC handling") has been replaced with a test for a set MAC header. This breaks the case when the MAC header has been reset (using skb_reset_mac_header), as is the case with looped-back multicast packets. As a result, the packets ending up in NFQUEUE get a bogus hwaddr interpreted from the first bytes of the IP header. This patch adds a test for a non-empty MAC header in addition to the test for a set MAC header. The same two tests are also implemented in nfnetlink_log.c, where the initial code of commit 2c38de4c1f8da7 ("netfilter: fix looped (broad|multi)cast's MAC handling") has not been touched, but where supposedly the same situation may happen. Fixes: 5648b5e1169f ("netfilter: nfnetlink_queue: fix OOB when mac header was cleared") Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-15netfilter: nf_tables: fix use-after-free in nft_set_catchall_destroy()Eric Dumazet
We need to use list_for_each_entry_safe() iterator because we can not access @catchall after kfree_rcu() call. syzbot reported: BUG: KASAN: use-after-free in nft_set_catchall_destroy net/netfilter/nf_tables_api.c:4486 [inline] BUG: KASAN: use-after-free in nft_set_destroy net/netfilter/nf_tables_api.c:4504 [inline] BUG: KASAN: use-after-free in nft_set_destroy+0x3fd/0x4f0 net/netfilter/nf_tables_api.c:4493 Read of size 8 at addr ffff8880716e5b80 by task syz-executor.3/8871 CPU: 1 PID: 8871 Comm: syz-executor.3 Not tainted 5.16.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x2ed mm/kasan/report.c:247 __kasan_report mm/kasan/report.c:433 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 nft_set_catchall_destroy net/netfilter/nf_tables_api.c:4486 [inline] nft_set_destroy net/netfilter/nf_tables_api.c:4504 [inline] nft_set_destroy+0x3fd/0x4f0 net/netfilter/nf_tables_api.c:4493 __nft_release_table+0x79f/0xcd0 net/netfilter/nf_tables_api.c:9626 nft_rcv_nl_event+0x4f8/0x670 net/netfilter/nf_tables_api.c:9688 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83 blocking_notifier_call_chain kernel/notifier.c:318 [inline] blocking_notifier_call_chain+0x67/0x90 kernel/notifier.c:306 netlink_release+0xcb6/0x1dd0 net/netlink/af_netlink.c:788 __sock_release+0xcd/0x280 net/socket.c:649 sock_close+0x18/0x20 net/socket.c:1314 __fput+0x286/0x9f0 fs/file_table.c:280 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f75fbf28adb Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8 63 fc ff ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 fc ff ff 8b 44 RSP: 002b:00007ffd8da7ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 00007f75fbf28adb RDX: 00007f75fc08e828 RSI: ffffffffffffffff RDI: 0000000000000003 RBP: 00007f75fc08a960 R08: 0000000000000000 R09: 00007f75fc08e830 R10: 00007ffd8da7ed10 R11: 0000000000000293 R12: 00000000002067c3 R13: 00007ffd8da7ed10 R14: 00007f75fc088f60 R15: 0000000000000032 </TASK> Allocated by task 8886: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] ____kasan_kmalloc mm/kasan/common.c:513 [inline] ____kasan_kmalloc mm/kasan/common.c:472 [inline] __kasan_kmalloc+0xa6/0xd0 mm/kasan/common.c:522 kasan_kmalloc include/linux/kasan.h:269 [inline] kmem_cache_alloc_trace+0x1ea/0x4a0 mm/slab.c:3575 kmalloc include/linux/slab.h:590 [inline] nft_setelem_catchall_insert net/netfilter/nf_tables_api.c:5544 [inline] nft_setelem_insert net/netfilter/nf_tables_api.c:5562 [inline] nft_add_set_elem+0x232e/0x2f40 net/netfilter/nf_tables_api.c:5936 nf_tables_newsetelem+0x6ff/0xbb0 net/netfilter/nf_tables_api.c:6032 nfnetlink_rcv_batch+0x1710/0x25f0 net/netfilter/nfnetlink.c:513 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:652 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xdf0 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:724 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2409 ___sys_sendmsg+0xf3/0x170 net/socket.c:2463 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2492 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 15335: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xd1/0x110 mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:235 [inline] __cache_free mm/slab.c:3445 [inline] kmem_cache_free_bulk+0x67/0x1e0 mm/slab.c:3766 kfree_bulk include/linux/slab.h:446 [inline] kfree_rcu_work+0x51c/0xa10 kernel/rcu/tree.c:3273 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Last potentially related work creation: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 __kasan_record_aux_stack+0xb5/0xe0 mm/kasan/generic.c:348 kvfree_call_rcu+0x74/0x990 kernel/rcu/tree.c:3550 nft_set_catchall_destroy net/netfilter/nf_tables_api.c:4489 [inline] nft_set_destroy net/netfilter/nf_tables_api.c:4504 [inline] nft_set_destroy+0x34a/0x4f0 net/netfilter/nf_tables_api.c:4493 __nft_release_table+0x79f/0xcd0 net/netfilter/nf_tables_api.c:9626 nft_rcv_nl_event+0x4f8/0x670 net/netfilter/nf_tables_api.c:9688 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83 blocking_notifier_call_chain kernel/notifier.c:318 [inline] blocking_notifier_call_chain+0x67/0x90 kernel/notifier.c:306 netlink_release+0xcb6/0x1dd0 net/netlink/af_netlink.c:788 __sock_release+0xcd/0x280 net/socket.c:649 sock_close+0x18/0x20 net/socket.c:1314 __fput+0x286/0x9f0 fs/file_table.c:280 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 tracehook_notify_resume include/linux/tracehook.h:189 [inline] exit_to_user_mode_loop kernel/entry/common.c:175 [inline] exit_to_user_mode_prepare+0x27e/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff8880716e5b80 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes inside of 64-byte region [ffff8880716e5b80, ffff8880716e5bc0) The buggy address belongs to the page: page:ffffea0001c5b940 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880716e5c00 pfn:0x716e5 flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000000200 ffffea0000911848 ffffea00007c4d48 ffff888010c40200 raw: ffff8880716e5c00 ffff8880716e5000 000000010000001e 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x242040(__GFP_IO|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE), pid 3638, ts 211086074437, free_ts 211031029429 prep_new_page mm/page_alloc.c:2418 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369 __alloc_pages_node include/linux/gfp.h:570 [inline] kmem_getpages mm/slab.c:1377 [inline] cache_grow_begin+0x75/0x470 mm/slab.c:2593 cache_alloc_refill+0x27f/0x380 mm/slab.c:2965 ____cache_alloc mm/slab.c:3048 [inline] ____cache_alloc mm/slab.c:3031 [inline] __do_cache_alloc mm/slab.c:3275 [inline] slab_alloc mm/slab.c:3316 [inline] __do_kmalloc mm/slab.c:3700 [inline] __kmalloc+0x3b3/0x4d0 mm/slab.c:3711 kmalloc include/linux/slab.h:595 [inline] kzalloc include/linux/slab.h:724 [inline] tomoyo_get_name+0x234/0x480 security/tomoyo/memory.c:173 tomoyo_parse_name_union+0xbc/0x160 security/tomoyo/util.c:260 tomoyo_update_path_number_acl security/tomoyo/file.c:687 [inline] tomoyo_write_file+0x629/0x7f0 security/tomoyo/file.c:1034 tomoyo_write_domain2+0x116/0x1d0 security/tomoyo/common.c:1152 tomoyo_add_entry security/tomoyo/common.c:2042 [inline] tomoyo_supervisor+0xbc7/0xf00 security/tomoyo/common.c:2103 tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline] tomoyo_path_number_perm+0x419/0x590 security/tomoyo/file.c:734 security_file_ioctl+0x50/0xb0 security/security.c:1541 __do_sys_ioctl fs/ioctl.c:868 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0xb3/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1338 [inline] free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389 free_unref_page_prepare mm/page_alloc.c:3309 [inline] free_unref_page+0x19/0x690 mm/page_alloc.c:3388 slab_destroy mm/slab.c:1627 [inline] slabs_destroy+0x89/0xc0 mm/slab.c:1647 cache_flusharray mm/slab.c:3418 [inline] ___cache_free+0x4cc/0x610 mm/slab.c:3480 qlink_free mm/kasan/quarantine.c:146 [inline] qlist_free_all+0x4e/0x110 mm/kasan/quarantine.c:165 kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272 __kasan_slab_alloc+0x97/0xb0 mm/kasan/common.c:444 kasan_slab_alloc include/linux/kasan.h:259 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slab.c:3261 [inline] kmem_cache_alloc_node+0x2ea/0x590 mm/slab.c:3599 __alloc_skb+0x215/0x340 net/core/skbuff.c:414 alloc_skb include/linux/skbuff.h:1126 [inline] nlmsg_new include/net/netlink.h:953 [inline] rtmsg_ifinfo_build_skb+0x72/0x1a0 net/core/rtnetlink.c:3808 rtmsg_ifinfo_event net/core/rtnetlink.c:3844 [inline] rtmsg_ifinfo_event net/core/rtnetlink.c:3835 [inline] rtmsg_ifinfo+0x83/0x120 net/core/rtnetlink.c:3853 netdev_state_change net/core/dev.c:1395 [inline] netdev_state_change+0x114/0x130 net/core/dev.c:1386 linkwatch_do_dev+0x10e/0x150 net/core/link_watch.c:167 __linkwatch_run_queue+0x233/0x6a0 net/core/link_watch.c:213 linkwatch_event+0x4a/0x60 net/core/link_watch.c:252 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 Memory state around the buggy address: ffff8880716e5a80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8880716e5b00: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc >ffff8880716e5b80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ^ ffff8880716e5c00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff8880716e5c80: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08netfilter: conntrack: annotate data-races around ct->timeoutEric Dumazet
(struct nf_conn)->timeout can be read/written locklessly, add READ_ONCE()/WRITE_ONCE() to prevent load/store tearing. BUG: KCSAN: data-race in __nf_conntrack_alloc / __nf_conntrack_find_get write to 0xffff888132e78c08 of 4 bytes by task 6029 on cpu 0: __nf_conntrack_alloc+0x158/0x280 net/netfilter/nf_conntrack_core.c:1563 init_conntrack+0x1da/0xb30 net/netfilter/nf_conntrack_core.c:1635 resolve_normal_ct+0x502/0x610 net/netfilter/nf_conntrack_core.c:1746 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0x72/0x170 net/netfilter/core.c:619 nf_hook include/linux/netfilter.h:262 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push_pending_frames include/net/tcp.h:1897 [inline] tcp_data_snd_check+0x62/0x2e0 net/ipv4/tcp_input.c:5452 tcp_rcv_established+0x880/0x10e0 net/ipv4/tcp_input.c:5947 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0xf2/0x270 net/core/sock.c:2768 release_sock+0x40/0x110 net/core/sock.c:3300 sk_stream_wait_memory+0x435/0x700 net/core/stream.c:145 tcp_sendmsg_locked+0xb85/0x25a0 net/ipv4/tcp.c:1402 tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1440 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:644 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg net/socket.c:724 [inline] __sys_sendto+0x21e/0x2c0 net/socket.c:2036 __do_sys_sendto net/socket.c:2048 [inline] __se_sys_sendto net/socket.c:2044 [inline] __x64_sys_sendto+0x74/0x90 net/socket.c:2044 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888132e78c08 of 4 bytes by task 17446 on cpu 1: nf_ct_is_expired include/net/netfilter/nf_conntrack.h:286 [inline] ____nf_conntrack_find net/netfilter/nf_conntrack_core.c:776 [inline] __nf_conntrack_find_get+0x1c7/0xac0 net/netfilter/nf_conntrack_core.c:807 resolve_normal_ct+0x273/0x610 net/netfilter/nf_conntrack_core.c:1734 nf_conntrack_in+0x1c5/0x88f net/netfilter/nf_conntrack_core.c:1901 ipv6_conntrack_local+0x19/0x20 net/netfilter/nf_conntrack_proto.c:414 nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline] nf_hook_slow+0x72/0x170 net/netfilter/core.c:619 nf_hook include/linux/netfilter.h:262 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0xa3a/0xa60 net/ipv6/ip6_output.c:324 inet6_csk_xmit+0x1a2/0x1e0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x132a/0x1840 net/ipv4/tcp_output.c:1402 __tcp_send_ack+0x1fd/0x300 net/ipv4/tcp_output.c:3956 tcp_send_ack+0x23/0x30 net/ipv4/tcp_output.c:3962 __tcp_ack_snd_check+0x2d8/0x510 net/ipv4/tcp_input.c:5478 tcp_ack_snd_check net/ipv4/tcp_input.c:5523 [inline] tcp_rcv_established+0x8c2/0x10e0 net/ipv4/tcp_input.c:5948 tcp_v6_do_rcv+0x36e/0xa50 net/ipv6/tcp_ipv6.c:1521 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0xf2/0x270 net/core/sock.c:2768 release_sock+0x40/0x110 net/core/sock.c:3300 tcp_sendpage+0x94/0xb0 net/ipv4/tcp.c:1114 inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833 rds_tcp_xmit+0x376/0x5f0 net/rds/tcp_send.c:118 rds_send_xmit+0xbed/0x1500 net/rds/send.c:367 rds_send_worker+0x43/0x200 net/rds/threads.c:200 process_one_work+0x3fc/0x980 kernel/workqueue.c:2298 worker_thread+0x616/0xa70 kernel/workqueue.c:2445 kthread+0x2c7/0x2e0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 value changed: 0x00027cc2 -> 0x00000000 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17446 Comm: kworker/u4:5 Tainted: G W 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: krdsd rds_send_worker Note: I chose an arbitrary commit for the Fixes: tag, because I do not think we need to backport this fix to very old kernels. Fixes: e37542ba111f ("netfilter: conntrack: avoid possible false sharing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08netfilter: nft_exthdr: break evaluation if setting TCP option failsPablo Neira Ayuso
Break rule evaluation on malformed TCP options. Fixes: 99d1712bc41c ("netfilter: exthdr: tcp option set support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-08nft_set_pipapo: Fix bucket load in AVX2 lookup routine for six 8-bit groupsStefano Brivio
The sixth byte of packet data has to be looked up in the sixth group, not in the seventh one, even if we load the bucket data into ymm6 (and not ymm5, for convenience of tracking stalls). Without this fix, matching on a MAC address as first field of a set, if 8-bit groups are selected (due to a small set size) would fail, that is, the given MAC address would never match. Reported-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Cc: <stable@vger.kernel.org> # 5.6.x Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-By: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-30netfilter: nfnetlink_queue: silence bogus compiler warningFlorian Westphal
net/netfilter/nfnetlink_queue.c:601:36: warning: variable 'ctinfo' is uninitialized when used here [-Wuninitialized] if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0) ctinfo is only uninitialized if ct == NULL. Init it to 0 to silence this. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Add selftest for vrf+conntrack, from Florian Westphal. 2) Extend nfqueue selftest to cover nfqueue, also from Florian. 3) Remove duplicated include in nft_payload, from Wan Jiabing. 4) Several improvements to the nat port shadowing selftest, from Phil Sutter. 5) Fix filtering of reply tuple in ctnetlink, from Florent Fourcot. 6) Do not override error with -EINVAL in filter setup path, also from Florent. 7) Honor sysctl_expire_nodest_conn regardless conn_reuse_mode for reused connections, from yangxingwu. 8) Replace snprintf() by sysfs_emit() in xt_IDLETIMER as reported by Coccinelle, from Jing Yao. 9) Incorrect IPv6 tunnel match in flowtable offload, from Will Mortensen. 10) Switch port shadow selftest to use socat, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-11Merge tag 'net-5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, can and netfilter. Current release - regressions: - bpf: do not reject when the stack read size is different from the tracked scalar size - net: fix premature exit from NAPI state polling in napi_disable() - riscv, bpf: fix RV32 broken build, and silence RV64 warning Current release - new code bugs: - net: fix possible NULL deref in sock_reserve_memory - amt: fix error return code in amt_init(); fix stopping the workqueue - ax88796c: use the correct ioctl callback Previous releases - always broken: - bpf: stop caching subprog index in the bpf_pseudo_func insn - security: fixups for the security hooks in sctp - nfc: add necessary privilege flags in netlink layer, limit operations to admin only - vsock: prevent unnecessary refcnt inc for non-blocking connect - net/smc: fix sk_refcnt underflow on link down and fallback - nfnetlink_queue: fix OOB when mac header was cleared - can: j1939: ignore invalid messages per standard - bpf, sockmap: - fix race in ingress receive verdict with redirect to self - fix incorrect sk_skb data_end access when src_reg = dst_reg - strparser, and tls are reusing qdisc_skb_cb and colliding - ethtool: fix ethtool msg len calculation for pause stats - vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to access an unregistering real_dev - udp6: make encap_rcv() bump the v6 not v4 stats - drv: prestera: add explicit padding to fix m68k build - drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge - drv: mvpp2: fix wrong SerDes reconfiguration order Misc & small latecomers: - ipvs: auto-load ipvs on genl access - mctp: sanity check the struct sockaddr_mctp padding fields - libfs: support RENAME_EXCHANGE in simple_rename() - avoid double accounting for pure zerocopy skbs" * tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits) selftests/net: udpgso_bench_rx: fix port argument net: wwan: iosm: fix compilation warning cxgb4: fix eeprom len when diagnostics not implemented net: fix premature exit from NAPI state polling in napi_disable() net/smc: fix sk_refcnt underflow on linkdown and fallback net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer() gve: fix unmatched u64_stats_update_end() net: ethernet: lantiq_etop: Fix compilation error selftests: forwarding: Fix packet matching in mirroring selftests vsock: prevent unnecessary refcnt inc for nonblocking connect net: marvell: mvpp2: Fix wrong SerDes reconfiguration order net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory net: stmmac: allow a tc-taprio base-time of zero selftests: net: test_vxlan_under_vrf: fix HV connectivity test net: hns3: allow configure ETS bandwidth of all TCs net: hns3: remove check VF uc mac exist when set by PF net: hns3: fix some mac statistics is always 0 in device version V2 net: hns3: fix kernel crash when unload VF while it is being reset net: hns3: sync rx ring head in echo common pull net: hns3: fix pfc packet number incorrect after querying pfc parameters ...
2021-11-08netfilter: flowtable: fix IPv6 tunnel addr matchWill Mortensen
Previously the IPv6 addresses in the key were clobbered and the mask was left unset. I haven't tested this; I noticed it while skimming the code to understand an unrelated issue. Fixes: cfab6dbd0ecf ("netfilter: flowtable: add tunnel match offload support") Cc: wenxu <wenxu@ucloud.cn> Signed-off-by: Will Mortensen <willmo@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-08netfilter: xt_IDLETIMER: replace snprintf in show functions with sysfs_emitJing Yao
coccicheck complains about the use of snprintf() in sysfs show functions: WARNING use scnprintf or sprintf Use sysfs_emit instead of scnprintf, snprintf or sprintf makes more sense. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Jing Yao <yao.jing2@zte.com.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-08netfilter: ipvs: Fix reuse connection if RS weight is 0yangxingwu
We are changing expire_nodest_conn to work even for reused connections when conn_reuse_mode=0, just as what was done with commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is dead"). For controlled and persistent connections, the new connection will get the needed real server depending on the rules in ip_vs_check_template(). Fixes: d752c3645717 ("ipvs: allow rescheduling of new connections when port reuse is detected") Co-developed-by: Chuanqi Liu <legend050709@qq.com> Signed-off-by: Chuanqi Liu <legend050709@qq.com> Signed-off-by: yangxingwu <xingwu.yang@gmail.com> Acked-by: Simon Horman <horms@verge.net.au> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-08netfilter: ctnetlink: do not erase error code with EINVALFlorent Fourcot
And be consistent in error management for both orig/reply filtering Fixes: cb8aa9a3affb ("netfilter: ctnetlink: add kernel side filtering for dump") Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-08netfilter: ctnetlink: fix filtering with CTA_TUPLE_REPLYFlorent Fourcot
filter->orig_flags was used for a reply context. Fixes: cb8aa9a3affb ("netfilter: ctnetlink: add kernel side filtering for dump") Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-08netfilter: nft_payload: Remove duplicated include in nft_payload.cWan Jiabing
Fix following checkincludes.pl warning: ./net/netfilter/nft_payload.c: linux/ip.h is included more than once. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "257 patches. Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
2021-11-06include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.hMianhan Liu
nr_free_buffer_pages could be exposed through mm.h instead of swap.h. The advantage of this change is that it can reduce the obsolete includes. For example, net/ipv4/tcp.c wouldn't need swap.h any more since it has already included mm.h. Similarly, after checking all the other files, it comes that tcp.c, udp.c meter.c ,... follow the same rule, so these files can have swap.h removed too. Moreover, after preprocessing all the files that use nr_free_buffer_pages, it turns out that those files have already included mm.h.Thus, we can move nr_free_buffer_pages from swap.h to mm.h safely. This change will not affect the compilation of other files. Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn> Cc: Jakub Kicinski <kuba@kernel.org> CC: Ulf Hansson <ulf.hansson@linaro.org> Cc: "David S . Miller" <davem@davemloft.net> Cc: Simon Horman <horms@verge.net.au> Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net 1) Fix mac address UAF reported by KASAN in nfnetlink_queue, from Florian Westphal. 2) Autoload genetlink IPVS on demand, from Thomas Weissschuh. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: ipvs: autoload ipvs on genl access netfilter: nfnetlink_queue: fix OOB when mac header was cleared ==================== Link: https://lore.kernel.org/r/20211101221528.236114-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use array_size() in ebtables, from Gustavo A. R. Silva. 2) Attach IPS_ASSURED to internal UDP stream state, reported by Maciej Zenczykowski. 3) Add NFT_META_IFTYPE to match on the interface type either from ingress or egress. 4) Generalize pktinfo->tprot_set to flags field. 5) Allow to match on inner headers / payload data. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01netfilter: nft_payload: support for inner header matching / manglingPablo Neira Ayuso
Allow to match and mangle on inner headers / payload data after the transport header. There is a new field in the pktinfo structure that stores the inner header offset which is calculated only when requested. Only TCP and UDP supported at this stage. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: nf_tables: convert pktinfo->tprot_set to flags fieldPablo Neira Ayuso
Generalize boolean field to store more flags on the pktinfo structure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: nft_meta: add NFT_META_IFTYPEPablo Neira Ayuso
Generalize NFT_META_IIFTYPE to NFT_META_IFTYPE which allows you to match on the interface type of the skb->dev field. This field is used by the netdev family to add an implicit dependency to skip non-ethernet packets when matching on layer 3 and 4 TCP/IP header fields. For backward compatibility, add the NFT_META_IIFTYPE alias to NFT_META_IFTYPE. Add __NFT_META_IIFTYPE, to be used by userspace in the future to match specifically on the iiftype. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01netfilter: conntrack: set on IPS_ASSURED if flows enters internal stream statePablo Neira Ayuso
The internal stream state sets the timeout to 120 seconds 2 seconds after the creation of the flow, attach this internal stream state to the IPS_ASSURED flag for consistent event reporting. Before this patch: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Note IPS_ASSURED for the flow not yet in the internal stream state. after this update: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 120 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Before this patch, short-lived UDP flows never entered IPS_ASSURED, so they were already candidate flow to be deleted by early_drop under stress. Before this patch, IPS_ASSURED is set on regardless the internal stream state, attach this internal stream state to IPS_ASSURED. packet #1 (original direction) enters NEW state packet #2 (reply direction) enters ESTABLISHED state, sets on IPS_SEEN_REPLY paclet #3 (any direction) sets on IPS_ASSURED (if 2 seconds since the creation has passed by). Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-26netfilter: conntrack: skip confirmation and nat hooks in postrouting for vrfFlorian Westphal
The VRF driver invokes netfilter for output+postrouting hooks so that users can create rules that check for 'oif $vrf' rather than lower device name. Afterwards, ip stack calls those hooks again. This is a problem when conntrack is used with IP masquerading. masquerading has an internal check that re-validates the output interface to account for route changes. This check will trigger in the vrf case. If the -j MASQUERADE rule matched on the first iteration, then round 2 finds state->out->ifindex != nat->masq_index: the latter is the vrf index, but out->ifindex is the lower device. The packet gets dropped and the conntrack entry is invalidated. This change makes conntrack postrouting skip the nat hooks. Also skip confirmation. This allows the second round (postrouting invocation from ipv4/ipv6) to create nat bindings. This also prevents the second round from seeing packets that had their source address changed by the nat hook. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-22ipvs: autoload ipvs on genl accessThomas Weißschuh
The kernel provides the functionality to automatically load modules providing genl families. Use this to remove the need for users to manually load the module. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-22netfilter: nfnetlink_queue: fix OOB when mac header was clearedFlorian Westphal
On 64bit platforms the MAC header is set to 0xffff on allocation and also when a helper like skb_unset_mac_header() is called. dev_parse_header may call skb_mac_header() which assumes valid mac offset: BUG: KASAN: use-after-free in eth_header_parse+0x75/0x90 Read of size 6 at addr ffff8881075a5c05 by task nf-queue/1364 Call Trace: memcpy+0x20/0x60 eth_header_parse+0x75/0x90 __nfqnl_enqueue_packet+0x1a61/0x3380 __nf_queue+0x597/0x1300 nf_queue+0xf/0x40 nf_hook_slow+0xed/0x190 nf_hook+0x184/0x440 ip_output+0x1c0/0x2a0 nf_reinject+0x26f/0x700 nfqnl_recv_verdict+0xa16/0x18b0 nfnetlink_rcv_msg+0x506/0xe70 The existing code only works if the skb has a mac header. Fixes: 2c38de4c1f8da7 ("netfilter: fix looped (broad|multi)cast's MAC handling") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Lots of simnple overlapping additions. With a build fix from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS for net-next: 1) Add new run_estimation toggle to IPVS to stop the estimation_timer logic, from Dust Li. 2) Relax superfluous dynset check on NFT_SET_TIMEOUT. 3) Add egress hook, from Lukas Wunner. 4) Nowadays, almost all hook functions in x_table land just call the hook evaluation loop. Remove remaining hook wrappers from iptables and IPVS. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data typesAhmed S. Darwish
The only factor differentiating per-CPU bstats data type (struct gnet_stats_basic_cpu) from the packed non-per-CPU one (struct gnet_stats_basic_packed) was a u64_stats sync point inside the former. The two data types are now equivalent: earlier commits added a u64_stats sync point to the latter. Combine both data types into "struct gnet_stats_basic_sync". This eliminates redundancy and simplifies the bstats read/write APIs. Use u64_stats_t for bstats "packets" and "bytes" data types. On 64-bit architectures, u64_stats sync points do not use sequence counter protection. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-18net: sched: Protect Qdisc::bstats with u64_statsAhmed S. Darwish
The not-per-CPU variant of qdisc tc (traffic control) statistics, Qdisc::gnet_stats_basic_packed bstats, is protected with Qdisc::running sequence counter. This sequence counter is used for reliably protecting bstats reads from parallel writes. Meanwhile, the seqcount's write section covers a much wider area than bstats update: qdisc_run_begin() => qdisc_run_end(). That read/write section asymmetry can lead to needless retries of the read section. To prepare for removing the Qdisc::running sequence counter altogether, introduce a u64_stats sync point inside bstats instead. Modify _bstats_update() to start/end the bstats u64_stats write section. For bisectability, and finer commits granularity, the bstats read section is still protected with a Qdisc::running read/retry loop and qdisc_run_begin/end() still starts/ends that seqcount write section. Once all call sites are modified to use _bstats_update(), the Qdisc::running seqcount will be removed and bstats read/retry loop will be modified to utilize the internal u64_stats sync point. Note, using u64_stats implies no sequence counter protection for 64-bit architectures. This can lead to the statistics "packets" vs. "bytes" values getting out of sync on rare occasions. The individual values will still be valid. [bigeasy: Minor commit message edits, init all gnet_stats_basic_packed.] Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-17netfilter: core: Fix clang warnings about unused static inlinesLukas Wunner
Unlike gcc, clang warns about unused static inlines that are not in an include file: net/netfilter/core.c:344:20: error: unused function 'nf_ingress_hook' [-Werror,-Wunused-function] static inline bool nf_ingress_hook(const struct nf_hook_ops *reg, int pf) ^ net/netfilter/core.c:353:20: error: unused function 'nf_egress_hook' [-Werror,-Wunused-function] static inline bool nf_egress_hook(const struct nf_hook_ops *reg, int pf) ^ According to commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"), the proper resolution is to mark the affected functions as __maybe_unused. An alternative approach would be to move them to include/linux/netfilter_netdev.h, but since Pablo didn't do that in commit ddcfa710d40b ("netfilter: add nf_ingress_hook() helper function"), I'm guessing __maybe_unused is preferred. This fixes both the warning introduced by Pablo in v5.10 as well as the one recently introduced by myself with commit 42df6e1d221d ("netfilter: Introduce egress hook"). Fixes: ddcfa710d40b ("netfilter: add nf_ingress_hook() helper function") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: make global sysctl readonly in non-init netnsAntoine Tenart
Because the data pointer of net/ipv4/vs/debug_level is not updated per netns, it must be marked as read-only in non-init netns. Fixes: c6d2d445d8de ("IPVS: netns, final patch enabling network name space.") Signed-off-by: Antoine Tenart <atenart@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: merge ipv4 + ipv6 icmp reply handlersFlorian Westphal
Similar to earlier patches: allow ipv4 and ipv6 to use the same handler. ipv4 and ipv6 specific actions can be done by checking state->pf. v2: split the pf == NFPROTO_IPV4 check (Julian Anastasov) Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: remove unneeded input wrappersFlorian Westphal
After earlier patch ip_vs_hook_in can be used directly. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: remove unneeded output wrappersFlorian Westphal
After earlier patch we can use ip_vs_out_hook directly. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: ipvs: prepare for hook function reductionFlorian Westphal
ipvs has multiple one-line wrappers for hooks, compact them. To avoid a large patch make the two most common helpers use the same function signature as hooks. Next patches can then remove the oneline wrappers. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-14netfilter: Introduce egress hookLukas Wunner
Support classifying packets with netfilter on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a follow-up commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) The egress hook introduced herein complements the ingress hook added by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. Alternatively or in addition to netfilter, packets can be classified with traffic control (tc). On ingress, packets are classified first by tc, then by netfilter. On egress, the order is reversed for symmetry. Conceptually, tc and netfilter can be thought of as layers, with netfilter layered above tc. Traffic control is capable of redirecting packets to another interface (man 8 tc-mirred). E.g., an ingress packet may be redirected from the host namespace to a container via a veth connection: tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container) In this case, netfilter egress classifying is not performed when leaving the host namespace! That's because the packet is still on the tc layer. If tc redirects the packet to a physical interface in the host namespace such that it leaves the system, the packet is never subjected to netfilter egress classifying. That is only logical since it hasn't passed through netfilter ingress classifying either. Packets can alternatively be redirected at the netfilter layer using nft fwd. Such a packet *is* subjected to netfilter egress classifying since it has reached the netfilter layer. Internally, the skb->nf_skip_egress flag controls whether netfilter is invoked on egress by __dev_queue_xmit(). Because __dev_queue_xmit() may be called recursively by tunnel drivers such as vxlan, the flag is reverted to false after sch_handle_egress(). This ensures that netfilter is applied both on the overlay and underlying network. Interaction between tc and netfilter is possible by setting and querying skb->mark. If netfilter egress classifying is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise: Before: 1537 1538 1538 1537 1538 1537 Mb/sec After: 1536 1534 1539 1539 1539 1540 Mb/sec Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec After + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec Before + tc drop: 1620 1619 1619 1619 1620 1620 Mb/sec After + tc drop: 1616 1624 1625 1624 1622 1619 Mb/sec When netfilter egress classifying is enabled on at least one interface, a minimal performance penalty is incurred for every egress packet, even if the interface it's transmitted over doesn't have any netfilter egress rules configured. That is caused by checking dev->nf_hooks_egress against NULL. Measurements were performed on a Core i7-3615QM. Commands to reproduce: ip link add dev foo type dummy ip link set dev foo up modprobe pktgen echo "add_device foo" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1 Accept all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,' Drop all traffic with tc: tc qdisc add dev foo clsact tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,' Apply this patch when measuring packet drops to avoid errors in dmesg: https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/ Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Laura García Liébana <nevola@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: nft_dynset: relax superfluous check on set updatesPablo Neira Ayuso
Relax this condition to make add and update commands idempotent for sets with no timeout. The eval function already checks if the set element timeout is available and updates it if the update command is used. Fixes: 22fe54d5fefc ("netfilter: nf_tables: add support for dynamic set updates") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07ipvs: add sysctl_run_estimation to support disable estimationDust Li
estimation_timer will iterate the est_list to do estimation for each ipvs stats. When there are lots of services, the list can be very large. We found that estimation_timer() run for more then 200ms on a machine with 104 CPU and 50K services. yunhong-cgl jiang report the same phenomenon before: https://www.spinics.net/lists/lvs-devel/msg05426.html In some cases(for example a large K8S cluster with many ipvs services), ipvs estimation may not be needed. So adding a sysctl blob to allow users to disable this completely. Default is: 1 (enable) Cc: yunhong-cgl jiang <xintian1976@gmail.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: nf_tables: skip netdev events generated on netns removalFlorian Westphal
syzbot reported following (harmless) WARN: WARNING: CPU: 1 PID: 2648 at net/netfilter/core.c:468 nft_netdev_unregister_hooks net/netfilter/nf_tables_api.c:230 [inline] nf_tables_unregister_hook include/net/netfilter/nf_tables.h:1090 [inline] __nft_release_basechain+0x138/0x640 net/netfilter/nf_tables_api.c:9524 nft_netdev_event net/netfilter/nft_chain_filter.c:351 [inline] nf_tables_netdev_event+0x521/0x8a0 net/netfilter/nft_chain_filter.c:382 reproducer: unshare -n bash -c 'ip link add br0 type bridge; nft add table netdev t ; \ nft add chain netdev t ingress \{ type filter hook ingress device "br0" \ priority 0\; policy drop\; \}' Problem is that when netns device exit hooks create the UNREGISTER event, the .pre_exit hook for nf_tables core has already removed the base hook. Notifier attempts to do this again. The need to do base hook unregister unconditionally was needed in the past, because notifier was last stage where reg->dev dereference was safe. Now that nf_tables does the hook removal in .pre_exit, this isn't needed anymore. Reported-and-tested-by: syzbot+154bd5be532a63aa778b@syzkaller.appspotmail.com Fixes: 767d1216bff825 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: Kconfig: use 'default y' instead of 'm' for bool config optionVegard Nossum
This option, NF_CONNTRACK_SECMARK, is a bool, so it can never be 'm'. Fixes: 33b8e77605620 ("[NETFILTER]: Add CONFIG_NETFILTER_ADVANCED option") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: xt_IDLETIMER: fix panic that occurs when timer_type has garbage valueJuhee Kang
Currently, when the rule related to IDLETIMER is added, idletimer_tg timer structure is initialized by kmalloc on executing idletimer_tg_create function. However, in this process timer->timer_type is not defined to a specific value. Thus, timer->timer_type has garbage value and it occurs kernel panic. So, this commit fixes the panic by initializing timer->timer_type using kzalloc instead of kmalloc. Test commands: # iptables -A OUTPUT -j IDLETIMER --timeout 1 --label test $ cat /sys/class/xt_idletimer/timers/test Killed Splat looks like: BUG: KASAN: user-memory-access in alarm_expires_remaining+0x49/0x70 Read of size 8 at addr 0000002e8c7bc4c8 by task cat/917 CPU: 12 PID: 917 Comm: cat Not tainted 5.14.0+ #3 79940a339f71eb14fc81aee1757a20d5bf13eb0e Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: dump_stack_lvl+0x6e/0x9c kasan_report.cold+0x112/0x117 ? alarm_expires_remaining+0x49/0x70 __asan_load8+0x86/0xb0 alarm_expires_remaining+0x49/0x70 idletimer_tg_show+0xe5/0x19b [xt_IDLETIMER 11219304af9316a21bee5ba9d58f76a6b9bccc6d] dev_attr_show+0x3c/0x60 sysfs_kf_seq_show+0x11d/0x1f0 ? device_remove_bin_file+0x20/0x20 kernfs_seq_show+0xa4/0xb0 seq_read_iter+0x29c/0x750 kernfs_fop_read_iter+0x25a/0x2c0 ? __fsnotify_parent+0x3d1/0x570 ? iov_iter_init+0x70/0x90 new_sync_read+0x2a7/0x3d0 ? __x64_sys_llseek+0x230/0x230 ? rw_verify_area+0x81/0x150 vfs_read+0x17b/0x240 ksys_read+0xd9/0x180 ? vfs_write+0x460/0x460 ? do_syscall_64+0x16/0xc0 ? lockdep_hardirqs_on+0x79/0x120 __x64_sys_read+0x43/0x50 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0cdc819142 Code: c0 e9 c2 fe ff ff 50 48 8d 3d 3a ca 0a 00 e8 f5 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 RSP: 002b:00007fff28eee5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f0cdc819142 RDX: 0000000000020000 RSI: 00007f0cdc032000 RDI: 0000000000000003 RBP: 00007f0cdc032000 R08: 00007f0cdc031010 R09: 0000000000000000 R10: 0000000000000022 R11: 0000000000000246 R12: 00005607e9ee31f0 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 Fixes: 68983a354a65 ("netfilter: xtables: Add snapshot of hardidletimer target") Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-02netfilter: nf_tables: honor NLM_F_CREATE and NLM_F_EXCL in event notificationPablo Neira Ayuso
Include the NLM_F_CREATE and NLM_F_EXCL flags in netlink event notifications, otherwise userspace cannot distiguish between create and add commands. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-28netfilter: nf_tables: reverse order in rule replacement expansionPablo Neira Ayuso
Deactivate old rule first, then append the new rule, so rule replacement notification via netlink first reports the deletion of the old rule with handle X in first place, then it adds the new rule (reusing the handle X of the replaced old rule). Note that the abort path releases the transaction that has been created by nft_delrule() on error. Fixes: ca08987885a1 ("netfilter: nf_tables: deactivate expressions in rule replecement routine") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-28netfilter: nf_tables: add position handle in event notificationPablo Neira Ayuso
Add position handle to allow to identify the rule location from netlink events. Otherwise, userspace cannot incrementally update a userspace cache through monitoring events. Skip handle dump if the rule has been either inserted (at the beginning of the ruleset) or appended (at the end of the ruleset), the NLM_F_APPEND netlink flag is sufficient in these two cases. Handle NLM_F_REPLACE as NLM_F_APPEND since the rule replacement expansion appends it after the specified rule handle. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: conntrack: serialize hash resizes and cleanupsEric Dumazet
Syzbot was able to trigger the following warning [1] No repro found by syzbot yet but I was able to trigger similar issue by having 2 scripts running in parallel, changing conntrack hash sizes, and: for j in `seq 1 1000` ; do unshare -n /bin/true >/dev/null ; done It would take more than 5 minutes for net_namespace structures to be cleaned up. This is because nf_ct_iterate_cleanup() has to restart everytime a resize happened. By adding a mutex, we can serialize hash resizes and cleanups and also make get_next_corpse() faster by skipping over empty buckets. Even without resizes in the picture, this patch considerably speeds up network namespace dismantles. [1] INFO: task syz-executor.0:8312 can't die for more than 144 seconds. task:syz-executor.0 state:R running task stack:25672 pid: 8312 ppid: 6573 flags:0x00004006 Call Trace: context_switch kernel/sched/core.c:4955 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6236 preempt_schedule_common+0x45/0xc0 kernel/sched/core.c:6408 preempt_schedule_thunk+0x16/0x18 arch/x86/entry/thunk_64.S:35 __local_bh_enable_ip+0x109/0x120 kernel/softirq.c:390 local_bh_enable include/linux/bottom_half.h:32 [inline] get_next_corpse net/netfilter/nf_conntrack_core.c:2252 [inline] nf_ct_iterate_cleanup+0x15a/0x450 net/netfilter/nf_conntrack_core.c:2275 nf_conntrack_cleanup_net_list+0x14c/0x4f0 net/netfilter/nf_conntrack_core.c:2469 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:171 setup_net+0x639/0xa30 net/core/net_namespace.c:349 copy_net_ns+0x319/0x760 net/core/net_namespace.c:470 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3128 __do_sys_unshare kernel/fork.c:3202 [inline] __se_sys_unshare kernel/fork.c:3200 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3200 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f63da68e739 RSP: 002b:00007f63d7c05188 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f63da792f80 RCX: 00007f63da68e739 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000 RBP: 00007f63da6e8cc4 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f63da792f80 R13: 00007fff50b75d3f R14: 00007f63d7c05300 R15: 0000000000022000 Showing all locks held in the system: 1 lock held by khungtaskd/27: #0: ffffffff8b980020 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446 2 locks held by kworker/u4:2/153: #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1198 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:634 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:661 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x896/0x1690 kernel/workqueue.c:2268 #1: ffffc9000140fdb0 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x8ca/0x1690 kernel/workqueue.c:2272 1 lock held by systemd-udevd/2970: 1 lock held by in:imklog/6258: #0: ffff88807f970ff0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990 3 locks held by kworker/1:6/8158: 1 lock held by syz-executor.0/8312: 2 locks held by kworker/u4:13/9320: 1 lock held by syz-executor.5/10178: 1 lock held by syz-executor.4/10217: Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: log: work around missing softdep backend moduleFlorian Westphal
iptables/nftables has two types of log modules: 1. backend, e.g. nf_log_syslog, which implement the functionality 2. frontend, e.g. xt_LOG or nft_log, which call the functionality provided by backend based on nf_tables or xtables rule set. Problem is that the request_module() call to load the backed in nf_logger_find_get() might happen with nftables transaction mutex held in case the call path is via nf_tables/nft_compat. This can cause deadlocks (see 'Fixes' tags for details). The chosen solution as to let modprobe deal with this by adding 'pre: ' soft dep tag to xt_LOG (to load the syslog backend) and xt_NFLOG (to load nflog backend). Eric reports that this breaks on systems with older modprobe that doesn't support softdeps. Another, similar issue occurs when someone either insmods xt_(NF)LOG directly or unloads the backend module (possible if no log frontend is in use): because the frontend module is already loaded, modprobe is not invoked again so the softdep isn't evaluated. Add a workaround: If nf_logger_find_get() returns -ENOENT and call is not via nft_compat, load the backend explicitly and try again. Else, let nft_compat ask for deferred request_module via nf_tables infra. Softdeps are kept in-place, so with newer modprobe the dependencies are resolved from userspace. Fixes: cefa31a9d461 ("netfilter: nft_log: perform module load from nf_tables") Fixes: a38b5b56d6f4 ("netfilter: nf_log: add module softdeps") Reported-and-tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_nat_masquerade: defer conntrack walk to work queueFlorian Westphal
The ipv4 and device notifiers are called with RTNL mutex held. The table walk can take some time, better not block other RTNL users. 'ip a' has been reported to block for up to 20 seconds when conntrack table has many entries and device down events are frequent (e.g., PPP). Reported-and-tested-by: Martin Zaharinov <micron10@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_nat_masquerade: make async masq_inet6_event handling genericFlorian Westphal
masq_inet6_event is called asynchronously from system work queue, because the inet6 notifier is atomic and nf_iterate_cleanup can sleep. The ipv4 and device notifiers call nf_iterate_cleanup directly. This is legal, but these notifiers are called with RTNL mutex held. A large conntrack table with many devices coming and going will have severe impact on the system usability, with 'ip a' blocking for several seconds. This change places the defer code into a helper and makes it more generic so ipv4 and ifdown notifiers can be converted to defer the cleanup walk as well in a follow patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_tables: Fix oversized kvmalloc() callsPablo Neira Ayuso
The commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") limits the max allocatable memory via kvmalloc() to MAX_INT. Reported-by: syzbot+cd43695a64bcd21b8596@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>