summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2019-06-14tcp: add tcp_tx_skb_cache sysctlEric Dumazet
Feng Tang reported a performance regression after introduction of per TCP socket tx/rx caches, for TCP over loopback (netperf) There is high chance the regression is caused by a change on how well the 32 KB per-thread page (current->task_frag) can be recycled, and lack of pcp caches for order-3 pages. I could not reproduce the regression myself, cpus all being spinning on the mm spinlocks for page allocs/freeing, regardless of enabling or disabling the per tcp socket caches. It seems best to disable the feature by default, and let admins enabling it. MM layer either needs to provide scalable order-3 pages allocations, or could attempt a trylock on zone->lock if the caller only attempts to get a high-order page and is able to fallback to order-0 ones in case of pressure. Tests run on a 56 cores host (112 hyper threads) - 35.49% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath - 35.49% queued_spin_lock_slowpath - 18.18% get_page_from_freelist - __alloc_pages_nodemask - 18.18% alloc_pages_current skb_page_frag_refill sk_page_frag_refill tcp_sendmsg_locked tcp_sendmsg inet_sendmsg sock_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64_after_hwframe __libc_send + 17.31% __free_pages_ok + 31.43% swapper [kernel.vmlinux] [k] intel_idle + 9.12% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 6.53% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 0.69% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath + 0.68% netperf [kernel.vmlinux] [k] skb_release_data + 0.52% netperf [kernel.vmlinux] [k] tcp_sendmsg_locked 0.46% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Feng Tang <feng.tang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14tcp: add tcp_rx_skb_cache sysctlEric Dumazet
Instead of relying on rps_needed, it is safer to use a separate static key, since we do not want to enable TCP rx_skb_cache by default. This feature can cause huge increase of memory usage on hosts with millions of sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14udp: Remove unused variable/function (exact_dif)Tim Beale
This was originally passed through to the VRF logic in compute_score(). But that logic has now been replaced by udp_sk_bound_dev_eq() and so this code is no longer used or needed. Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14udp: Remove unused parameter (exact_dif)Tim Beale
Originally this was used by the VRF logic in compute_score(), but that was later replaced by udp_sk_bound_dev_eq() and the parameter became unused. Note this change adds an 'unused variable' compiler warning that will be removed in the next patch (I've split the removal in two to make review slightly easier). Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14ipv4: tcp: fix ACK/RST sent with a transmit delayEric Dumazet
If we want to set a EDT time for the skb we want to send via ip_send_unicast_reply(), we have to pass a new parameter and initialize ipc.sockc.transmit_time with it. This fixes the EDT time for ACK/RST packets sent on behalf of a TIME_WAIT socket. Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14ipv4: Support multipath hashing on inner IP pkts for GRE tunnelStephen Suryaputra
Multipath hash policy value of 0 isn't distributing since the outer IP dest and src aren't varied eventhough the inner ones are. Since the flow is on the inner ones in the case of tunneled traffic, hashing on them is desired. This is done mainly for IP over GRE, hence only tested for that. But anything else supported by flow dissection should work. v2: Use skb_flow_dissect_flow_keys() directly so that other tunneling can be supported through flow dissection (per Nikolay Aleksandrov). v3: Remove accidental inclusion of ports in the hash keys and clarify the documentation (Nikolay Alexandrov). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14tcp: use static_branch_deferred_inc for clean_acked_data_enabledWillem de Bruijn
Deferred static key clean_acked_data_enabled uses the deferred variants of dec and flush. Do the same for inc. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14docs: kbuild: convert docs to ReST and rename to *.rstMauro Carvalho Chehab
The kbuild documentation clearly shows that the documents there are written at different times: some use markdown, some use their own peculiar logic to split sections. Convert everything to ReST without affecting too much the author's style and avoiding adding uneeded markups. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-12tcp: add optional per socket transmit delayEric Dumazet
Adding delays to TCP flows is crucial for studying behavior of TCP stacks, including congestion control modules. Linux offers netem module, but it has unpractical constraints : - Need root access to change qdisc - Hard to setup on egress if combined with non trivial qdisc like FQ - Single delay for all flows. EDT (Earliest Departure Time) adoption in TCP stack allows us to enable a per socket delay at a very small cost. Networking tools can now establish thousands of flows, each of them with a different delay, simulating real world conditions. This requires FQ packet scheduler or a EDT-enabled NIC. This patchs adds TCP_TX_DELAY socket option, to set a delay in usec units. unsigned int tx_delay = 10000; /* 10 msec */ setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay)); Note that FQ packet scheduler limits might need some tweaking : man tc-fq PARAMETERS limit Hard limit on the real queue size. When this limit is reached, new packets are dropped. If the value is lowered, packets are dropped so that the new limit is met. Default is 10000 packets. flow_limit Hard limit on the maximum number of packets queued per flow. Default value is 100. Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc, so packets would be dropped if any of the previous limit is hit. Use of a jump label makes this support runtime-free, for hosts never using the option. Also note that TSQ (TCP Small Queues) limits are slightly changed with this patch : we need to account that skbs artificially delayed wont stop us providind more skbs to feed the pipe (netem uses skb_orphan_partial() for this purpose, but FQ can not use this trick) Because of that, using big delays might very well trigger old bugs in TSO auto defer logic and/or sndbuf limited detection. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11net: correct udp zerocopy refcnt also when zerocopy only on appendWillem de Bruijn
The below patch fixes an incorrect zerocopy refcnt increment when appending with MSG_MORE to an existing zerocopy udp skb. send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1 send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags) But it missed that zerocopy need not be passed at the first send. The right test whether the uarg is newly allocated and thus has extra refcnt 1 is not !skb, but !skb_zcopy. send(.., MSG_MORE); // <no uarg> send(.., MSG_ZEROCOPY); // refcnt 1 Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10nexthops: add support for replaceDavid Ahern
Add support for atomically upating a nexthop config. When updating a nexthop, walk the lists of associated fib entries and verify the new config is valid. Replace is done by swapping nh_info for single nexthops - new config is applied to old nexthop struct, and old config is moved to new nexthop struct. For nexthop groups the same applies but for nh_group. In addition for groups the nh_parent reference needs to be updated. The old config is released by calling __remove_nexthop on the 'new' nexthop which now has the old config. This is done to avoid messing around with the list_heads that track which fib entries are using the nexthop. After the swap of config data, bump the sequence counters for FIB entries to invalidate any dst entries and send notifications to userspace. The notifications include the new nexthop spec as well as any fib entries using the updated nexthop struct. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10ipv4: Optimization for fib_info lookup with nexthopsDavid Ahern
Be optimistic about re-using a fib_info when nexthop id is given and the route does not use metrics. Avoids a memory allocation which in most cases is expected to be freed anyways. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10ipv4: Allow routes to use nexthop objectsDavid Ahern
Add support for RTA_NH_ID attribute to allow a user to specify a nexthop id to use with a route. fc_nh_id is added to fib_config to hold the value passed in the RTA_NH_ID attribute. If a nexthop id is given, the gateway, device, encap and multipath attributes can not be set. Update fib_nh_match to check ids on a route delete. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10nexthops: Add ipv6 helper to walk all fib6_nh in a nexthop structDavid Ahern
IPv6 has traditionally had a single fib6_nh per fib6_info. With nexthops we can have multiple fib6_nh associated with a fib6_info. Add a nexthop helper to invoke a callback for each fib6_nh in a 'struct nexthop'. If the callback returns non-0, the loop is stopped and the return value passed to the caller. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10tcp: Make tcp_fastopen_alloc_ctx staticYueHaibing
Fix sparse warning: net/ipv4/tcp_fastopen.c:75:29: warning: symbol 'tcp_fastopen_alloc_ctx' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10Update my email addressJozsef Kadlecsik
It's better to use my kadlec@netfilter.org email address in the source code. I might not be able to use kadlec@blackhole.kfki.hu in the future. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-09ipv6: tcp: send consistent autoflowlabel in TIME_WAIT stateEric Dumazet
In case autoflowlabel is in action, skb_get_hash_flowi6() derives a non zero skb->hash to the flowlabel. If skb->hash is zero, a flow dissection is performed. Since all TCP skbs sent from ESTABLISH state inherit their skb->hash from sk->sk_txhash, we better keep a copy of sk->sk_txhash into the TIME_WAIT socket. After this patch, ACK or RST packets sent on behalf of a TIME_WAIT socket have the flowlabel that was previously used by the flow. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09tcp: fix undo spurious SYNACK in passive Fast OpenYuchung Cheng
Commit 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") may cause tcp_fastretrans_alert() to warn about pending retransmission in Open state. This is triggered when the Fast Open server both sends data and has spurious SYNACK retransmission during the handshake, and the data packets were lost or reordered. The root cause is a bit complicated: (1) Upon receiving SYN-data: a full socket is created with snd_una = ISN + 1 by tcp_create_openreq_child() (2) On SYNACK timeout the server/sender enters CA_Loss state. (3) Upon receiving the final ACK to complete the handshake, sender does not mark FLAG_SND_UNA_ADVANCED since (1) Sender then calls tcp_process_loss since state is CA_loss by (2) (4) tcp_process_loss() does not invoke undo operations but instead mark REXMIT_LOST to force retransmission (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It changes state to CA_Open but has positive tp->retrans_out (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert() The step that goes wrong is (4) where the undo operation should have been invoked because the ACK successfully acknowledged the SYN sequence. This fixes that by specifically checking undo when the SYN-ACK sequence is acknowledged. Then after tcp_process_loss() the state would be further adjusted based in tcp_fastretrans_alert() to avoid triggering the warning in (6). Fixes: 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09net: ipv4: fib_semantics: fix uninitialized variableEnrico Weigelt
fix an uninitialized variable: CC net/ipv4/fib_semantics.o net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw': net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized] if (!tbl || err) { ^~ Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08Merge tag 'spdx-5.2-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull yet more SPDX updates from Greg KH: "Another round of SPDX header file fixes for 5.2-rc4 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being added, based on the text in the files. We are slowly chipping away at the 700+ different ways people tried to write the license text. All of these were reviewed on the spdx mailing list by a number of different people. We now have over 60% of the kernel files covered with SPDX tags: $ ./scripts/spdxcheck.py -v 2>&1 | grep Files Files checked: 64533 Files with SPDX: 40392 Files with errors: 0 I think the majority of the "easy" fixups are now done, it's now the start of the longer-tail of crazy variants to wade through" * tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429 ...
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-06-07 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix several bugs in riscv64 JIT code emission which forgot to clear high 32-bits for alu32 ops, from Björn and Luke with selftests covering all relevant BPF alu ops from Björn and Jiong. 2) Two fixes for UDP BPF reuseport that avoid calling the program in case of __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption that skb->data is pointing to transport header, from Martin. 3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog workqueue, and a missing restore of sk_write_space when psock gets dropped, from Jakub and John. 4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it breaks standard applications like DNS if reverse NAT is not performed upon receive, from Daniel. 5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6 fails to verify that the length of the tuple is long enough, from Lorenz. 6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for {un,}successful probe) as that is expected to be propagated as an fd to load_sk_storage_btf() and thus closing the wrong descriptor otherwise, from Michal. 7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir. 8) Minor misc fixes in docs, samples and selftests, from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Free AF_PACKET po->rollover properly, from Willem de Bruijn. 2) Read SFP eeprom in max 16 byte increments to avoid problems with some SFP modules, from Russell King. 3) Fix UDP socket lookup wrt. VRF, from Tim Beale. 4) Handle route invalidation properly in s390 qeth driver, from Julian Wiedmann. 5) Memory leak on unload in RDS, from Zhu Yanjun. 6) sctp_process_init leak, from Neil HOrman. 7) Fix fib_rules rule insertion semantic change that broke Android, from Hangbin Liu. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits) pktgen: do not sleep with the thread lock held. net: mvpp2: Use strscpy to handle stat strings net: rds: fix memory leak in rds_ib_flush_mr_pool ipv6: fix EFAULT on sendto with icmpv6 and hdrincl ipv6: use READ_ONCE() for inet->hdrincl as in ipv4 Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied" net: aquantia: fix wol configuration not applied sometimes ethtool: fix potential userspace buffer overflow Fix memory leak in sctp_process_init net: rds: fix memory leak when unload rds_rdma ipv6: fix the check before getting the cookie in rt6_get_cookie ipv4: not do cache for local delivery if bc_forwarding is enabled s390/qeth: handle error when updating TX queue count s390/qeth: fix VLAN attribute in bridge_hostnotify udev event s390/qeth: check dst entry before use s390/qeth: handle limited IPv4 broadcast in L3 TX path net: fix indirect calls helpers for ptype list hooks. net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set udp: only choose unbound UDP socket for multicast when not in a VRF net/tls: replace the sleeping lock around RX resync with a bit lock ...
2019-06-06bpf: fix unconnected udp hooksDaniel Borkmann
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06xfrm: remove type and offload_type map from xfrm_state_afinfoFlorian Westphal
Only a handful of xfrm_types exist, no need to have 512 pointers for them. Reduces size of afinfo struct from 4k to 120 bytes on 64bit platforms. Also, the unregister function doesn't need to return an error, no single caller does anything useful with it. Just place a WARN_ON() where needed instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06xfrm: remove eth_proto value from xfrm_state_afinfoFlorian Westphal
xfrm_prepare_input needs to lookup the state afinfo backend again to fetch the address family ethernet protocol value. There are only two address families, so a switch statement is simpler. While at it, use u8 for family and proto and remove the owner member -- its not used anywhere. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05inet_connection_sock: remove unused parameter of reqsk_queue_unlink funcZhiqiang Liu
small cleanup: "struct request_sock_queue *queue" parameter of reqsk_queue_unlink func is never used in the func, so we can remove it. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05ipv4: not do cache for local delivery if bc_forwarding is enabledXin Long
With the topo: h1 ---| rp1 | | route rp3 |--- h3 (192.168.200.1) h2 ---| rp2 | If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on h2, and the packets can still be forwared. This issue was caused by the input route cache. It should only do the cache for either bc forwarding or local delivery. Otherwise, local delivery can use the route cache for bc forwarding of other interfaces. This patch is to fix it by not doing cache for local delivery if all.bc_forwarding is enabled. Note that we don't fix it by checking route cache local flag after rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the common route code shouldn't be touched for bc_forwarding. Fixes: 5cbf777cfdf6 ("route: add support for directed broadcast forwarding") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05net: ipv4: drop unneeded likely() call around IS_ERR()Enrico Weigelt
IS_ERR() already calls unlikely(), so this extra unlikely() call around IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 422Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms and conditions of the gnu general public license version 2 as published by the free software foundation extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 101 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190531190113.822954939@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 269Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 51 franklin street fifth floor boston ma 02110 1301 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 21 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141334.228102212@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05xfrm: remove init_flags indirection from xfrm_state_afinfoFlorian Westphal
There is only one implementation of this function; just call it directly. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05xfrm: remove init_temprop indirection from xfrm_state_afinfoFlorian Westphal
same as previous patch: just place this in the caller, no need to have an indirection for a structure initialization. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05xfrm: remove init_tempsel indirection from xfrm_state_afinfoFlorian Westphal
Simple initialization, handle it in the caller. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-04ipv6: Plumb support for nexthop object in a fib6_infoDavid Ahern
Add struct nexthop and nh_list list_head to fib6_info. nh_list is the fib6_info side of the nexthop <-> fib_info relationship. Since a fib6_info referencing a nexthop object can not have 'sibling' entries (the old way of doing multipath routes), the nh_list is a union with fib6_siblings. Add f6i_list list_head to 'struct nexthop' to track fib6_info entries using a nexthop instance. Update __remove_nexthop_fib to walk f6_list and delete fib entries using the nexthop. Add a few nexthop helpers for use when a nexthop is added to fib6_info: - nexthop_fib6_nh - return first fib6_nh in a nexthop object - fib6_info_nh_dev moved to nexthop.h and updated to use nexthop_fib6_nh if the fib6_info references a nexthop object - nexthop_path_fib6_result - similar to ipv4, select a path within a multipath nexthop object. If the nexthop is a blackhole, set fib6_result type to RTN_BLACKHOLE, and set the REJECT flag Update the fib6_info references to check for nh and take a different path as needed: - rt6_qualify_for_ecmp - if a fib entry uses a nexthop object it can NOT be coalesced with other fib entries into a multipath route - rt6_duplicate_nexthop - use nexthop_cmp if either fib6_info references a nexthop - addrconf (host routes), RA's and info entries (anything configured via ndisc) does not use nexthop objects - fib6_info_destroy_rcu - put reference to nexthop object - fib6_purge_rt - drop fib6_info from f6i_list - fib6_select_path - update to use the new nexthop_path_fib6_result when fib entry uses a nexthop object - rt6_device_match - update to catch use of nexthop object as a blackhole and set fib6_type and flags. - ip6_route_info_create - don't add space for fib6_nh if fib entry is going to reference a nexthop object, take a reference to nexthop object, disallow use of source routing - rt6_nlmsg_size - add space for RTA_NH_ID - add rt6_fill_node_nexthop to add nexthop data on a dump As with ipv4, most of the changes push existing code into the else branch of whether the fib entry uses a nexthop object. Update the nexthop code to walk f6i_list on a nexthop deleted to remove fib entries referencing it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Plumb support for nexthop object in a fib_infoDavid Ahern
Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the fib_info side of the nexthop <-> fib_info relationship. Add fi_list list_head to 'struct nexthop' to track fib_info entries using a nexthop instance. Add __remove_nexthop_fib and add it to __remove_nexthop to walk the new list_head and mark those fib entries as dead when the nexthop is deleted. Add a few nexthop helpers for use when a nexthop is added to fib_info: - nexthop_cmp to determine if 2 nexthops are the same - nexthop_path_fib_result to select a path for a multipath 'struct nexthop' - nexthop_fib_nhc to select a specific fib_nh_common within a multipath 'struct nexthop' Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop case. Update the fib_info functions to check for fi->nh and take a different path as needed: - free_fib_info_rcu - put the nexthop object reference - fib_release_info - remove the fib_info from the nexthop's fi_list - nh_comp - use nexthop_cmp when either fib_info references a nexthop object - fib_info_hashfn - use the nexthop id for the hashing vs the oif of each fib_nh in a fib_info - fib_nlmsg_size - add space for the RTA_NH_ID attribute - fib_create_info - verify nexthop reference can be taken, verify nexthop spec is valid for fib entry, and add fib_info to fi_list for a nexthop - fib_select_multipath - use the new nexthop_path_fib_result to select a path when nexthop objects are used - fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat it the same as a fib entry using 'blackhole' The bulk of the changes are in fib_semantics.c and most of that is moving the existing change_nexthops into an else branch. Update the nexthop code to walk fi_list on a nexthop deleted to remove fib entries referencing it. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Prepare for fib6_nh from a nexthop objectDavid Ahern
Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes to use a fib6_nh based nexthop. In the end, only code not using a nexthop object in a fib_info should directly access fib_nh in a fib_info without checking the famiy and going through fib_nh_common. Those functions will be marked when it is not directly evident. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04ipv4: Use accessors for fib_info nexthop dataDavid Ahern
Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the fib_dev macro which is an alias for the first nexthop. Replacements: fi->fib_dev --> fib_info_nh(fi, 0)->fib_nh_dev fi->fib_nh --> fib_info_nh(fi, 0) fi->fib_nh[i] --> fib_info_nh(fi, i) fi->fib_nhs --> fib_info_num_path(fi) where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path returns fi->fib_nhs. Move the existing fib_info_nhc to nexthop.h and define the new ones there. A later patch adds a check if a fib_info uses a nexthop object, and defining the helpers in nexthop.h avoid circular header dependencies. After this all remaining open coded references to fi->fib_nhs and fi->fib_nh are in: - fib_create_info and helpers used to lookup an existing fib_info entry, and - the netdev event functions fib_sync_down_dev and fib_sync_up. The latter two will not be reused for nexthops, and the fib_create_info will be updated to handle a nexthop in a fib_info. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04udp: only choose unbound UDP socket for multicast when not in a VRFTim Beale
By default, packets received in another VRF should not be passed to an unbound socket in the default VRF. This patch updates the IPv4 UDP multicast logic to match the unicast VRF logic (in compute_score()), as well as the IPv6 mcast logic (in __udp_v6_is_mcast_sock()). The particular case I noticed was DHCP discover packets going to the 255.255.255.255 address, which are handled by __udp4_lib_mcast_deliver(). The previous code meant that running multiple different DHCP server or relay agent instances across VRFs did not work correctly - any server/relay agent in the default VRF received DHCP discover packets for all other VRFs. Fixes: 6da5b0f027a8 ("net: ensure unbound datagram socket to be chosen when not in a VRF") Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04net: ipv4: fix rcu lockdep splat due to wrong annotationFlorian Westphal
syzbot triggered following splat when strict netlink validation is enabled: net/ipv4/devinet.c:1766 suspicious rcu_dereference_check() usage! This occurs because we hold RTNL mutex, but no rcu read lock. The second call site holds both, so just switch to the _rtnl variant. Reported-by: syzbot+bad6e32808a3a97b1515@syzkaller.appspotmail.com Fixes: 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: fix use-after-free in kfree_skb_listEric Dumazet
syzbot reported nasty use-after-free [1] Lets remove frag_list field from structs ip_fraglist_iter and ip6_fraglist_iter. This seens not needed anyway. [1] : BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947 CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44add9 Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9 RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005 RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf Allocated by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc_node mm/slab.c:3269 [inline] kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199 alloc_skb include/linux/skbuff.h:1058 [inline] __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 8947: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 kfree_skbmem net/core/skbuff.c:625 [inline] kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619 __kfree_skb net/core/skbuff.c:682 [inline] kfree_skb net/core/skbuff.c:699 [inline] kfree_skb+0xf0/0x390 net/core/skbuff.c:693 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708 __dev_xmit_skb net/core/dev.c:3551 [inline] __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179 dst_output include/net/dst.h:433 [inline] ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline] rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:671 ___sys_sendmsg+0x803/0x920 net/socket.c:2292 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330 __do_sys_sendmsg net/socket.c:2339 [inline] __se_sys_sendmsg net/socket.c:2337 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888085a3cbc0 which belongs to the cache skbuff_head_cache of size 224 The buggy address is located 0 bytes inside of 224-byte region [ffff888085a3cbc0, ffff888085a3cca0) The buggy address belongs to the page: page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0 raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter") Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03tcp: use this_cpu_read(*X) instead of *this_cpu_ptr(X)Eric Dumazet
this_cpu_read(*X) is slightly faster than *this_cpu_ptr(X) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03ipv4: icmp: use this_cpu_read() in icmp_sk()Eric Dumazet
this_cpu_read(*X) is faster than *this_cpu_ptr(X) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03bpf: udp: Avoid calling reuseport's bpf_prog from udp_groMartin KaFai Lau
When the commit a6024562ffd7 ("udp: Add GRO functions to UDP socket") added udp[46]_lib_lookup_skb to the udp_gro code path, it broke the reuseport_select_sock() assumption that skb->data is pointing to the transport header. This patch follows an earlier __udp6_lib_err() fix by passing a NULL skb to avoid calling the reuseport's bpf_prog. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-02net: ipv4: provide __rcu annotation for ifa_listFlorian Westphal
ifa_list is protected by rcu, yet code doesn't reflect this. Add the __rcu annotations and fix up all places that are now reported by sparse. I've done this in the same commit to not add intermediate patches that result in new warnings. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: use new in_dev_ifa iteratorsFlorian Westphal
Use in_dev_for_each_ifa_rcu/rtnl instead. This prevents sparse warnings once proper __rcu annotations are added. Signed-off-by: Florian Westphal <fw@strlen.de> t di# Last commands done (6 commands done): Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02netfilter: use in_dev_for_each_ifa_rcuFlorian Westphal
Netfilter hooks are always running under rcu read lock, use the new iterator macro so sparse won't complain once we add proper __rcu annotations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02devinet: use in_dev_for_each_ifa_rcu in more placesFlorian Westphal
This also replaces spots that used for_primary_ifa(). for_primary_ifa() aborts the loop on the first secondary address seen. Replace it with either the rcu or rtnl variant of in_dev_for_each_ifa(), but two places will now also consider secondary addresses too: inet_addr_onlink() and inet_ifa_byprefix(). I do not understand why they should ignore secondary addresses. Why would a secondary address not be considered 'on link'? When matching a prefix, why ignore a matching secondary address? Other places get converted as well, but gain "->flags & SECONDARY" check. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: inetdevice: provide replacement iterators for in_ifaddr walkFlorian Westphal
The ifa_list is protected either by rcu or rtnl lock, but the current iterators do not account for this. This adds two iterators as replacement, a later patch in the series will update them with the needed rcu/rtnl_dereference calls. Its not done in this patch yet to avoid sparse warnings -- the fields lack the proper __rcu annotation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset container Netfilter/IPVS update for net-next: 1) Add UDP tunnel support for ICMP errors in IPVS. Julian Anastasov says: This patchset is a followup to the commit that adds UDP/GUE tunnel: "ipvs: allow tunneling with gue encapsulation". What we do is to put tunnel real servers in hash table (patch 1), add function to lookup tunnels (patch 2) and use it to strip the embedded tunnel headers from ICMP errors (patch 3). 2) Extend xt_owner to match for supplementary groups, from Lukasz Pawelczyk. 3) Remove unused oif field in flow_offload_tuple object, from Taehee Yoo. 4) Release basechain counters from workqueue to skip synchronize_rcu() call. From Florian Westphal. 5) Replace skb_make_writable() by skb_ensure_writable(). Patchset from Florian Westphal. 6) Checksum support for gue encapsulation in IPVS, from Jacky Hu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>