summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2018-04-27tipc: fix bug in function tipc_nl_node_dump_monitorJon Maloy
Commit 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor summary") intended to fix a problem with user tool looping when max number of bearers are enabled. Unfortunately, the wrong version of the commit was posted, so the problem was not solved at all. This commit adds the missing part. Fixes: 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor summary") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27vti6: Change minimum MTU to IPV4_MIN_MTU, vti6 can carry IPv4 tooStefano Brivio
A vti6 interface can carry IPv4 as well, so it makes no sense to enforce a minimum MTU of IPV6_MIN_MTU. If the user sets an MTU below IPV6_MIN_MTU, IPv6 will be disabled on the interface, courtesy of addrconf_notify(). Reported-by: Xin Long <lucien.xin@gmail.com> Fixes: b96f9afee4eb ("ipv4/6: use core net MTU range checking") Fixes: c6741fbed6dc ("vti6: Properly adjust vti6 MTU from MTU of lower device") Fixes: 53c81e95df17 ("ip6_vti: adjust vti mtu according to mtu of lower device") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-04-27 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add extensive BPF helper description into include/uapi/linux/bpf.h and a new script bpf_helpers_doc.py which allows for generating a man page out of it. Thus, every helper in BPF now comes with proper function signature, detailed description and return code explanation, from Quentin. 2) Migrate the BPF collect metadata tunnel tests from BPF samples over to the BPF selftests and further extend them with v6 vxlan, geneve and ipip tests, simplify the ipip tests, improve documentation and convert to bpf_ntoh*() / bpf_hton*() api, from William. 3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only access stack and packet memory. Extend this to allow such helpers to also use map values, which enabled use cases where value from a first lookup can be directly used as a key for a second lookup, from Paul. 4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in order to retrieve XFRM state information containing SPI, peer address and reqid values, from Eyal. 5) Various optimizations in nfp driver's BPF JIT in order to turn ADD and SUB instructions with negative immediate into the opposite operation with a positive immediate such that nfp can better fit small immediates into instructions. Savings in instruction count up to 4% have been observed, from Jakub. 6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info and add support for dumping this through bpftool, from Jiri. 7) Move the BPF sockmap samples over into BPF selftests instead since sockmap was rather a series of tests than sample anyway and this way this can be run from automated bots, from John. 8) Follow-up fix for bpf_adjust_tail() helper in order to make it work with generic XDP, from Nikita. 9) Some follow-up cleanups to BTF, namely, removing unused defines from BTF uapi header and renaming 'name' struct btf_* members into name_off to make it more clear they are offsets into string section, from Martin. 10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since not run directly but invoked from test_sock_addr.sh, from Yonghong. 11) Remove redundant ret assignment in sample BPF loader, from Wang. 12) Add couple of missing files to BPF selftest's gitignore, from Anders. There are two trivial merge conflicts while pulling: 1) Remove samples/sockmap/Makefile since all sockmap tests have been moved to selftests. 2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the file since git should ignore all of them. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27netfilter: nf_tables: skip synchronize_rcu if transaction log is emptyFlorian Westphal
After processing the transaction log, the remaining entries of the log need to be released. However, in some cases no entries remain, e.g. because the transaction did not remove anything. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27netfilter: x_tables: check name length in find_match/target, tooFlorian Westphal
ebtables uses find_match() rather than find_request_match in one case (see bcf4934288402be3464110109a4dae3bd6fb3e93, "netfilter: ebtables: Fix extension lookup with identical name"), so extend the check on name length to those functions too. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27netfilter: Fix handling simultaneous open in TCP conntrackJozsef Kadlecsik
Dominique Martinet reported a TCP hang problem when simultaneous open was used. The problem is that the tcp_conntracks state table is not smart enough to handle the case. The state table could be fixed by introducing a new state, but that would require more lines of code compared to this patch, due to the required backward compatibility with ctnetlink. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Reported-by: Dominique Martinet <asmadeus@codewreck.org> Tested-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()Cong Wang
Similarly, tbl->entries is not initialized after kmalloc(), therefore causes an uninit-value warning in ip_vs_lblc_check_expire(), as reported by syzbot. Reported-by: <syzbot+3e9695f147fb529aa9bc@syzkaller.appspotmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27ipvs: initialize tbl->entries after allocationCong Wang
tbl->entries is not initialized after kmalloc(), therefore causes an uninit-value warning in ip_vs_lblc_check_expire() as reported by syzbot. Reported-by: <syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27Merge tag 'ipvs-for-v4.18' of ↵Pablo Neira Ayuso
http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.18 please consider these IPVS enhancements for v4.18. * Whitepace cleanup * Add Maglev hashing algorithm as a IPVS scheduler Inju Song says "Implements the Google's Maglev hashing algorithm as a IPVS scheduler. Basically it provides consistent hashing but offers some special features about disruption and load balancing. 1) minimal disruption: when the set of destinations changes, a connection will likely be sent to the same destination as it was before. 2) load balancing: each destination will receive an almost equal number of connections. Seel also: [3.4 Consistent Hasing] in https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf " * Fix to correct implementation of Knuth's multiplicative hashing which is used in sh/dh/lblc/lblcr algorithms. Instead the implementation provided by the hash_32() macro is used. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27netfilter: nf_tables: merge exthdr expression into nft coreFlorian Westphal
before: text data bss dec hex filename 5056 844 0 5900 170c net/netfilter/nft_exthdr.ko 102456 2316 401 105173 19ad5 net/netfilter/nf_tables.ko after: 106410 2392 401 109203 1aa93 net/netfilter/nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27netfilter: nf_tables: merge rt expression into nft coreFlorian Westphal
before: text data bss dec hex filename 2657 844 0 3501 dad net/netfilter/nft_rt.ko 100826 2240 401 103467 1942b net/netfilter/nf_tables.ko after: 2657 844 0 3501 dad net/netfilter/nft_rt.ko 102456 2316 401 105173 19ad5 net/netfilter/nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-27netfilter: nf_tables: make meta expression builtinFlorian Westphal
size net/netfilter/nft_meta.ko text data bss dec hex filename 5826 936 1 6763 1a6b net/netfilter/nft_meta.ko 96407 2064 400 98871 18237 net/netfilter/nf_tables.ko after: 100826 2240 401 103467 1942b net/netfilter/nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-26bpf: fix xdp_generic for bpf_adjust_tail usecaseNikita V. Shirokov
When bpf_adjust_tail was introduced for generic xdp, it changed skb's tail pointer, so it was pointing to the new "end of the packet". However skb's len field wasn't properly modified, so on the wire ethernet frame had original (or even bigger, if adjust_head was used) size. This diff is fixing this. Fixes: 198d83bb3 (" bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail") Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-26udp: add gso support to virtual devicesWillem de Bruijn
Virtual devices such as tunnels and bonding can handle large packets. Only segment packets when reaching a physical or loopback device. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: add gso segment cmsgWillem de Bruijn
Allow specifying segment size in the send call. The new control message performs the same function as socket option UDP_SEGMENT while avoiding the extra system call. [ Export udp_cmsg_send for ipv6. -DaveM ] Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: paged allocation with gsoWillem de Bruijn
When sending large datagrams that are later segmented, store data in page frags to avoid copying from linear in skb_segment. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: better wmem accounting on gsoWillem de Bruijn
skb_segment by default transfers allocated wmem from the gso skb to the tail of the segment list. This underreports real truesize of the list, especially if the tail might be dropped. Similar to tcp_gso_segment, update wmem_alloc with the aggregate list truesize and make each segment responsible for its own share by setting skb->destructor. Clear gso_skb->destructor prior to calling skb_segment to skip the default assignment to tail. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: generate gso with UDP_SEGMENTWillem de Bruijn
Support generic segmentation offload for udp datagrams. Callers can concatenate and send at once the payload of multiple datagrams with the same destination. To set segment size, the caller sets socket option UDP_SEGMENT to the length of each discrete payload. This value must be smaller than or equal to the relevant MTU. A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a per send call basis. Total byte length may then exceed MTU. If not an exact multiple of segment size, the last segment will be shorter. The implementation adds a gso_size field to the udp socket, ip(v6) cmsg cookie and inet_cork structure to be able to set the value at setsockopt or cmsg time and to work with both lockless and corked paths. Initial benchmark numbers show UDP GSO about as expensive as TCP GSO. tcp tso 3197 MB/s 54232 msg/s 54232 calls/s 6,457,754,262 cycles tcp gso 1765 MB/s 29939 msg/s 29939 calls/s 11,203,021,806 cycles tcp without tso/gso * 739 MB/s 12548 msg/s 12548 calls/s 11,205,483,630 cycles udp 876 MB/s 14873 msg/s 624666 calls/s 11,205,777,429 cycles udp gso 2139 MB/s 36282 msg/s 36282 calls/s 11,204,374,561 cycles [*] after reverting commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") Measured total system cycles ('-a') for one core while pinning both the network receive path and benchmark process to that core: perf stat -a -C 12 -e cycles \ ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4 Note the reduction in calls/s with GSO. Bytes per syscall drops increases from 1470 to 61818. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: add udp gsoWillem de Bruijn
Implement generic segmentation offload support for udp datagrams. A follow-up patch adds support to the protocol stack to generate such packets. UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits a large payload into a number of discrete UDP datagrams. The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it from UFO (SKB_UDP_GSO). IPPROTO_UDPLITE is excluded, as that protocol has no gso handler registered. [ Export __udp_gso_segment for ipv6. -DaveM ] Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: expose inet cork to udpWillem de Bruijn
UDP segmentation offload needs access to inet_cork in the udp layer. Pass the struct to ip(6)_make_skb instead of allocating it on the stack in that function itself. This patch is a noop otherwise. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26libceph: validate con->state at the top of try_write()Ilya Dryomov
ceph_con_workfn() validates con->state before calling try_read() and then try_write(). However, try_read() temporarily releases con->mutex, notably in process_message() and ceph_con_in_msg_alloc(), opening the window for ceph_con_close() to sneak in, close the connection and release con->sock. When try_write() is called on the assumption that con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock gets passed to the networking stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020 IP: selinux_socket_sendmsg+0x5/0x20 Make sure con->state is valid at the top of try_write() and add an explicit BUG_ON for this, similar to try_read(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/23706 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2018-04-26xfrm: remove VLA usage in __xfrm6_sort()Kees Cook
In the quest to remove all stack VLA usage removed from the kernel[1], just use XFRM_MAX_DEPTH as already done for the "class" array. In one case, it'll do this loop up to 5, the other caller up to 6. [1] https://lkml.org/lkml/2018/3/7/621 Co-developed-by: Andreas Christoforou <andreaschristofo@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merging net into net-next to help the bpf folks avoid some really ugly merge conflicts. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-04-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix to clear the percpu metadata_dst that could otherwise carry stale ip_tunnel_info, from William. 2) Fix that reduces the number of passes in x64 JIT with regards to dead code sanitation to avoid risk of prog rejection, from Gianluca. 3) Several fixes of sockmap programs, besides others, fixing a double page_put() in error path, missing refcount hold for pinned sockmap, adding required -target bpf for clang in sample Makefile, from John. 4) Fix to disable preemption in __BPF_PROG_RUN_ARRAY() paths, from Roman. 5) Fix tools/bpf/ Makefile with regards to a lex/yacc build error seen on older gcc-5, from John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25rds: ib: Fix missing call to rds_ib_dev_put in rds_ib_setup_qpDag Moxnes
The function rds_ib_setup_qp is calling rds_ib_get_client_data and should correspondingly call rds_ib_dev_put. This call was lost in the non-error path with the introduction of error handling done in commit 3b12f73a5c29 ("rds: ib: add error handle") Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25net/smc: keep clcsock reference in smc_tcp_listen_work()Ursula Braun
The internal CLC socket should exist till the SMC-socket is released. Function tcp_listen_worker() releases the internal CLC socket of a listen socket, if an smc_close_active() is called. This function is called for the final release(), but it is called for shutdown SHUT_RDWR as well. This opens a door for protection faults, if socket calls using the internal CLC socket are called for a shutdown listen socket. With the changes of commit 3d502067599f ("net/smc: simplify wait when closing listen socket") there is no need anymore to release the internal CLC socket in function tcp_listen_worker((). It is sufficient to release it in smc_release(). Fixes: 127f49705823 ("net/smc: release clcsock from tcp_listen_worker") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Reported-by: syzbot+9045fc589fcd196ef522@syzkaller.appspotmail.com Reported-by: syzbot+28a2c86cf19c81d871fa@syzkaller.appspotmail.com Reported-by: syzbot+9605e6cace1b5efd4a0a@syzkaller.appspotmail.com Reported-by: syzbot+cf9012c597c8379d535c@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25sctp: remove the unused sctp_assoc_is_match functionXin Long
After Commit 4f0087812648 ("sctp: apply rhashtable api to send/recv path"), there's no place using sctp_assoc_is_match, so remove it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25batman-adv: fix batadv_interface_tx()'s return typeLuc Van Oostenryck
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t', which is a typedef for an enum type, but the implementation in this driver returns an 'int'. Fix this by returning 'netdev_tx_t' in this driver too. Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> [sven@narfation.org: fixed alignment] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-04-25net: rules: Move l3mdev attribute validation to a helperDavid Ahern
Move the check on FRA_L3MDEV attribute to helper to improve the readability of fib_nl2rule. Update the extack messages to be clear when the configuration option is disabled versus an invalid value has been passed. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25sctp: fix identification of new acks for SFR-CACCMarcelo Ricardo Leitner
It's currently written as: if (!tchunk->tsn_gap_acked) { [1] tchunk->tsn_gap_acked = 1; ... } if (TSN_lte(tsn, sack_ctsn)) { if (!tchunk->tsn_gap_acked) { /* SFR-CACC processing */ ... } } Which causes the SFR-CACC processing on ack reception to never process, as tchunk->tsn_gap_acked is always true by then. Block [1] was moved to that position by the commit marked below. This patch fixes it by doing SFR-CACC processing earlier, before tsn_gap_acked is set to true. Fixes: 31b02e154940 ("sctp: Failover transmitted list on transport delete") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25sctp: fix const parameter violation in sctp_make_sackMarcelo Ricardo Leitner
sctp_make_sack() make changes to the asoc and this cast is just bypassing the const attribute. As there is no need to have the const there, just remove it and fix the violation. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25neighbour: support for NTF_EXT_LEARNED flagRoopa Prabhu
This patch extends NTF_EXT_LEARNED support to the neighbour system. Example use-case: An Ethernet VPN implementation (eg in FRR routing suite) can use this flag to add dynamic reachable external neigh entires learned via control plane. The use of neigh NTF_EXT_LEARNED in this patch is consistent with its use with bridge and vxlan fdb entries. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25ipv6: addrconf: don't evaluate keep_addr_on_down twiceIvan Vecera
The addrconf_ifdown() evaluates keep_addr_on_down state twice. There is no need to do it. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap modeAhmed Abdelsalam
ECMP (equal-cost multipath) hashes are typically computed on the packets' 5-tuple(src IP, dst IP, src port, dst port, L4 proto). For encapsulated packets, the L4 data is not readily available and ECMP hashing will often revert to (src IP, dst IP). This will lead to traffic polarization on a single ECMP path, causing congestion and waste of network capacity. In IPv6, the 20-bit flow label field is also used as part of the ECMP hash. In the lack of L4 data, the hashing will be on (src IP, dst IP, flow label). Having a non-zero flow label is thus important for proper traffic load balancing when L4 data is unavailable (i.e., when packets are encapsulated). Currently, the seg6_do_srh_encap() function extracts the original packet's flow label and set it as the outer IPv6 flow label. There are two issues with this behaviour: a) There is no guarantee that the inner flow label is set by the source. b) If the original packet is not IPv6, the flow label will be set to zero (e.g., IPv4 or L2 encap). This patch adds a function, named seg6_make_flowlabel(), that computes a flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and leverages the per namespace 'seg6_flowlabel" sysctl value. The currently support behaviours are as follows: -1 set flowlabel to zero. 0 copy flowlabel from Inner paceket in case of Inner IPv6 (Set flowlabel to 0 in case IPv4/L2) 1 Compute the flowlabel using seg6_make_flowlabel() This patch has been tested for IPv6, IPv4, and L2 traffic. Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25bpf: clear the ip_tunnel_info.William Tu
The percpu metadata_dst might carry the stale ip_tunnel_info and cause incorrect behavior. When mixing tests using ipv4/ipv6 bpf vxlan and geneve tunnel, the ipv6 tunnel info incorrectly uses ipv4's src ip addr as its ipv6 src address, because the previous tunnel info does not clean up. The patch zeros the fields in ip_tunnel_info. Signed-off-by: William Tu <u9012063@gmail.com> Reported-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-04-24bpf: add helper for getting xfrm statesEyal Birger
This commit introduces a helper which allows fetching xfrm state parameters by eBPF programs attached to TC. Prototype: bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags) skb: pointer to skb index: the index in the skb xfrm_state secpath array xfrm_state: pointer to 'struct bpf_xfrm_state' size: size of 'struct bpf_xfrm_state' flags: reserved for future extensions The helper returns 0 on success. Non zero if no xfrm state at the index is found - or non exists at all. struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6 address and the reqid; it can be further extended by adding elements to its end - indicating the populated fields by the 'size' argument - keeping backwards compatibility. Typical usage: struct bpf_xfrm_state x = {}; bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0); ... Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()Eric Dumazet
rt6_remove_exception_rt() is called under rcu_read_lock() only. We lock rt6_exception_lock a bit later, so we do not hold rt6_exception_lock yet. Fixes: 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: David Ahern <dsahern@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24net/tls: remove redundant second null check on sgoutColin Ian King
A duplicated null check on sgout is redundant as it is known to be already true because of the identical earlier check. Remove it. Detected by cppcheck: net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always true. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_serversChris Novakovic
Distributed filesystems are most effective when the server and client clocks are synchronised. Embedded devices often use NFS for their root filesystem but typically do not contain an RTC, so the clocks of the NFS server and the embedded device will be out-of-sync when the root filesystem is mounted (and may not be synchronised until late in the boot process). Extend ipconfig with the ability to export IP addresses of NTP servers it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as follows: - If ipconfig is configured manually via the "ip=" or "nfsaddrs=" kernel command line parameters, one NTP server can be specified in the new "<ntp0-ip>" parameter. - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in the DHCPDISCOVER message, and record the IP addresses of up to three NTP servers sent by the responding DHCP server in the subsequent DHCPOFFER message. ipconfig will only write the NTP server IP addresses it discovers to /proc/net/ipconfig/ntp_servers, one per line (in the order received from the DHCP server, if DHCP autoconfiguration is used); making use of these NTP servers is the responsibility of a user space process (e.g. an initrd/initram script that invokes an NTP client before mounting an NFS root filesystem). Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Create /proc/net/ipconfig directoryChris Novakovic
To allow ipconfig to report IP configuration details to user space processes without cluttering /proc/net, create a new subdirectory /proc/net/ipconfig. All files containing IP configuration details should be written to this directory. Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Correctly initialise ic_nameserversChris Novakovic
ic_nameservers, which stores the list of name servers discovered by ipconfig, is initialised (i.e. has all of its elements set to NONE, or 0xffffffff) by ic_nameservers_predef() in the following scenarios: - before the "ip=" and "nfsaddrs=" kernel command line parameters are parsed (in ip_auto_config_setup()); - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in order to clear any values that may have been set after parsing "ip=" or "nfsaddrs=" and are no longer needed. This means that ic_nameservers_predef() is not called when neither "ip=" nor "nfsaddrs=" is specified on the kernel command line. In this scenario, every element in ic_nameservers remains set to 0x00000000, which is indistinguishable from ANY and causes pnp_seq_show() to write the following (bogus) information to /proc/net/pnp: #MANUAL nameserver 0.0.0.0 nameserver 0.0.0.0 nameserver 0.0.0.0 This is potentially problematic for systems that blindly link /etc/resolv.conf to /proc/net/pnp. Ensure that ic_nameservers is also initialised when neither "ip=" nor "nfsaddrs=" are specified by calling ic_nameservers_predef() in ip_auto_config(), but only when ip_auto_config_setup() was not called earlier. This causes the following to be written to /proc/net/pnp, and is consistent with what gets written when ipconfig is configured manually but no name servers are specified on the kernel command line: #MANUAL Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name serversChris Novakovic
When ipconfig is autoconfigured via BOOTP, the request packet initialised by ic_bootp_init_ext() always allocates 8 bytes for the name server option, limiting the BOOTP server to responding with at most 2 name servers even though ipconfig in fact supports an arbitrary number of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently 3). Only request name servers in the request packet if CONF_NAMESERVERS_MAX is positive (to comply with [1, §3.8]), and allocate enough space in the packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum number we can accept in response. [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions": https://tools.ietf.org/rfc/rfc2132.txt Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: BOOTP: Don't request IEN-116 name serversChris Novakovic
When ipconfig is autoconfigured via BOOTP, the request packet initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name Server" [1, §3.7]), but tag 5 in the response isn't processed by ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears to have been the intended tag to request. This won't cause any breakage for existing users, as tag 5 responses provided by BOOTP servers weren't being processed anyway. [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions": https://tools.ietf.org/rfc/rfc2132.txt Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Tidy up reporting of name serversChris Novakovic
Commit 5e953778a2aab04929a5e7b69f53dc26e39b079e ("ipconfig: add nameserver IPs to kernel-parameter ip=") adds the IP addresses of discovered name servers to the summary printed by ipconfig when configuration is complete. It appears the intention in ip_auto_config() was to print the name servers on a new line (especially given the spacing and lack of comma before "nameserver0="), but they're actually printed on the same line as the NFS root filesystem configuration summary: [ 0.686186] IP-Config: Complete: [ 0.686226] device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1 [ 0.686328] host=test, domain=example.com, nis-domain=(none) [ 0.686386] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= nameserver0=10.0.0.1 This makes it harder to read and parse ipconfig's output. Instead, print the name servers on a separate line: [ 0.791250] IP-Config: Complete: [ 0.791289] device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1 [ 0.791407] host=test, domain=example.com, nis-domain=(none) [ 0.791475] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= [ 0.791476] nameserver0=10.0.0.1 Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24tcp: md5: only call tp->af_specific->md5_lookup() for md5 socketsEric Dumazet
RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive, given they have no result. We can omit the calls for sockets that have no md5 keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24packet: fix bitfield update raceWillem de Bruijn
Updates to the bitfields in struct packet_sock are not atomic. Serialize these read-modify-write cycles. Move po->running into a separate variable. Its writes are protected by po->bind_lock (except for one startup case at packet_create). Also replace a textual precondition warning with lockdep annotation. All others are set only in packet_setsockopt. Serialize these updates by holding the socket lock. Analogous to other field updates, also hold the lock when testing whether a ring is active (pg_vec). Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg") Reported-by: DaeRyong Jeong <threeearcat@gmail.com> Reported-by: Byoungyoung Lee <byoungyoung@purdue.edu> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24Revert "net: init sk_cookie for inet socket"Yafang Shao
This reverts commit <c6849a3ac17e> ("net: init sk_cookie for inet socket") Per discussion with Eric, when update sock_net(sk)->cookie_gen, the whole cache cache line will be invalidated, as this cache line is shared with all cpus, that may cause great performace hit. Bellow is the data form Eric. "Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on my host" when running synflood test. Have to revert it to prevent from cache line false sharing. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24libceph: reschedule a tick in finish_hunting()Ilya Dryomov
If we go without an established session for a while, backoff delay will climb to 30 seconds. The keepalive timeout is also 30 seconds, so it's pretty easily hit after a prolonged hunting for a monitor: we don't get a chance to send out a keepalive in time, which means we never get back a keepalive ack in time, cutting an established session and attempting to connect to a different monitor every 30 seconds: [Sun Apr 1 23:37:05 2018] libceph: mon0 10.80.20.99:6789 session established [Sun Apr 1 23:37:36 2018] libceph: mon0 10.80.20.99:6789 session lost, hunting for new mon [Sun Apr 1 23:37:36 2018] libceph: mon2 10.80.20.103:6789 session established [Sun Apr 1 23:38:07 2018] libceph: mon2 10.80.20.103:6789 session lost, hunting for new mon [Sun Apr 1 23:38:07 2018] libceph: mon1 10.80.20.100:6789 session established [Sun Apr 1 23:38:37 2018] libceph: mon1 10.80.20.100:6789 session lost, hunting for new mon [Sun Apr 1 23:38:37 2018] libceph: mon2 10.80.20.103:6789 session established [Sun Apr 1 23:39:08 2018] libceph: mon2 10.80.20.103:6789 session lost, hunting for new mon The regular keepalive interval is 10 seconds. After ->hunting is cleared in finish_hunting(), call __schedule_delayed() to ensure we send out a keepalive after 10 seconds. Cc: stable@vger.kernel.org # 4.7+ Link: http://tracker.ceph.com/issues/23537 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2018-04-24libceph: un-backoff on tick when we have a authenticated sessionIlya Dryomov
This means that if we do some backoff, then authenticate, and are healthy for an extended period of time, a subsequent failure won't leave us starting our hunting sequence with a large backoff. Mirrors ceph.git commit d466bc6e66abba9b464b0b69687cf45c9dccf383. Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jason Dillaman <dillaman@redhat.com>