summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-04-04tcp: rate limit ACK sent by SYN_RECV request socketsEric Dumazet
Attackers like to use SYNFLOOD targeting one 5-tuple, as they hit a single RX queue (and cpu) on the victim. If they use random sequence numbers in their SYN, we detect they do not match the expected window and send back an ACK. This patch adds a rate limitation, so that the effect of such attacks is limited to ingress only. We roughly double our ability to absorb such attacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Maciej Żenczykowski <maze@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply()Eric Dumazet
TCP uses per cpu 'sockets' to send some packets : - RST packets ( tcp_v4_send_reset()) ) - ACK packets for SYN_RECV and TIMEWAIT sockets By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree() to not call sk_write_space() since these internal sockets do not care. This gives a small performance improvement, merely by allowing cpu to properly predict the sock_wfree() conditional branch, and avoiding one atomic operation. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp: increment sk_drops for listenersEric Dumazet
Goal: packets dropped by a listener are accounted for. This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock() so that children do not inherit their parent drop count. Note that we no longer increment LINUX_MIB_LISTENDROPS counter when sending a SYNCOOKIE, since the SYN packet generated a SYNACK. We already have a separate LINUX_MIB_SYNCOOKIESSENT Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp: increment sk_drops for dropped rx packetsEric Dumazet
Now ss can report sk_drops, we can instruct TCP to increment this per socket counter when it drops an incoming frame, to refine monitoring and debugging. Following patch takes care of listeners drops. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock_diag: add SK_MEMINFO_DROPSEric Dumazet
Reporting sk_drops to user space was available for UDP sockets using /proc interface. Add this to sock_diag, so that we can have the same information available to ss users, and we'll be able to add sk_drops indications for TCP sockets as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp/dccp: do not touch listener sk_refcnt under synfloodEric Dumazet
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes. By letting listeners use SOCK_RCU_FREE infrastructure, we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets, only listeners are impacted by this change. Peak performance under SYNFLOOD is increased by ~33% : On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps Most consuming functions are now skb_set_owner_w() and sock_wfree() contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp/dccp: use rcu locking in inet_diag_find_one_icsk()Eric Dumazet
RX packet processing holds rcu_read_lock(), so we can remove pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions if inet_diag also holds rcu before calling them. This is needed anyway as __inet_lookup_listener() and inet6_lookup_listener() will soon no longer increment refcount on the found listener. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp/dccp: remove BH disable/enable in lookupEric Dumazet
Since linux 2.6.29, lookups only use rcu locking. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04udp: no longer use SLAB_DESTROY_BY_RCUEric Dumazet
Tom Herbert would like not touching UDP socket refcnt for encapsulated traffic. For this to happen, we need to use normal RCU rules, with a grace period before freeing a socket. UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. This actually removes a lot of complexity in UDP stack. Multicast receives no longer need to hold a bucket spinlock. Note that ip early demux still needs to take a reference on the socket. Same remark for functions used by xt_socket and xt_PROXY netfilter modules, but this might be changed later. Performance for a single UDP socket receiving flood traffic from many RX queues/cpus. Simple udp_rx using simple recvfrom() loop : 438 kpps instead of 374 kpps : 17 % increase of the peak rate. v2: Addressed Willem de Bruijn feedback in multicast handling - keep early demux break in __udp4_lib_demux_lookup() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04net: add SOCK_RCU_FREE socket flagEric Dumazet
We want a generic way to insert an RCU grace period before socket freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too much overhead. SLAB_DESTROY_BY_RCU strict rules force us to take a reference on the socket sk_refcnt, and it is a performance problem for UDP encapsulation, or TCP synflood behavior, as many CPUs might attempt the atomic operations on a shared sk_refcnt UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their lookup can use traditional RCU rules, without refcount changes. They can set the flag only once hashed and visible by other cpus. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04bridge: Fix incorrect variable assignment on error path in br_sysfs_addbrBastien Philbert
This fixes the incorrect variable assignment on error path in br_sysfs_addbr for when the call to kobject_create_and_add fails to assign the value of -EINVAL to the returned variable of err rather then incorrectly return zero making callers think this function has succeededed due to the previous assignment being assigned zero when assigning it the successful return value of the call to sysfs_create_group which is zero. Signed-off-by: Bastien Philbert <bastienphilbert@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv6: l2tp: fix a potential issue in l2tp_ip6_recvHaishuang Yan
pskb_may_pull() can change skb->data, so we have to load ptr/optr at the right place. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv4: l2tp: fix a potential issue in l2tp_ip_recvHaishuang Yan
pskb_may_pull() can change skb->data, so we have to load ptr/optr at the right place. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock: enable timestamping using control messagesSoheil Hassas Yeganeh
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv6: process socket-level control messages in IPv6Soheil Hassas Yeganeh
Process socket-level control messages by invoking __sock_cmsg_send in ip6_datagram_send_ctl for control messages on the SOL_SOCKET layer. This makes sure whenever ip6_datagram_send_ctl is called for udp and raw, we also process socket-level control messages. This is a bit uglier than IPv4, since IPv6 does not have something like ipcm_cookie. Perhaps we can later create a control message cookie for IPv6? Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv6 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv4: process socket-level control messages in IPv4Soheil Hassas Yeganeh
Process socket-level control messages by invoking __sock_cmsg_send in ip_cmsg_send for control messages on the SOL_SOCKET layer. This makes sure whenever ip_cmsg_send is called in udp, icmp, and raw, we also process socket-level control messages. Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv4 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock: accept SO_TIMESTAMPING flags in socket cmsgSoheil Hassas Yeganeh
Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level as a basis to accept timestamping requests per write. This implementation only accepts TX recording flags (i.e., SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE, SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in control messages. Users need to set reporting flags (e.g., SOF_TIMESTAMPING_OPT_ID) per socket via socket options. This commit adds a tsflags field in sockcm_cookie which is set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_* bits in sockcm_cookie.tsflags allowing the control message to override the recording behavior per write, yet maintaining the value of other flags. This patch implements validating the control message and setting tsflags in struct sockcm_cookie. Next commits in this series will actually implement timestamping per write for different protocols. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp: use one bit in TCP_SKB_CB to mark ACK timestampsSoheil Hassas Yeganeh
Currently, to avoid a cache line miss for accessing skb_shinfo, tcp_ack_tstamp skips socket that do not have SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is implemented based on an implicit assumption that the SOF_TIMESTAMPING_TX_ACK is set via socket options for the duration that ACK timestamps are needed. To implement per-write timestamps, this check should be removed and replaced with a per-packet alternative that quickly skips packets missing ACK timestamps marks without a cache-line miss. To enable per-packet marking without a cache line miss, use one bit in TCP_SKB_CB to mark a whether a SKB might need a ack tx timestamp or not. Further checks in tcp_ack_tstamp are not modified and work as before. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFOSoheil Hassas Yeganeh
SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs to associate timestamps with send calls. For TCP connections, tp->snd_una is used as the starting point to calculate relative IDs. This socket option will fail if set before the handshake on a passive TCP fast open connection with data in SYN or SYN/ACK, since setsockopt requires the connection to be in the ESTABLISHED state. To address these, instead of limiting the option to the ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as long as the connection is not in LISTEN or CLOSE states. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock: break up sock_cmsg_snd into __sock_cmsg_snd and loopWillem de Bruijn
To process cmsg's of the SOL_SOCKET level in addition to cmsgs of another level, protocols can call sock_cmsg_send(). This causes a double walk on the cmsghdr list, one for SOL_SOCKET and one for the other level. Extract the inner demultiplex logic from the loop that walks the list, to allow having this called directly from a walker in the protocol specific code. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04Merge branch 'PAGE_CACHE_SIZE-removal'Linus Torvalds
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov: "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. Let's stop pretending that pages in page cache are special. They are not. The first patch with most changes has been done with coccinelle. The second is manual fixups on top. The third patch removes macros definition" [ I was planning to apply this just before rc2, but then I spaced out, so here it is right _after_ rc2 instead. As Kirill suggested as a possibility, I could have decided to only merge the first two patches, and leave the old interfaces for compatibility, but I'd rather get it all done and any out-of-tree modules and patches can trivially do the converstion while still also working with older kernels, so there is little reason to try to maintain the redundant legacy model. - Linus ] * PAGE_CACHE_SIZE-removal: mm: drop PAGE_CACHE_* and page_cache_{get,release} definition mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usageKirill A. Shutemov
Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04sunrpc: Fix skcipher/shash conversionHerbert Xu
The skcpiher/shash conversion introduced a number of bugs in the sunrpc code: 1) Missing calls to skcipher_request_set_tfm lead to crashes. 2) The allocation size of shash_desc is too small which leads to memory corruption. Fixes: 3b5cf20cf439 ("sunrpc: Use skcipher and ahash/shash") Reported-by: J. Bruce Fields <bfields@fieldses.org> Tested-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-04-02netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 addressHaishuang Yan
Since nla_get_in_addr and nla_put_in_addr were implemented, so use them appropriately. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02tcp: remove cwnd moderation after recoveryYuchung Cheng
For non-SACK connections, cwnd is lowered to inflight plus 3 packets when the recovery ends. This is an optional feature in the NewReno RFC 2582 to reduce the potential burst when cwnd is "re-opened" after recovery and inflight is low. This feature is questionably effective because of PRR: when the recovery ends (i.e., snd_una == high_seq) NewReno holds the CA_Recovery state for another round trip to prevent false fast retransmits. But if the inflight is low, PRR will overwrite the moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a receiver responds bogus ACKs (i.e., acking future data) to speed up transfer after recovery, it can only induce a burst up to a window worth of data packets by acking up to SND.NXT. A restart from (short) idle or receiving streched ACKs can both cause such bursts as well. On the other hand, if the recovery ends because the sender detects the losses were spurious (e.g., reordering). This feature unconditionally lowers a reverted cwnd even though nothing was lost. By principle loss recovery module should not update cwnd. Further pacing is much more effective to reduce burst. Hence this patch removes the cwnd moderation feature. v2 changes: revised commit message on bogus ACKs and burst, and missing signature Signed-off-by: Matt Mathis <mattmathis@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Missing device reference in IPSEC input path results in crashes during device unregistration. From Subash Abhinov Kasiviswanathan. 2) Per-queue ISR register writes not being done properly in macb driver, from Cyrille Pitchen. 3) Stats accounting bugs in bcmgenet, from Patri Gynther. 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from Quentin Armitage. 5) SXGBE driver has off-by-one in probe error paths, from Rasmus Villemoes. 6) Fix race in save/swap/delete options in netfilter ipset, from Vishwanath Pai. 7) Ageing time of bridge not set properly when not operating over a switchdev device. Fix from Haishuang Yan. 8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander Duyck. 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet. 10) FEC driver should only access registers that actually exist on the given chipset, fix from Fabio Estevam. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits) net: mvneta: fix changing MTU when using per-cpu processing stmmac: fix MDIO settings Revert "stmmac: Fix 'eth0: No PHY found' regression" stmmac: fix TX normal DESC net: mvneta: use cache_line_size() to get cacheline size net: mvpp2: use cache_line_size() to get cacheline size net: mvpp2: fix maybe-uninitialized warning tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card rtnl: fix msg size calculation in if_nlmsg_size() fec: Do not access unexisting register in Coldfire net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: dsa: mv88e6xxx: Clear the PDOWN bit on setup net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write} bpf: make padding in bpf_tunnel_key explicit ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates bnxt_en: Fix ethtool -a reporting. bnxt_en: Fix typo in bnxt_hwrm_set_pause_common(). bnxt_en: Implement proper firmware message padding. ...
2016-04-01tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filterDaniel Borkmann
Sasha Levin reported a suspicious rcu_dereference_protected() warning found while fuzzing with trinity that is similar to this one: [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage! [ 52.765688] other info that might help us debug this: [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1 [ 52.765701] 1 lock held by a.out/1525: [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20 [ 52.765721] stack backtrace: [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264 [...] [ 52.765768] Call Trace: [ 52.765775] [<ffffffff813e488d>] dump_stack+0x85/0xc8 [ 52.765784] [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110 [ 52.765792] [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90 [ 52.765801] [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun] [ 52.765810] [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun] [ 52.765818] [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210 [ 52.765827] [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun] [ 52.765834] [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690 [ 52.765843] [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60 [ 52.765850] [<ffffffff81261519>] SyS_ioctl+0x79/0x90 [ 52.765858] [<ffffffff81003ba2>] do_syscall_64+0x62/0x140 [ 52.765866] [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25 Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH, DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices. Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu fixes") sk_attach_filter()/sk_detach_filter() now dereferences the filter with rcu_dereference_protected(), checking whether socket lock is held in control path. Since its introduction in 994051625981 ("tun: socket filter support"), tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the sock_owned_by_user(sk) doesn't apply in this specific case and therefore triggers the false positive. Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair that is used by tap filters and pass in lockdep_rtnl_is_held() for the rcu_dereference_protected() checks instead. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-31rtnl: fix msg size calculation in if_nlmsg_size()Nicolas Dichtel
Size of the attribute IFLA_PHYS_PORT_NAME was missing. Fixes: db24a9044ee1 ("net: add support for phys_port_name") CC: David Ahern <dsahern@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-31vti: Add pmtu handling to vti_xmit.Steffen Klassert
We currently rely on the PMTU discovery of xfrm. However if a packet is locally sent, the PMTU mechanism of xfrm tries to do local socket notification what might not work for applications like ping that don't check for this. So add pmtu handling to vti_xmit to report MTU changes immediately. Reported-by: Mark McKinstry <Mark.McKinstry@alliedtelesis.co.nz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-03-30bpf: make padding in bpf_tunnel_key explicitDaniel Borkmann
Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl and tunnel_label members explicit. No issue has been observed, and gcc/llvm does padding for the old struct already, where tunnel_label was not yet present, so the current code works, but since it's part of uapi, make sure we don't introduce holes in structs. Therefore, add tunnel_ext that we can use generically in future (f.e. to flag OAM messages for backends, etc). Also add the offset to the compat tests to be sure should some compilers not padd the tail of the old version of bpf_tunnel_key. Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30ipv6: udp: fix UDP_MIB_IGNOREDMULTI updatesEric Dumazet
IPv6 counters updates use a different macro than IPv4. Fixes: 36cbb2452cbaf ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rick Jones <rick.jones2@hp.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30gro: Allow tunnel stacking in the case of FOU/GUEAlexander Duyck
This patch should fix the issues seen with a recent fix to prevent tunnel-in-tunnel frames from being generated with GRO. The fix itself is correct for now as long as we do not add any devices that support NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the potential to mess things up due to the fact that the outer transport header points to the outer UDP header and not the GRE header as would be expected. Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30sctp: really allow using GFP_KERNEL on sctp_packet_transmitMarcelo Ricardo Leitner
Somehow my patch for commit cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp") missed two important chunks, which are now added. Fixes: cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-By: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30bridge: Allow set bridge ageing time when switchdev disabledHaishuang Yan
When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP, we should ignore this error code and continue to set the ageing time. Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-29netfilter: ip6t_SYNPROXY: remove magic number for hop_limitLiping Zhang
Replace '64' with the per-net ipv6_devconf_all's hop_limit when building the ipv6 header. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29netfilter: bridge: nf queue verdict to use NFQA_VLAN and NFQA_L2HDRStephane Bryant
This makes nf queues use NFQA_VLAN and NFQA_L2HDR in verdict to modify the original skb Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29netfilter: bridge: pass L2 header and VLAN as netlink attributes in queues ↵Stephane Bryant
to userspace - This creates 2 netlink attribute NFQA_VLAN and NFQA_L2HDR. - These are filled up for the PF_BRIDGE family on the way to userspace. - NFQA_VLAN is a nested attribute, with the NFQA_VLAN_PROTO and the NFQA_VLAN_TCI carrying the corresponding vlan_proto and vlan_tci fields from the skb using big endian ordering (and using the CFI bit as the VLAN_TAG_PRESENT flag in vlan_tci as in the skb) Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29netfilter: bridge: add nf_afinfo to enable queuing to userspaceStephane Bryant
This just adds and registers a nf_afinfo for the ethernet bridge, which enables queuing to userspace for the AF_BRIDGE family. No checksum computation is done. Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for you net tree, they are: 1) There was a race condition between parallel save/swap and delete, which resulted a kernel crash due to the increase ref for save, swap, wrong ref decrease operations. Reported and fixed by Vishwanath Pai. 2) OVS should call into CT NAT for packets of new expected connections only when the conntrack state is persisted with the 'commit' option to the OVS CT action. From Jarno Rajahalme. 3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann. 4) Early validation of entry->target_offset to make sure it doesn't take us out from the blob, from Florian Westphal. 5) Again early validation of entry->next_offset to make sure it doesn't take out from the blob, also from Florian. 6) Check that entry->target_offset is always of of sizeof(struct xt_entry) for unconditional entries, when checking both from check_underflow() and when checking for loops in mark_source_chains(), again from Florian. 7) Fix inconsistent behaviour in nfnetlink_queue when NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer overrun, we have to reinject the packet as the user expects. 8) Enforce nul-terminated table names from getsockopt GET_ENTRIES requests. 9) Don't assume skb->sk is set from nft_bridge_reject and synproxy, this fixes a recent update of the code to namespaceify ip_default_ttl, patch from Liping Zhang. This batch comes with four patches to validate x_tables blobs coming from userspace. CONFIG_USERNS exposes the x_tables interface to unpriviledged users and to be honest this interface never received the attention for this move away from the CAP_NET_ADMIN domain. Florian is working on another round with more patches with more sanity checks, so expect a bit more Netfilter fixes in this development cycle than usual. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28[net] drop 'size' argument of sock_recvmsg()Al Viro
all callers have it equal to msg_data_left(msg). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-03-28netfilter: ipv4: fix NULL dereferenceLiping Zhang
Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") use sock_net(skb->sk) to get the net namespace, but we can't assume that sk_buff->sk is always exist, so when it is NULL, oops will happen. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: enforce nul-terminated table name from getsockopt ↵Pablo Neira Ayuso
GET_ENTRIES Make sure the table names via getsockopt GET_ENTRIES is nul-terminated in ebtables and all the x_tables variants and their respective compat code. Uncovered by KASAN. Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast ↵Pablo Neira Ayuso
fails When netlink unicast fails to deliver the message to userspace, we should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject the packet back to the stack. I think the user expects no packet drops when this flag is set due to queueing to userspace errors, no matter if related to the internal queue or when sending the netlink message to userspace. The userspace application will still get the ENOBUFS error via recvmsg() so the user still knows that, with the current configuration that is in place, the userspace application is not consuming the messages at the pace that the kernel needs. Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
2016-03-28netfilter: x_tables: fix unconditional helperFlorian Westphal
Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Problem is that mark_source_chains should not have been called -- the rule doesn't have a next entry, so its supposed to return an absolute verdict of either ACCEPT or DROP. However, the function conditional() doesn't work as the name implies. It only checks that the rule is using wildcard address matching. However, an unconditional rule must also not be using any matches (no -m args). The underflow validator only checked the addresses, therefore passing the 'unconditional absolute verdict' test, while mark_source_chains also tested for presence of matches, and thus proceeeded to the next (not-existent) rule. Unify this so that all the callers have same idea of 'unconditional rule'. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: make sure e->next_offset covers remaining blob sizeFlorian Westphal
Otherwise this function may read data beyond the ruleset blob. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: validate e->target_offset earlyFlorian Westphal
We should check that e->target_offset is sane before mark_source_chains gets called since it will fetch the target entry for loop detection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28openvswitch: call only into reachable nf-nat codeArnd Bergmann
The openvswitch code has gained support for calling into the nf-nat-ipv4/ipv6 modules, however those can be loadable modules in a configuration in which openvswitch is built-in, leading to link errors: net/built-in.o: In function `__ovs_ct_lookup': :(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation' :(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation' The dependency on (!NF_NAT || NF_NAT) prevents similar issues, but NF_NAT is set to 'y' if any of the symbols selecting it are built-in, but the link error happens when any of them are modular. A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in, CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely to be useful in practice, but the driver currently only handles IPv6 being optional. This patch improves the Kconfig dependency so that openvswitch cannot be built-in if either of the two other symbols are set to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute() with two "if (IS_ENABLED())" checks that should catch all corner cases also make the code more readable. The same #ifdef exists ovs_ct_nat_to_attr(), where it does not cause a link error, but for consistency I'm changing it the same way. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28openvswitch: Fix checking for new expected connections.Jarno Rajahalme
OVS should call into CT NAT for packets of new expected connections only when the conntrack state is persisted with the 'commit' option to the OVS CT action. The test for this condition is doubly wrong, as the CT status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather than the mask (IPS_EXPECTED), and due to the wrong assumption that the expected bit would apply only for the first (i.e., 'new') packet of a connection, while in fact the expected bit remains on for the lifetime of an expected connection. The 'ctinfo' value IP_CT_RELATED derived from the ct status can be used instead, as it is only ever applicable to the 'new' packets of the expected connection. Fixes: 05752523e565 ('openvswitch: Interface with NAT.') Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: ipset: fix race condition in ipset save, swap and deleteVishwanath Pai
This fix adds a new reference counter (ref_netlink) for the struct ip_set. The other reference counter (ref) can be swapped out by ip_set_swap and we need a separate counter to keep track of references for netlink events like dump. Using the same ref counter for dump causes a race condition which can be demonstrated by the following script: ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \ counters ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset save & ipset swap hash_ip3 hash_ip2 ipset destroy hash_ip3 /* will crash the machine */ Swap will exchange the values of ref so destroy will see ref = 0 instead of ref = 1. With this fix in place swap will not succeed because ipset save still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink). Both delete and swap will error out if ref_netlink != 0 on the set. Note: The changes to *_head functions is because previously we would increment ref whenever we called these functions, we don't do that anymore. Reviewed-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>