summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2017-12-08net: sched: drop qdisc_reset from dev_graft_qdiscJohn Fastabend
In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy' operation is called on the old qdisc. The destroy operation will wait a rcu grace period and call qdisc_rcu_free(). At which point gso_cpu_skb is free'd along with all stats so no need to zero stats and gso_cpu_skb from the graft operation itself. Further after dropping the qdisc locks we can not continue to call qdisc_reset before waiting an rcu grace period so that the qdisc is detached from all cpus. By removing the qdisc_reset() here we get the correct property of waiting an rcu grace period and letting the qdisc_destroy operation clean up the qdisc correctly. Note, a refcnt greater than 1 would cause the destroy operation to be aborted however if this ever happened the reference to the qdisc would be lost and we would have a memory leak. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: explicit locking in gso_cpu fallbackJohn Fastabend
This work is preparing the qdisc layer to support egress lockless qdiscs. If we are running the egress qdisc lockless in the case we overrun the netdev, for whatever reason, the netdev returns a busy error code and the skb is parked on the gso_skb pointer. With many cores all hitting this case at once its possible to have multiple sk_buffs here so we turn gso_skb into a queue. This should be the edge case and if we see this frequently then the netdev/qdisc layer needs to back off. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: a dflt qdisc may be used with per cpu statsJohn Fastabend
Enable dflt qdisc support for per cpu stats before this patch a dflt qdisc was required to use the global statistics qstats and bstats. This adds a static flags field to qdisc_ops that is propagated into qdisc->flags in qdisc allocate call. This allows the allocation block to completely allocate the qdisc object so we don't have dangling allocations after qdisc init. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: remove remaining uses for qdisc_qlen in xmit pathJohn Fastabend
sch_direct_xmit() uses qdisc_qlen as a return value but all call sites of the routine only check if it is zero or not. Simplify the logic so that we don't need to return an actual queue length value. This introduces a case now where sch_direct_xmit would have returned a qlen of zero but now it returns true. However in this case all call sites of sch_direct_xmit will implement a dequeue() and get a null skb and abort. This trades tracking qlen in the hotpath for an extra dequeue operation. Overall this seems to be good for performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: allow qdiscs to handle lockingJohn Fastabend
This patch adds a flag for queueing disciplines to indicate the stack does not need to use the qdisc lock to protect operations. This can be used to build lockless scheduling algorithms and improving performance. The flag is checked in the tx path and the qdisc lock is only taken if it is not set. For now use a conditional if statement. Later we could be more aggressive if it proves worthwhile and use a static key or wrap this in a likely(). Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason for this is synchronizing a qlen counter across threads proves to cost more than doing the enqueue/dequeue operations when tested with pktgen. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08net: sched: cleanup qdisc_run and __qdisc_run semanticsJohn Fastabend
Currently __qdisc_run calls qdisc_run_end() but does not call qdisc_run_begin(). This makes it hard to track pairs of qdisc_run_{begin,end} across function calls. To simplify reading these code paths this patch moves begin/end calls into qdisc_run(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp_bbr: reset long-term bandwidth sampling on loss recovery undoNeal Cardwell
Fix BBR so that upon notification of a loss recovery undo BBR resets long-term bandwidth sampling. Under high reordering, reordering events can be interpreted as loss. If the reordering and spurious loss estimates are high enough, this can cause BBR to spuriously estimate that we are seeing loss rates high enough to trigger long-term bandwidth estimation. To avoid that problem, this commit resets long-term bandwidth sampling on loss recovery undo events. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp_bbr: reset full pipe detection on loss recovery undoNeal Cardwell
Fix BBR so that upon notification of a loss recovery undo BBR resets the full pipe detection (STARTUP exit) state machine. Under high reordering, reordering events can be interpreted as loss. If the reordering and spurious loss estimates are high enough, this could previously cause BBR to spuriously estimate that the pipe is full. Since spurious loss recovery means that our overall sending will have slowed down spuriously, this commit gives a flow more time to probe robustly for bandwidth and decide the pipe is really full. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp_bbr: record "full bw reached" decision in new full_bw_reached bitNeal Cardwell
This commit records the "full bw reached" decision in a new full_bw_reached bit. This is a pure refactor that does not change the current behavior, but enables subsequent fixes and improvements. In particular, this enables simple and clean fixes because the full_bw and full_bw_cnt can be unconditionally zeroed without worrying about forgetting that we estimated we filled the pipe in Startup. And it enables future improvements because multiple code paths can be used for estimating that we filled the pipe in Startup; any new code paths only need to set this bit when they think the pipe is full. Note that this fix intentionally reduces the width of the full_bw_cnt counter, since we have never used the most significant bit. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2017-12-07 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Detailed documentation of BPF development process from Daniel. 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops from Lawrence. 3) Minor follow up for bpf_skb_set_tunnel_key() from William. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08tcp: invalidate rate samples during SACK renegingYousuk Seung
Mark tcp_sock during a SACK reneging event and invalidate rate samples while marked. Such rate samples may overestimate bw by including packets that were SACKed before reneging. < ack 6001 win 10000 sack 7001:38001 < ack 7001 win 0 sack 8001:38001 // Reneg detected > seq 7001:8001 // RTO, SACK cleared. < ack 38001 win 10000 In above example the rate sample taken after the last ack will count 7001-38001 as delivered while the actual delivery rate likely could be much lower i.e. 7001-8001. This patch adds a new field tcp_sock.sack_reneg and marks it when we declare SACK reneging and entering TCP_CA_Loss, and unmarks it after the last rate sample was taken before moving back to TCP_CA_Open. This patch also invalidates rate samples taken while tcp_sock.is_sack_reneg is set. Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection") Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07smc: support variable CLC proposal messagesUrsula Braun
According to RFC7609 [1] the CLC proposal message contains an area of unknown length for future growth. Additionally it may contain up to 8 IPv6 prefixes. The current version of the SMC-code does not understand CLC proposal messages using these variable length fields and, thus, is incompatible with SMC implementations in other operating systems. This patch makes sure, SMC understands incoming CLC proposals * with arbitrary length values for future growth * with up to 8 IPv6 prefixes [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609 Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Reviewed-by: Hans Wippel <hwippel@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07smc: no consumer update in tasklet contextUrsula Braun
The SMC protocol requires to send a separate consumer cursor update, if it cannot be piggybacked to updates of the producer cursor. When receiving a blocked signal from the sender, this update is sent already in tasklet context. In addition consumer cursor updates are sent after data receival. Sending of cursor updates is controlled by sequence numbers. Assuming receiving stray messages the receiver drops updates with older sequence numbers than an already received cursor update with a higher sequence number. Sending consumer cursor updates in tasklet context may result in wrong order sends and its corresponding drops at the receiver. Since it is sufficient to send consumer cursor updates once the data is received, this patch gets rid of the consumer cursor update in tasklet context to guarantee in-sequence arrival of cursor updates. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07smc: cleanup close checking during data receivalUrsula Braun
When waiting for data to be received it must be checked if the peer signals shutdown. The SMC code uses two different checks for this purpose, even though just one check is sufficient. This patch removes the superfluous test for SOCK_DONE. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07smc: no update for unused sk_write_pendingUrsula Braun
The smc code never checks the sk_write_pending sock field. Thus there is no need to update it. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07smc: improve smc_clc_send_decline() error handlingUrsula Braun
Let smc_clc_send_decline() return with an error, if the amount sent is smaller than the length of an smc decline message. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07smc: make smc_close_active_abort() staticUrsula Braun
smc_close_active_abort() is used in smc_close.c only. Make it static. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07tcp: use current time in tcp_rcv_space_adjust()Eric Dumazet
When I switched rcv_rtt_est to high resolution timestamps, I forgot that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust() Using an old timestamp leads to autotuning lags. Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07net: dsa: Allow compiling out legacy supportFlorian Fainelli
Introduce a configuration option: CONFIG_NET_DSA_LEGACY allowing to compile out support for the old platform device and Device Tree binding registration. Support for these configurations is scheduled to be removed in 4.17. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-07adding missing rcu_read_unlock in ipxip6_rcvNikita V. Shirokov
commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") introduced new exit point in ipxip6_rcv. however rcu_read_unlock is missing there. this diff is fixing this v1->v2: instead of doing rcu_read_unlock in place, we are going to "drop" section (to prevent skb leakage) Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06rds: Fix NULL pointer dereference in __rds_rdma_mapHåkon Bugge
This is a fix for syzkaller719569, where memory registration was attempted without any underlying transport being loaded. Analysis of the case reveals that it is the setsockopt() RDS_GET_MR (2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable. Here is an example stack trace when the bug is hit: BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0 IP: __rds_rdma_map+0x36/0x440 [rds] PGD 2f93d03067 P4D 2f93d03067 PUD 2f93d02067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: bridge stp llc tun rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rds binfmt_misc sb_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul c rc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt mei_me sg iTCO_vendor_support ipmi_si mei ipmi_devintf nfsd shpchp pcspkr i2c_i801 ioatd ma ipmi_msghandler wmi lpc_ich mfd_core auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 mgag200 i2c_algo_bit drm_kms_helper ixgbe syscopyarea ahci sysfillrect sysimgblt libahci mdio fb_sys_fops ttm ptp libata sd_mod mlx4_core drm crc32c_intel pps_core megaraid_sas i2c_core dca dm_mirror dm_region_hash dm_log dm_mod CPU: 48 PID: 45787 Comm: repro_set2 Not tainted 4.14.2-3.el7uek.x86_64 #2 Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017 task: ffff882f9190db00 task.stack: ffffc9002b994000 RIP: 0010:__rds_rdma_map+0x36/0x440 [rds] RSP: 0018:ffffc9002b997df0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff882fa2182580 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc9002b997e40 RDI: ffff882fa2182580 RBP: ffffc9002b997e30 R08: 0000000000000000 R09: 0000000000000002 R10: ffff885fb29e3838 R11: 0000000000000000 R12: ffff882fa2182580 R13: ffff882fa2182580 R14: 0000000000000002 R15: 0000000020000ffc FS: 00007fbffa20b700(0000) GS:ffff882fbfb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 0000002f98a66006 CR4: 00000000001606e0 Call Trace: rds_get_mr+0x56/0x80 [rds] rds_setsockopt+0x172/0x340 [rds] ? __fget_light+0x25/0x60 ? __fdget+0x13/0x20 SyS_setsockopt+0x80/0xe0 do_syscall_64+0x67/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7fbff9b117f9 RSP: 002b:00007fbffa20aed8 EFLAGS: 00000293 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00000000000c84a4 RCX: 00007fbff9b117f9 RDX: 0000000000000002 RSI: 0000400000000114 RDI: 000000000000109b RBP: 00007fbffa20af10 R08: 0000000000000020 R09: 00007fbff9dd7860 R10: 0000000020000ffc R11: 0000000000000293 R12: 0000000000000000 R13: 00007fbffa20b9c0 R14: 00007fbffa20b700 R15: 0000000000000021 Code: 41 56 41 55 49 89 fd 41 54 53 48 83 ec 18 8b 87 f0 02 00 00 48 89 55 d0 48 89 4d c8 85 c0 0f 84 2d 03 00 00 48 8b 87 00 03 00 00 <48> 83 b8 c0 00 00 00 00 0f 84 25 03 00 0 0 48 8b 06 48 8b 56 08 The fix is to check the existence of an underlying transport in __rds_rdma_map(). Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06net_sched: use macvlan real dev trans_start in dev_trans_start()Chris Dion
Macvlan devices are similar to vlans and do not update their own trans_start. In order for arp monitoring to work for a bond device when the slaves are macvlans, obtain its real device. Signed-off-by: Chris Dion <christopher.dion@dell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06rds: debug: fix null check on static arrayPrashant Bhole
t_name cannot be NULL since it is an array field of a struct. Replacing null check on static array with string length check using strnlen() Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06act_mirred: get rid of mirred_list_lock spinlockCong Wang
TC actions are no longer freed in RCU callbacks and we should always have RTNL lock, so this spinlock is no longer needed. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06act_mirred: get rid of tcfm_ifindex from struct tcf_mirredCong Wang
tcfm_dev always points to the correct netdev and we already hold a refcnt, so no need to use tcfm_ifindex to lookup again. If we would support moving target netdev across netns, using pointer would be better than ifindex. This also fixes dumping obsolete ifindex, now after the target device is gone we just dump 0 as ifindex. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06ip6_gre: add ip6 erspan collect_md modeWilliam Tu
Similar to ip6 gretap and ip4 gretap, the patch allows erspan tunnel to operate in collect metadata mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05make sock_alloc_file() do sock_release() on failuresAl Viro
This changes calling conventions (and simplifies the hell out the callers). New rules: once struct socket had been passed to sock_alloc_file(), it's been consumed either by struct file or by sock_release() done by sock_alloc_file(). Either way the caller should not do sock_release() after that point. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05socketpair(): allocate descriptors firstAl Viro
simplifies failure exits considerably... Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05fix kcm_clone()Al Viro
1) it's fput() or sock_release(), not both 2) don't do fd_install() until the last failure exit. 3) not a bug per se, but... don't attach socket to struct file until it's set up. Take reserving descriptor into the caller, move fd_install() to the caller, sanitize failure exits and calling conventions. Cc: stable@vger.kernel.org # v4.6+ Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05dccp: CVE-2017-8824: use-after-free in DCCP codeMohamed Ghannam
Whenever the sock object is in DCCP_CLOSED state, dccp_disconnect() must free dccps_hc_tx_ccid and dccps_hc_rx_ccid and set to NULL. Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net_sched: remove unused parameter from act cleanup opsCong Wang
No one actually uses it. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net: dsa: assign a CPU port to DSA portVivien Didelot
DSA ports also need to have a dedicated CPU port assigned to them, because they need to know where to egress frames targeting the CPU, e.g. To_Cpu frames received on a Marvell Tag port. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05VSOCK: fix outdated sk_state value in hvs_release()Stefan Hajnoczi
Since commit 3b4477d2dcf2709d0be89e2a8dced3d0f4a017f2 ("VSOCK: use TCP state constants for sk_state") VSOCK has used TCP_* constants for sk_state. Commit b4562ca7925a3bedada87a3dd072dd5bad043288 ("hv_sock: add locking in the open/close/release code paths") reintroduced the SS_DISCONNECTING constant. This patch replaces the old SS_DISCONNECTING with the new TCP_CLOSING constant. CC: Dexuan Cui <decui@microsoft.com> CC: Cathy Avery <cavery@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net: sched: sch_api: rearrange init handlingAlexander Aring
This patch fixes the following checkpatch error: ERROR: do not use assignment in if condition by rearranging the if condition to execute init callback only if init callback exists. The whole setup afterwards is called in any case, doesn't matter if init callback is set or not. This patch has the same behaviour as before, just without assign err variable in if condition. It also makes the code easier to read. Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net: sched: sch_api: fix code style issuesAlexander Aring
This patch fix checkpatch issues for upcomming patches according to the sched api file. It changes checking on null pointer, remove unnecessary brackets, add variable names for parameters and adjust 80 char width. Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Alexander Aring <aring@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net_sched: get rid of rcu_barrier() in tcf_block_put_ext()Cong Wang
Both Eric and Paolo noticed the rcu_barrier() we use in tcf_block_put_ext() could be a performance bottleneck when we have a lot of tc classes. Paolo provided the following to demonstrate the issue: tc qdisc add dev lo root htb for I in `seq 1 1000`; do tc class add dev lo parent 1: classid 1:$I htb rate 100kbit tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb for J in `seq 1 10`; do tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J done done time tc qdisc del dev root real 0m54.764s user 0m0.023s sys 0m0.000s The rcu_barrier() there is to ensure we free the block after all chains are gone, that is, to queue tcf_block_put_final() at the tail of workqueue. We can achieve this ordering requirement by refcnt'ing tcf block instead, that is, the tcf block is freed only when the last chain in this block is gone. This also simplifies the code. Paolo reported after this patch we get: real 0m0.017s user 0m0.000s sys 0m0.017s Tested-by: Paolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05tipc: fix memory leak in tipc_accept_from_sock()Jon Maloy
When the function tipc_accept_from_sock() fails to create an instance of struct tipc_subscriber it omits to free the already created instance of struct tipc_conn instance before it returns. We fix that with this commit. Reported-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05tipc: fix a null pointer deref on error pathCong Wang
In tipc_topsrv_kern_subscr() when s->tipc_conn_new() fails we call tipc_close_conn() to clean up, but in this case calling conn_put() is just enough. This fixes the folllowing crash: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 3085 Comm: syzkaller064164 Not tainted 4.15.0-rc1+ #137 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: 00000000c24413a5 task.stack: 000000005e8160b5 RIP: 0010:__lock_acquire+0xd55/0x47f0 kernel/locking/lockdep.c:3378 RSP: 0018:ffff8801cb5474a8 EFLAGS: 00010002 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff85ecb400 RBP: ffff8801cb547830 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: ffffffff87489d60 R12: ffff8801cd2980c0 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020 FS: 00000000014ee880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffee2426e40 CR3: 00000001cb85a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:4004 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:175 spin_lock_bh include/linux/spinlock.h:320 [inline] tipc_subscrb_subscrp_delete+0x8f/0x470 net/tipc/subscr.c:201 tipc_subscrb_delete net/tipc/subscr.c:238 [inline] tipc_subscrb_release_cb+0x17/0x30 net/tipc/subscr.c:316 tipc_close_conn+0x171/0x270 net/tipc/server.c:204 tipc_topsrv_kern_subscr+0x724/0x810 net/tipc/server.c:514 tipc_group_create+0x702/0x9c0 net/tipc/group.c:184 tipc_sk_join net/tipc/socket.c:2747 [inline] tipc_setsockopt+0x249/0xc10 net/tipc/socket.c:2861 SYSC_setsockopt net/socket.c:1851 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1830 entry_SYSCALL_64_fastpath+0x1f/0x96 Fixes: 14c04493cb77 ("tipc: add ability to order and receive topology events in driver") Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05net_sched: red: Avoid illegal valuesNogah Frankel
Check the qmin & qmax values doesn't overflow for the given Wlog value. Check that qmin <= qmax. Fixes: a783474591f2 ("[PKT_SCHED]: Generic RED layer") Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05flow_dissector: dissect tunnel info outside __skb_flow_dissect()Simon Horman
Move dissection of tunnel info to outside of the main flow dissection function, __skb_flow_dissect(). The sole user of this feature, the flower classifier, is updated to call tunnel info dissection directly, using skb_flow_dissect_tunnel_info(). This results in a slightly less complex implementation of __skb_flow_dissect(), in particular removing logic from that call path which is not used by the majority of users. The expense of this is borne by the flower classifier which now has to make an extra call for tunnel info dissection. This patch should not result in any behavioural change. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05Revert "net: core: maybe return -EEXIST in __dev_alloc_name"Johannes Berg
This reverts commit d6f295e9def0; some userspace (in the case we noticed it's wpa_supplicant), is relying on the current error code to determine that a fixed name interface already exists. Reported-by: Jouni Malinen <j@w1.fi> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05Revert "tcp: must block bh in __inet_twsk_hashdance()"Eric Dumazet
We had to disable BH _before_ calling __inet_twsk_hashdance() in commit cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer"). This means we can revert 614bdd4d6e61 ("tcp: must block bh in __inet_twsk_hashdance()"). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05rtnetlink: fix rtnl_link msghandler rcu annotationsFlorian Westphal
Incorrect/missing annotations caused a few sparse warnings: rtnetlink.c:155:15: incompatible types .. (different address spaces) rtnetlink.c:157:23: incompatible types .. (different address spaces) rtnetlink.c:185:15: incompatible types .. (different address spaces) rtnetlink.c:285:15: incompatible types .. (different address spaces) rtnetlink.c:317:9: incompatible types .. (different address spaces) rtnetlink.c:3054:23: incompatible types .. (different address spaces) no change in generated code. Fixes: addf9b90de22f7 ("net: rtnetlink: use rcu to free rtnl message handlers") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Small overlapping change conflict ('net' changed a line, 'net-next' added a line right afterwards) in flexcan.c Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05bpf: Add access to snd_cwnd and others in sock_opsLawrence Brakmo
Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these fields are only valid if the socket associated with the sock_ops program call is a full socket, the field is_fullsock is also added to the bpf_sock_ops struct. If the socket is not a full socket, reading these fields returns 0. Note that in most cases it will not be necessary to check is_fullsock to know if there is a full socket. The context of the call, as specified by the 'op' field, can sometimes determine whether there is a full socket. The struct bpf_sock_ops has the following fields added: __u32 is_fullsock; /* Some TCP fields are only valid if * there is a full socket. If not, the * fields read as zero. */ __u32 snd_cwnd; __u32 srtt_us; /* Averaged RTT << 3 in usecs */ There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add read access to more 32 bit tcp_sock fields. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-04bpf: move bpf csum flag checkWilliam Tu
trivial move the BPF_F_ZERO_CSUM_TX check right below the 'flags & BPF_F_DONT_FRAGMENT', so common tun_flags handling is logically together. Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Various TCP control block fixes, including one that crashes with SELinux, from David Ahern and Eric Dumazet. 2) Fix ACK generation in rxrpc, from David Howells. 3) ipvlan doesn't set the mark properly in the ipv4 route lookup key, from Gao Feng. 4) SIT configuration doesn't take on the frag_off ipv4 field configuration properly, fix from Hangbin Liu. 5) TSO can fail after device down/up on stmmac, fix from Lars Persson. 6) Various bpftool fixes (mostly in JSON handling) from Quentin Monnet. 7) Various SKB leak fixes in vhost/tun/tap (mostly observed as performance problems). From Wei Xu. 8) mvpps's TX descriptors were not zero initialized, from Yan Markman. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits) tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match() tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb() rxrpc: Fix the MAINTAINERS record rxrpc: Use correct netns source in rxrpc_release_sock() liquidio: fix incorrect indentation of assignment statement stmmac: reset last TSO segment size after device open ipvlan: Add the skb->mark as flow4's member to lookup route s390/qeth: build max size GSO skbs on L2 devices s390/qeth: fix GSO throughput regression s390/qeth: fix thinko in IPv4 multicast address tracking tap: free skb if flags error tun: free skb in early errors vhost: fix skb leak in handle_rx() bnxt_en: Fix a variable scoping in bnxt_hwrm_do_send_msg() bnxt_en: fix dst/src fid for vxlan encap/decap actions bnxt_en: wildcard smac while creating tunnel decap filter bnxt_en: Need to unconditionally shut down RoCE in bnxt_shutdown phylink: ensure we take the link down when phylink_stop() is called sfp: warn about modules requiring address change sequence sfp: improve RX_LOS handling ...
2017-12-04rtnetlink: ipv6: convert remaining users to rtnl_register_moduleFlorian Westphal
convert remaining users of rtnl_register to rtnl_register_module and un-export rtnl_register. Requested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2017-12-03 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Addition of a software model for BPF offloads in order to ease testing code changes in that area and make semantics more clear. This is implemented in a new driver called netdevsim, which can later also be extended for other offloads. SR-IOV support is added as well to netdevsim. BPF kernel selftests for offloading are added so we can track basic functionality as well as exercising all corner cases around BPF offloading, from Jakub. 2) Today drivers have to drop the reference on BPF progs they hold due to XDP on device teardown themselves. Change this in order to make XDP handling inside the drivers less error prone, and move disabling XDP to the core instead, also from Jakub. 3) Misc set of BPF verifier improvements and cleanups as preparatory work for upcoming BPF-to-BPF calls. Among others, this set also improves liveness marking such that pruning can be slightly more effective. Register and stack liveness information is now included in the verifier log as well, from Alexei. 4) nfp JIT improvements in order to identify load/store sequences in the BPF prog e.g. coming from memcpy lowering and optimizing them through the NPU's command push pull (CPP) instruction, from Jiong. 5) Cleanups to test_cgrp2_attach2.c BPF sample code in oder to remove bpf_prog_attach() magic values and replacing them with actual proper attach flag instead, from David. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-04rtnetlink: remove __rtnl_registerFlorian Westphal
This removes __rtnl_register and switches callers to either rtnl_register or rtnl_register_module. Also, rtnl_register() will now print an error if memory allocation failed rather than panic the kernel. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>