summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)Author
2016-11-24net sched filters: fix filter handle ID in tfilter_notify_chain()Roman Mashak
Should pass valid filter handle, not the netlink flags. Fixes: 30a391a13ab92 ("net sched filters: pass netlink message flags in event notification") Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All conflicts were simple overlapping changes except perhaps for the Thunder driver. That driver has a change_mtu method explicitly for sending a message to the hardware. If that fails it returns an error. Normally a driver doesn't need an ndo_change_mtu method becuase those are usually just range changes, which are now handled generically. But since this extra operation is needed in the Thunder driver, it has to stay. However, if the message send fails we have to restore the original MTU before the change because the entire call chain expects that if an error is thrown by ndo_change_mtu then the MTU did not change. Therefore code is added to nicvf_change_mtu to remember the original MTU, and to restore it upon nicvf_update_hw_max_frs() failue. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18netns: make struct pernet_operations::id unsigned intAlexey Dobriyan
Make struct pernet_operations::id unsigned. There are 2 reasons to do so: 1) This field is really an index into an zero based array and thus is unsigned entity. Using negative value is out-of-bound access by definition. 2) On x86_64 unsigned 32-bit data which are mixed with pointers via array indexing or offsets added or subtracted to pointers are preffered to signed 32-bit data. "int" being used as an array index needs to be sign-extended to 64-bit before being used. void f(long *p, int i) { g(p[i]); } roughly translates to movsx rsi, esi mov rdi, [rsi+...] call g MOVSX is 3 byte instruction which isn't necessary if the variable is unsigned because x86_64 is zero extending by default. Now, there is net_generic() function which, you guessed it right, uses "int" as an array index: static inline void *net_generic(const struct net *net, int id) { ... ptr = ng->ptr[id - 1]; ... } And this function is used a lot, so those sign extensions add up. Patch snipes ~1730 bytes on allyesconfig kernel (without all junk messing with code generation): add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) Unfortunately some functions actually grow bigger. This is a semmingly random artefact of code generation with register allocator being used differently. gcc decides that some variable needs to live in new r8+ registers and every access now requires REX prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be used which is longer than [r8] However, overall balance is in negative direction: add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730) function old new delta nfsd4_lock 3886 3959 +73 tipc_link_build_proto_msg 1096 1140 +44 mac80211_hwsim_new_radio 2776 2808 +32 tipc_mon_rcv 1032 1058 +26 svcauth_gss_legacy_init 1413 1429 +16 tipc_bcbase_select_primary 379 392 +13 nfsd4_exchange_id 1247 1260 +13 nfsd4_setclientid_confirm 782 793 +11 ... put_client_renew_locked 494 480 -14 ip_set_sockfn_get 730 716 -14 geneve_sock_add 829 813 -16 nfsd4_sequence_done 721 703 -18 nlmclnt_lookup_host 708 686 -22 nfsd4_lockt 1085 1063 -22 nfs_get_client 1077 1050 -27 tcf_bpf_init 1106 1076 -30 nfsd4_encode_fattr 5997 5930 -67 Total: Before=154856051, After=154854321, chg -0.00% Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17net sched filters: pass netlink message flags in event notificationRoman Mashak
Userland client should be able to read an event, and reflect it back to the kernel, therefore it needs to extract complete set of netlink flags. For example, this will allow "tc monitor" to distinguish Add and Replace operations. Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-17net_sched: sch_fq: use hash_ptr()Eric Dumazet
When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful, and I chose hash_32(). Linus Torvalds and George Spelvin fixed this issue, so we can use hash_ptr() to get more entropy on 64bit arches with Terabytes of memory, and avoid the cast games. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains a second batch of Netfilter updates for your net-next tree. This includes a rework of the core hook infrastructure that improves Netfilter performance by ~15% according to synthetic benchmarks. Then, a large batch with ipset updates, including a new hash:ipmac set type, via Jozsef Kadlecsik. This also includes a couple of assorted updates. Regarding the core hook infrastructure rework to improve performance, using this simple drop-all packets ruleset from ingress: nft add table netdev x nft add chain netdev x y { type filter hook ingress device eth0 priority 0\; } nft add rule netdev x y drop And generating traffic through Jesper Brouer's samples/pktgen/pktgen_bench_xmit_mode_netif_receive.sh script using -i option. perf report shows nf_tables calls in its top 10: 17.30% kpktgend_0 [nf_tables] [k] nft_do_chain 15.75% kpktgend_0 [kernel.vmlinux] [k] __netif_receive_skb_core 10.39% kpktgend_0 [nf_tables_netdev] [k] nft_do_chain_netdev I'm measuring here an improvement of ~15% in performance with this patchset, so we got +2.5Mpps more. I have used my old laptop Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz 4-cores. This rework contains more specifically, in strict order, these patches: 1) Remove compile-time debugging from core. 2) Remove obsolete comments that predate the rcu era. These days it is well known that a Netfilter hook always runs under rcu_read_lock(). 3) Remove threshold handling, this is only used by br_netfilter too. We already have specific code to handle this from br_netfilter, so remove this code from the core path. 4) Deprecate NF_STOP, as this is only used by br_netfilter. 5) Place nf_state_hook pointer into xt_action_param structure, so this structure fits into one single cacheline according to pahole. This also implicit affects nftables since it also relies on the xt_action_param structure. 6) Move state->hook_entries into nf_queue entry. The hook_entries pointer is only required by nf_queue(), so we can store this in the queue entry instead. 7) use switch() statement to handle verdict cases. 8) Remove hook_entries field from nf_hook_state structure, this is only required by nf_queue, so store it in nf_queue_entry structure. 9) Merge nf_iterate() into nf_hook_slow() that results in a much more simple and readable function. 10) Handle NF_REPEAT away from the core, so far the only client is nf_conntrack_in() and we can restart the packet processing using a simple goto to jump back there when the TCP requires it. This update required a second pass to fix fallout, fix from Arnd Bergmann. 11) Set random seed from nft_hash when no seed is specified from userspace. 12) Simplify nf_tables expression registration, in a much smarter way to save lots of boiler plate code, by Liping Zhang. 13) Simplify layer 4 protocol conntrack tracker registration, from Davide Caratti. 14) Missing CONFIG_NF_SOCKET_IPV4 dependency for udp4_lib_lookup, due to recent generalization of the socket infrastructure, from Arnd Bergmann. 15) Then, the ipset batch from Jozsef, he describes it as it follows: * Cleanup: Remove extra whitespaces in ip_set.h * Cleanup: Mark some of the helpers arguments as const in ip_set.h * Cleanup: Group counter helper functions together in ip_set.h * struct ip_set_skbinfo is introduced instead of open coded fields in skbinfo get/init helper funcions. * Use kmalloc() in comment extension helper instead of kzalloc() because it is unnecessary to zero out the area just before explicit initialization. * Cleanup: Split extensions into separate files. * Cleanup: Separate memsize calculation code into dedicated function. * Cleanup: group ip_set_put_extensions() and ip_set_get_extensions() together. * Add element count to hash headers by Eric B Munson. * Add element count to all set types header for uniform output across all set types. * Count non-static extension memory into memsize calculation for userspace. * Cleanup: Remove redundant mtype_expire() arguments, because they can be get from other parameters. * Cleanup: Simplify mtype_expire() for hash types by removing one level of intendation. * Make NLEN compile time constant for hash types. * Make sure element data size is a multiple of u32 for the hash set types. * Optimize hash creation routine, exit as early as possible. * Make struct htype per ipset family so nets array becomes fixed size and thus simplifies the struct htype allocation. * Collapse same condition body into a single one. * Fix reported memory size for hash:* types, base hash bucket structure was not taken into account. * hash:ipmac type support added to ipset by Tomasz Chilinski. * Use setup_timer() and mod_timer() instead of init_timer() by Muhammad Falak R Wani, individually for the set type families. 16) Remove useless connlabel field in struct netns_ct, patch from Florian Westphal. 17) xt_find_table_lock() doesn't return ERR_PTR() anymore, so simplify {ip,ip6,arp}tables code that uses this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/sched: act_tunnel_key: Add UDP dst port optionHadar Hen Zion
The current tunnel set action supports only IP addresses and key options. Add UDP dst port option. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/dst: Add dst port to dst_metadata utility functionsHadar Hen Zion
Add dst port parameter to __ip_tun_set_dst and __ipv6_tun_set_dst utility functions. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/sched: cls_flower: Add UDP port to tunnel parametersHadar Hen Zion
The current IP tunneling classification supports only IP addresses and key. Enhance UDP based IP tunneling classification parameters by adding UDP src and dst port. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/sched: cls_flower: Allow setting encapsulation fields as used keyHadar Hen Zion
When encapsulation field is set, mark it as used key for the flow dissector. This will be used by offloading drivers. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-09net/sched: act_tunnel_key: add helper inlines to access tcf_tunnel_keyHadar Hen Zion
Needed for drivers to pick the relevant action when offloading tunnel key act. Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-07qdisc: catch misconfig of attaching qdisc to tx_queue_len zero deviceJesper Dangaard Brouer
It is a clear misconfiguration to attach a qdisc to a device with tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred, htb, plug and sfb) inherit/copy this value as their queue length. Why should the kernel catch such a misconfiguration? Because prior to introducing the IFF_NO_QUEUE device flag, userspace found a loophole in the qdisc config system that allowed them to achieve the equivalent of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from a device. The loophole on older kernels is setting tx_queue_len=0, *prior* to device qdisc init (the config time is significant, simply setting tx_queue_len=0 doesn't trigger the loophole). This loophole is currently used by Docker[1] to get better performance and scalability out of the veth device. The Docker developers were warned[1] that they needed to adjust the tx_queue_len if ever attaching a qdisc. The OpenShift project didn't remember this warning and attached a qdisc, this were caught and fixed in[2]. [1] https://github.com/docker/libcontainer/pull/193 [2] https://github.com/openshift/origin/pull/11126 Instead of fixing every userspace program that used this loophole, and forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's catch the misconfiguration on the kernel side. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03net/sched: cls_flower: Support matching on SCTP portsSimon Horman
Support matching on SCTP ports in the same way that matching on TCP and UDP ports is already supported. Example usage: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol ip parent ffff: \ flower indev eth0 ip_proto sctp dst_port 80 \ action drop Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03netfilter: x_tables: move hook state into xt_action_param structurePablo Neira Ayuso
Place pointer to hook state in xt_action_param structure instead of copying the fields that we need. After this change xt_action_param fits into one cacheline. This patch also adds a set of new wrapper functions to fetch relevant hook state structure fields. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-11-02net/sched: cls_flower: merge filter delete/destroy common codeRoi Dayan
Move common code from fl_delete and fl_detroy to __fl_delete. Signed-off-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02net/sched: cls_flower: add missing unbind call when destroying flowsRoi Dayan
tcf_unbind was called in fl_delete but was missing in fl_destroy when force deleting flows. Fixes: 77b9900ef53a ('tc: introduce Flower classifier') Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29net_sched actions: use nla_parse_nested()Johannes Berg
Use nla_parse_nested instead of open-coding the call to nla_parse() with the attribute data/len. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29netlink: Add nla_memdup() to wrap kmemdup() use on nlattrThomas Graf
Wrap several common instances of: kmemdup(nla_data(attr), nla_len(attr), GFP_KERNEL); Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27net sched filters: fix notification of filter delete with proper handleJamal Hadi Salim
Daniel says: While trying out [1][2], I noticed that tc monitor doesn't show the correct handle on delete: $ tc monitor qdisc clsact ffff: dev eno1 parent ffff:fff1 filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...] deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80 some context to explain the above: The user identity of any tc filter is represented by a 32-bit identifier encoded in tcm->tcm_handle. Example 0x2a in the bpf filter above. A user wishing to delete, get or even modify a specific filter uses this handle to reference it. Every classifier is free to provide its own semantics for the 32 bit handle. Example: classifiers like u32 use schemes like 800:1:801 to describe the semantics of their filters represented as hash table, bucket and node ids etc. Classifiers also have internal per-filter representation which is different from this externally visible identity. Most classifiers set this internal representation to be a pointer address (which allows fast retrieval of said filters in their implementations). This internal representation is referenced with the "fh" variable in the kernel control code. When a user successfuly deletes a specific filter, by specifying the correct tcm->tcm_handle, an event is generated to user space which indicates which specific filter was deleted. Before this patch, the "fh" value was sent to user space as the identity. As an example what is shown in the sample bpf filter delete event above is 0xf3be0c80. This is infact a 32-bit truncation of 0xffff8807f3be0c80 which happens to be a 64-bit memory address of the internal filter representation (address of the corresponding filter's struct cls_bpf_prog); After this patch the appropriate user identifiable handle as encoded in the originating request tcm->tcm_handle is generated in the event. One of the cardinal rules of netlink rules is to be able to take an event (such as a delete in this case) and reflect it back to the kernel and successfully delete the filter. This patch achieves that. Note, this issue has existed since the original TC action infrastructure code patch back in 2004 as found in: https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/ [1] http://patchwork.ozlabs.org/patch/682828/ [2] http://patchwork.ozlabs.org/patch/682829/ Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27skbedit: allow the user to specify bitmask for markAntonio Quartulli
The user may want to use only some bits of the skb mark in his skbedit rules because the remaining part might be used by something else. Introduce the "mask" parameter to the skbedit actor in order to implement such functionality. When the mask is specified, only those bits selected by the latter are altered really changed by the actor, while the rest is left untouched. Signed-off-by: Antonio Quartulli <antonio@open-mesh.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26sch_htb: do not report fake rate estimatorsEric Dumazet
When I prepared commit d250a5f90e53 ("pkt_sched: gen_estimator: Dont report fake rate estimators"), htb still had an implicit rate estimator for all its classes. Then later, I made this rate estimator optional in commit 64153ce0a7b6 ("net_sched: htb: do not setup default rate estimators"), but I forgot to update htb use of gnet_stats_copy_rate_est() After this patch, "tc -s qdisc ..." no longer report fake rate estimators for HTB classes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23net/sched: em_meta: Fix 'meta vlan' to correctly recognize zero VID framesShmulik Ladkani
META_COLLECTOR int_vlan_tag() assumes that if the accel tag (vlan_tci) is zero, then no vlan accel tag is present. This is incorrect for zero VID vlan accel packets, making the following match fail: tc filter add ... basic match 'meta(vlan mask 0xfff eq 0)' ... Apparently 'int_vlan_tag' was implemented prior VLAN_TAG_PRESENT was introduced in 05423b2 "vlan: allow null VLAN ID to be used" (and at time introduced, the 'vlan_tx_tag_get' call in em_meta was not adapted). Fix, testing skb_vlan_tag_present instead of testing skb_vlan_tag_get's value. Fixes: 05423b2413 ("vlan: allow null VLAN ID to be used") Fixes: 1a31f2042e ("netsched: Allow meta match on vlan tag on receive") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20net: use core MTU range checking in core net infraJarod Wilson
geneve: - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu - This one isn't quite as straight-forward as others, could use some closer inspection and testing macvlan: - set min/max_mtu tun: - set min/max_mtu, remove tun_net_change_mtu vxlan: - Merge __vxlan_change_mtu back into vxlan_change_mtu - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in change_mtu function - This one is also not as straight-forward and could use closer inspection and testing from vxlan folks bridge: - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in change_mtu function openvswitch: - set min/max_mtu, remove internal_dev_change_mtu - note: max_mtu wasn't checked previously, it's been set to 65535, which is the largest possible size supported sch_teql: - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535) macsec: - min_mtu = 0, max_mtu = 65535 macvlan: - min_mtu = 0, max_mtu = 65535 ntb_netdev: - min_mtu = 0, max_mtu = 65535 veth: - min_mtu = 68, max_mtu = 65535 8021q: - min_mtu = 0, max_mtu = 65535 CC: netdev@vger.kernel.org CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> CC: Tom Herbert <tom@herbertland.com> CC: Daniel Borkmann <daniel@iogearbox.net> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Paolo Abeni <pabeni@redhat.com> CC: Jiri Benc <jbenc@redhat.com> CC: WANG Cong <xiyou.wangcong@gmail.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Pravin B Shelar <pshelar@ovn.org> CC: Sabrina Dubroca <sd@queasysnail.net> CC: Patrick McHardy <kaber@trash.net> CC: Stephen Hemminger <stephen@networkplumber.org> CC: Pravin Shelar <pshelar@nicira.com> CC: Maxim Krasnyansky <maxk@qti.qualcomm.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20net/sched: act_mirred: Use passed lastuse argumentPaul Blakey
stats_update callback is called by NIC drivers doing hardware offloading of the mirred action. Lastuse is passed as argument to specify when the stats was actually last updated and is not always the current time. Fixes: 9798e6fe4f9b ('net: act_mirred: allow statistic updates from offloaded actions') Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14net/sched: act_mirred: Implement ingress actionsShmulik Ladkani
Up until now, 'action mirred' supported only egress actions (either TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR). This patch implements the corresponding ingress actions TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR. This allows attaching filters whose target is to hand matching skbs into the rx processing of a specified device. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14net/sched: act_mirred: Refactor detection whether dev needs xmit at mac headerShmulik Ladkani
Move detection logic that tests whether device expects skb data to point at mac_header upon xmit into a function. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-14net/sched: act_mirred: Rename tcfm_ok_push to tcfm_mac_header_xmit and make ↵Shmulik Ladkani
it a bool 'tcfm_ok_push' specifies whether a mac_len sized push is needed upon egress to the target device (if action is performed at ingress). Rename it to 'tcfm_mac_header_xmit' as this is actually an attribute of the target device (and use a bool instead of int). This allows to decouple the attribute from the action to be taken. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13net_sched: reorder pernet ops and act ops registrationsWANG Cong
Krister reported a kernel NULL pointer dereference after tcf_action_init_1() invokes a_o->init(), it is a race condition where one thread calling tcf_register_action() to initialize the netns data after putting act ops in the global list and the other thread searching the list and then calling a_o->init(net, ...). Fix this by moving the pernet ops registration before making the action ops visible. This is fine because: a) we don't rely on act_base in pernet ops->init(), b) in the worst case we have a fully initialized netns but ops is still not ready so new actions still can't be created. Reported-by: Krister Johansen <kjlx@templeofstupid.com> Tested-by: Krister Johansen <kjlx@templeofstupid.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13net_sched: do not broadcast RTM_GETTFILTER resultEric Dumazet
There are two ways to get tc filters from kernel to user space. 1) Full dump (tc_dump_tfilter()) 2) RTM_GETTFILTER to get one precise filter, reducing overhead. The second operation is unfortunately broadcasting its result, polluting "tc monitor" users. This patch makes sure only the requester gets the result, using netlink_unicast() instead of rtnetlink_send() Jamal cooked an iproute2 patch to implement "tc filter get" operation, but other user space libraries already use RTM_GETTFILTER when a single filter is queried, instead of dumping all filters. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-03net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() ↵Shmulik Ladkani
functions Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the case where the input skb data pointer does not point at the mac header: - They're doing push/pop, but fail to properly unwind data back to its original location. For example, in the skb_vlan_push case, any subsequent 'skb_push(skb, skb->mac_len)' calls make the skb->data point 4 bytes BEFORE start of frame, leading to bogus frames that may be transmitted. - They update rcsum per the added/removed 4 bytes tag. Alas if data is originally after the vlan/eth headers, then these bytes were already pulled out of the csum. OTOH calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header present no issues. act_vlan is the only caller to skb_vlan_*() that has skb->data pointing at network header (upon ingress). Other calles (ovs, bpf) already adjust skb->data at mac_header. This patch fixes act_vlan to point to the mac_header prior calling skb_vlan_*() functions, as other callers do. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Pravin Shelar <pshelar@ovn.org> Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Three sets of overlapping changes. Nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28net/sched: cls_flower: Use a proper mask value for enc key id parameterHadar Hen Zion
The current code use the encapsulation key id value as the mask of that parameter which is wrong. Fix that by using a full mask. Fixes: bc3103f1ed40 ('net/sched: cls_flower: Classify packet in ip tunnels') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27act_ife: Fix false encodingYotam Gigi
On ife encode side, the action stores the different tlvs inside the ife header, where each tlv length field should refer to the length of the whole tlv (without additional padding) and not just the data length. On ife decode side, the action iterates over the tlvs in the ife header and parses them one by one, where in each iteration the current pointer is advanced according to the tlv size. Before, the encoding encoded only the data length inside the tlv, which led to false parsing of ife the header. In addition, due to the fact that the loop counter was unsigned, it could lead to infinite parsing loop. This fix changes the loop counter to be signed and fixes the encoding to take into account the tlv type and size. Fixes: 28a10c426e81 ("net sched: fix encoding to use real length") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27act_ife: Fix external mac header on encodeYotam Gigi
On ife encode side, external mac header is copied from the original packet and may be overridden if the user requests. Before, the mac header copy was done from memory region that might not be accessible anymore, as skb_cow_head might free it and copy the packet. This led to random values in the external mac header once the values were not set by user. This fix takes the internal mac header from the packet, after the call to skb_cow_head. Fixes: ef6980b6becb ("net sched: introduce IFE action") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23net_sched: sch_fq: account for schedule/timers driftsEric Dumazet
It looks like the following patch can make FQ very precise, even in VM or stressed hosts. It matters at high pacing rates. We take into account the difference between the time that was programmed when last packet was sent, and current time (a drift of tens of usecs is often observed) Add an EWMA of the unthrottle latency to help diagnostics. This latency is the difference between current time and oldest packet in delayed RB-tree. This accounts for the high resolution timer latency, but can be different under stress, as fq_check_throttled() can be opportunistically be called from a dequeue() called after an enqueue() for a different flow. Tested: // Start a 10Gbit flow $ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr & Before patch : $ sar -n DEV 10 5 | grep eth0 | grep Average Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52 After patch : $ sar -n DEV 10 5 | grep eth0 | grep Average Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52 A new iproute2 tc can output the 'unthrottle latency' : $ tc -s qd sh dev eth0 | grep latency 0 gc, 0 highprio, 32490767 throttled, 2382 ns latency Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23sch_sfb: keep backlog updated with qlenWANG Cong
Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23sch_qfq: keep backlog updated with qlenWANG Cong
Reported-by: Stas Nichiporovich <stasn77@gmail.com> Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23net_sched: check NULL on error path in route4_change()WANG Cong
On error path in route4_change(), 'f' could be NULL, so we should check NULL before calling tcf_exts_destroy(). Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()") Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan actionShmulik Ladkani
TCA_VLAN_ACT_MODIFY allows one to change an existing tag. It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id, priority). If packet is vlan tagged, then the tag gets overwritten according to user specified attributes. For example, this allows user to replace a tag's vid while preserving its priority bits (as opposed to "action vlan pop pipe action vlan push"). Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: act_mirred: allow statistic updates from offloaded actionsJakub Kicinski
Implement .stats_update() callback. The implementation is generic and can be reused by other simple actions if needed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: allow offloaded filters to update statsJakub Kicinski
Call into offloaded filters to update stats. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: add support for marking filters as hardware-onlyJakub Kicinski
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: limit hardware offload by software-only flagJakub Kicinski
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag. Unlike U32 and flower cls_bpf already has some netlink flags defined. Create a new attribute to be able to use the same flag values as the above. Unlike U32 and flower reject unknown flags. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net: cls_bpf: add hardware offloadJakub Kicinski
This patch adds hardware offload capability to cls_bpf classifier, similar to what have been done with U32 and flower. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21net_sched: sch_fq: add low_rate_threshold parameterEric Dumazet
This commit adds to the fq module a low_rate_threshold parameter to insert a delay after all packets if the socket requests a pacing rate below the threshold. This helps achieve more precise control of the sending rate with low-rate paths, especially policers. The basic issue is that if a congestion control module detects a policer at a certain rate, it may want fq to be able to shape to that policed rate. That way the sender can avoid policer drops by having the packets arrive at the policer at or just under the policed rate. The default threshold of 550Kbps was chosen analytically so that for policers or links at 500Kbps or 512Kbps fq would very likely invoke this mechanism, even if the pacing rate was briefly slightly above the available bandwidth. This value was then empirically validated with two years of production testing on YouTube video servers. Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-20net sched actions: fix GETing actionsJamal Hadi Salim
With the batch changes that translated transient actions into a temporary list lost in the translation was the fact that tcf_action_destroy() will eventually delete the action from the permanent location if the refcount is zero. Example of what broke: ...add a gact action to drop sudo $TC actions add action drop index 10 ...now retrieve it, looks good sudo $TC actions get action gact index 10 ...retrieve it again and find it is gone! sudo $TC actions get action gact index 10 Fixes: 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"), Fixes: 824a7e8863b3 ("net_sched: remove an unnecessary list_del()") Fixes: f07fed82ad79 ("net_sched: remove the leftover cleanup_a()") Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19net sched: stylistic cleanupsJamal Hadi Salim
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19net sched actions police: peg drop stats for conforming trafficRoman Mashak
setting conforming action to drop is a valid policy. When it is set we need to at least see the stats indicating it for debugging. Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19net sched ife action: Introduce skb tcindex metadata encap decapJamal Hadi Salim
Sample use case of how this is encoded: user space via tuntap (or a connected VM/Machine/container) encodes the tcindex TLV. Sample use case of decoding: IFE action decodes it and the skb->tc_index is then used to classify. So something like this for encoded ICMP packets: .. first decode then reclassify... skb->tcindex will be set sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \ u32 match u32 0 0 flowid 1:1 \ action ife decode reclassify ...next match the decode icmp packet... sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \ u32 match ip protocol 1 0xff flowid 1:1 \ action continue ... last classify it using the tcindex classifier and do someaction.. sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \ handle 0x11 tcindex classid 1:1 \ action blah.. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>