summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)Author
2017-02-10net/act_pedit: Introduce 'add' operationAmir Vadai
This command could be useful to inc/dec fields. For example, to forward any TCP packet and decrease its TTL: $ tc filter add dev enp0s9 protocol ip parent ffff: \ flower ip_proto tcp \ action pedit munge ip ttl add 0xff pipe \ action mirred egress redirect dev veth0 In the example above, adding 0xff to this u8 field is actually decreasing it by one, since the operation is masked. Signed-off-by: Amir Vadai <amir@vadai.me> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10net/act_pedit: Support using offset relative to the conventional network headersAmir Vadai
Extend pedit to enable the user setting offset relative to network headers. This change would enable to work with more complex header schemes (vs the simple IPv4 case) where setting a fixed offset relative to the network header is not enough. After this patch, the action has information about the exact header type and field inside this header. This information could be used later on for hardware offloading of pedit. Backward compatibility was being kept: 1. Old kernel <-> new userspace 2. New kernel <-> old userspace 3. add rule using new userspace <-> dump using old userspace 4. add rule using old userspace <-> dump using new userspace When using the extended api, new netlink attributes are being used. This way, operation will fail in (1) and (3) - and no malformed rule be added or dumped. Of course, new user space that doesn't need the new functionality can use the old netlink attributes and operation will succeed. Since action can support both api's, (2) should work, and it is easy to write the new user space to have (4) work. The action is having a strict check that only header types and commands it can handle are accepted. This way future additions will be much easier. Usage example: $ tc filter add dev enp0s9 protocol ip parent ffff: \ flower \ ip_proto tcp \ dst_port 80 \ action pedit munge tcp dport set 8080 pipe \ action mirred egress redirect dev veth0 Will forward tcp port whose original dest port is 80, while modifying the destination port to 8080. Signed-off-by: Amir Vadai <amir@vadai.me> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: check negative err value to safe one level of indentJiri Pirko
As it is more common, check err for !0. That allows to safe one level of indentation and makes the code easier to read. Also, make 'next' variable global in function as it is used twice. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: add missing curly braces in else branch in tc_ctl_tfilterJiri Pirko
Curly braces need to be there, for stylistic reasons. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: move err set right before goto errout in tc_ctl_tfilterJiri Pirko
This makes the reader to know right away what is the error value. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: push TC filter protocol creation into a separate functionJiri Pirko
Make the long function tc_ctl_tfilter a little bit shorter and easier to read. Also make the creation of filter proto symmetric to destruction. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_apiJiri Pirko
Creation is done in this file, move destruction to be at the same place. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10sched: rename tcf_destroy to tcf_destroy_protoJiri Pirko
This function destroys TC filter protocol, not TC filter. So name it accordingly. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07net/sched: act_mirred: remove duplicated include from act_mirred.cWei Yongjun
Remove duplicated include. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal. 2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprinsingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace. 3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek. 4) Validate SCTP checksum from conntrack helper, from Davide Caratti. 5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal. 6) Reset netfilter state when duplicating packets, also from Florian. 7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang. 8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping. 9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng. 10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor. 11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti. 12) Missing documentation on nf_tables uapi header, from Liping Zhang. 13) Use rb_entry() helper in xt_connlimit, from Geliang Tang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03sched: cls_flower: expose priority to offloading netdeviceJiri Pirko
The driver that offloads flower rules needs to know with which priority user inserted the rules. So add this information into offload struct. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net/sched: act_ife: Change to use ife moduleYotam Gigi
Use the encode/decode functionality from the ife module instead of using implementation inside the act_ife. Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-03net/sched: act_ife: Unexport ife_tlv_meta_encodeYotam Gigi
As the function ife_tlv_meta_encode is not used by any other module, unexport it and make it static for the act_ife module. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All merge conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02skbuff: add and use skb_nfct helperFlorian Westphal
Followup patch renames skb->nfct and changes its type so add a helper to avoid intrusive rename change later. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-01net/sched: act_psample: Remove unnecessary ASSERT_RTNLYotam Gigi
The ASSERT_RTNL is not necessary in the init function, as it does not touch any rtnl protected structures, as opposed to the mirred action which does have to hold a net device. Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01net/sched: act_sample: Fix error path in initYotam Gigi
Fix error path of in sample init, by releasing the tc hash in case of failure in psample_group creation. Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01net/sched: matchall: Fix configuration raceYotam Gigi
In the current version, the matchall internal state is split into two structs: cls_matchall_head and cls_matchall_filter. This makes little sense, as matchall instance supports only one filter, and there is no situation where one exists and the other does not. In addition, that led to some races when filter was deleted while packet was processed. Unify that two structs into one, thus simplifying the process of matchall creation and deletion. As a result, the new, delete and get callbacks have a dummy implementation where all the work is done in destroy and change callbacks, as was done in cls_cgroup. Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30net/sched: cls_flower: Correct matching on ICMPv6 codeSimon Horman
When matching on the ICMPv6 code ICMPV6_CODE rather than ICMPV4_CODE attributes should be used. This corrects what appears to be a typo. Sample usage: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol ipv6 parent ffff: flower \ indev eth0 ip_proto icmpv6 type 128 code 0 action drop Without this change the code parameter above is effectively ignored. Fixes: 7b684884fbfa ("net/sched: cls_flower: Support matching on ICMP type and code") Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net sched actions: Add support for user cookiesJamal Hadi Salim
Introduce optional 128-bit action cookie. Like all other cookie schemes in the networking world (eg in protocols like http or existing kernel fib protocol field, etc) the idea is to save user state that when retrieved serves as a correlator. The kernel _should not_ intepret it. The user can store whatever they wish in the 128 bits. Sample exercise(showing variable length use of cookie) .. create an accept action with cookie a1b2c3d4 sudo $TC actions add action ok index 1 cookie a1b2c3d4 .. dump all gact actions.. sudo $TC -s actions ls action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 0 installed 5 sec used 5 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 cookie a1b2c3d4 .. bind the accept action to a filter.. sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 1 ... send some traffic.. $ ping 127.0.0.1 -c 3 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.020 ms 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.027 ms 64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.038 ms Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net/sched: Introduce sample tc actionYotam Gigi
This action allows the user to sample traffic matched by tc classifier. The sampling consists of choosing packets randomly and sampling them using the psample module. The user can configure the psample group number, the sampling rate and the packet's truncation (to save kernel-user traffic). Example: To sample ingress traffic from interface eth1, one may use the commands: tc qdisc add dev eth1 handle ffff: ingress tc filter add dev eth1 parent ffff: \ matchall action sample rate 12 group 4 Where the first command adds an ingress qdisc and the second starts sampling randomly with an average of one sampled packet per 12 packets on dev eth1 to psample group 4. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20fq_codel: Avoid regenerating skb flow hash unless necessaryAndrew Collins
The fq_codel qdisc currently always regenerates the skb flow hash. This wastes some cycles and prevents flow seperation in cases where the traffic has been encrypted and can no longer be understood by the flow dissector. Change it to use the prexisting flow hash if one exists, and only regenerate if necessary. Signed-off-by: Andrew Collins <acollins@cradlepoint.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net/sched: cls_flower: reduce fl_change stack sizeArnd Bergmann
The new ARP support has pushed the stack size over the edge on ARM, as there are two large objects on the stack in this function (mask and tb) and both have now grown a bit more: net/sched/cls_flower.c: In function 'fl_change': net/sched/cls_flower.c:928:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] We can solve this by dynamically allocating one or both of them. I first tried to do it just for the mask, but that only saved 152 bytes on ARM, while this version just does it for the 'tb' array, bringing the stack size back down to 664 bytes. Fixes: 99d31326cbe6 ("net/sched: cls_flower: Support matching on ARP") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-01-16net sched actions: fix refcnt when GETing of action after bindJamal Hadi Salim
Demonstrating the issue: .. add a drop action $sudo $TC actions add action drop index 10 .. retrieve it $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 0 installed 29 sec used 29 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 ... bug 1 above: reference is two. Reference is actually 1 but we forget to subtract 1. ... do a GET again and we see the same issue try a few times and nothing changes ~$ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 0 installed 31 sec used 31 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 ... lets try to bind the action to a filter.. $ sudo $TC qdisc add dev lo ingress $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10 ... and now a few GETs: $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 3 bind 1 installed 204 sec used 204 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 4 bind 1 installed 206 sec used 206 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 5 bind 1 installed 235 sec used 235 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 .... as can be observed the reference count keeps going up. After the fix $ sudo $TC actions add action drop index 10 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 1 bind 0 installed 4 sec used 4 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 1 bind 0 installed 6 sec used 6 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC qdisc add dev lo ingress $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 1 installed 32 sec used 32 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 1 installed 33 sec used 33 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Fixes: aecc5cefc389 ("net sched actions: fix GETing actions") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16net/sched: cls_flower: Disallow duplicate internal elementsPaul Blakey
Flower currently allows having the same filter twice with the same priority. Actions (and statistics update) will always execute on the first inserted rule leaving the second rule unused. This patch disallows that. Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16bpf: rework prog_digest into prog_tagDaniel Borkmann
Commit 7bd509e311f4 ("bpf: add prog_digest and expose it via fdinfo/netlink") was recently discussed, partially due to admittedly suboptimal name of "prog_digest" in combination with sha1 hash usage, thus inevitably and rightfully concerns about its security in terms of collision resistance were raised with regards to use-cases. The intended use cases are for debugging resp. introspection only for providing a stable "tag" over the instruction sequence that both kernel and user space can calculate independently. It's not usable at all for making a security relevant decision. So collisions where two different instruction sequences generate the same tag can happen, but ideally at a rather low rate. The "tag" will be dumped in hex and is short enough to introspect in tracepoints or kallsyms output along with other data such as stack trace, etc. Thus, this patch performs a rename into prog_tag and truncates the tag to a short output (64 bits) to make it obvious it's not collision-free. Should in future a hash or facility be needed with a security relevant focus, then we can think about requirements, constraints, etc that would fit to that situation. For now, rework the exposed parts for the current use cases as long as nothing has been released yet. Tested on x86_64 and s390x. Fixes: 7bd509e311f4 ("bpf: add prog_digest and expose it via fdinfo/netlink") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11net/sched: cls_flower: Support matching on ARPSimon Horman
Support matching on ARP operation, and hardware and protocol addresses for Ethernet hardware and IPv4 protocol addresses. Example usage: tc qdisc add dev eth0 ingress tc filter add dev eth0 protocol arp parent ffff: flower indev eth0 \ arp_op request arp_sip 10.0.0.1 action drop tc filter add dev eth0 protocol rarp parent ffff: flower indev eth0 \ arp_op reply arp_tha 52:54:3f:00:00:00/24 action drop Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09net/sched: act_csum: compute crc32c on SCTP packetsDavide Caratti
modify act_csum to compute crc32c on IPv4/IPv6 packets having SCTP in their payload, and extend UAPI definitions accordingly. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09net/sched: Kconfig: select LIBCRC32C if NET_ACT_CSUM is selectedDavide Caratti
LIBCRC32C is needed to compute crc32c on SCTP packets. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09cls_u32: don't bother explicitly initializing ->divisor to zeroAlexandru Moise
This struct member is already initialized to zero upon root_ht's allocation via kzalloc(). Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net-tc: convert tc_from to tc_from_ingress and tc_redirectedWillem de Bruijn
The tc_from field fulfills two roles. It encodes whether a packet was redirected by an act_mirred device and, if so, whether act_mirred was called on ingress or egress. Split it into separate fields. The information is needed by the special IFB loop, where packets are taken out of the normal path by act_mirred, forwarded to IFB, then reinjected at their original location (ingress or egress) by IFB. The IFB device cannot use skb->tc_at_ingress, because that may have been overwritten as the packet travels from act_mirred to ifb_xmit, when it passes through tc_classify on the IFB egress path. Cache this value in skb->tc_from_ingress. That field is valid only if a packet arriving at ifb_xmit came from act_mirred. Other packets can be crafted to reach ifb_xmit. These must be dropped. Set tc_redirected on redirection and drop all packets that do not have this bit set. Both fields are set only on cloned skbs in tc actions, so original packet sources do not have to clear the bit when reusing packets (notably, pktgen and octeon). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net-tc: convert tc_at to tc_at_ingressWillem de Bruijn
Field tc_at is used only within tc actions to distinguish ingress from egress processing. A single bit is sufficient for this purpose. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net-tc: convert tc_verd to integer bitfieldsWillem de Bruijn
Extract the remaining two fields from tc_verd and remove the __u16 completely. TC_AT and TC_FROM are converted to equivalent two-bit integer fields tc_at and tc_from. Where possible, use existing helper skb_at_tc_ingress when reading tc_at. Introduce helper skb_reset_tc to clear fields. Not documenting tc_from and tc_at, because they will be replaced with single bit fields in follow-on patches. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net-tc: extract skip classify bit from tc_verdWillem de Bruijn
Packets sent by the IFB device skip subsequent tc classification. A single bit governs this state. Move it out of tc_verd in anticipation of removing that __u16 completely. The new bitfield tc_skip_classify temporarily uses one bit of a hole, until tc_verd is removed completely in a follow-up patch. Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long. With that many options, little value in documenting it. Introduce a helper function to deduplicate the logic in the two sites that check this bit. The field tc_skip_classify is set only in IFB on skbs cloned in act_mirred, so original packet sources do not have to clear the bit when reusing packets (notably, pktgen and octeon). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net-tc: make MAX_RECLASSIFY_LOOP localWillem de Bruijn
This field is no longer kept in tc_verd. Remove it from the global definition of that struct. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net: make ndo_get_stats64 a void functionstephen hemminger
The network device operation for reading statistics is only called in one place, and it ignores the return value. Having a structure return value is potentially confusing because some future driver could incorrectly assume that the return value was used. Fix all drivers with ndo_get_stats64 to have a void function. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-01-03net/sched: cls_matchall: Fix error pathYotam Gigi
Fix several error paths in matchall: - Release reference to actions in case the hardware fails offloading (relevant to skip_sw only) - Fix error path in case tcf_exts initialization/validation fail Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier") Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29net: dev_weight: TX/RX orthogonalityMatthias Tafelmeier
Oftenly, introducing side effects on packet processing on the other half of the stack by adjusting one of TX/RX via sysctl is not desirable. There are cases of demand for asymmetric, orthogonal configurability. This holds true especially for nodes where RPS for RFS usage on top is configured and therefore use the 'old dev_weight'. This is quite a common base configuration setup nowadays, even with NICs of superior processing support (e.g. aRFS). A good example use case are nodes acting as noSQL data bases with a large number of tiny requests and rather fewer but large packets as responses. It's affordable to have large budget and rx dev_weights for the requests. But as a side effect having this large a number on TX processed in one run can overwhelm drivers. This patch therefore introduces an independent configurability via sysctl to userland. Signed-off-by: Matthias Tafelmeier <matthias.tafelmeier@gmx.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28net/sched: cls_flower: Fix missing addr_type in classifyPaul Blakey
Since we now use a non zero mask on addr_type, we are matching on its value (IPV4/IPV6). So before this fix, matching on enc_src_ip/enc_dst_ip failed in SW/classify path since its value was zero. This patch sets the proper value of addr_type for encapsulated packets. Fixes: 970bfcd09791 ('net/sched: cls_flower: Use mask for addr_type') Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar. The most important is to not assume the packet is RX just because the destination address matches that of the device. Such an assumption causes problems when an interface is put into loopback mode. 2) If we retry when creating a new tc entry (because we dropped the RTNL mutex in order to load a module, for example) we end up with -EAGAIN and then loop trying to replay the request. But we didn't reset some state when looping back to the top like this, and if another thread meanwhile inserted the same tc entry we were trying to, we re-link it creating an enless loop in the tc chain. Fix from Daniel Borkmann. 3) There are two different WRITE bits in the MDIO address register for the stmmac chip, depending upon the chip variant. Due to a bug we could set them both, fix from Hock Leong Kweh. 4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: stmmac: fix incorrect bit set in gmac4 mdio addr register r8169: add support for RTL8168 series add-on card. net: xdp: remove unused bfp_warn_invalid_xdp_buffer() openvswitch: upcall: Fix vlan handling. ipv4: Namespaceify tcp_tw_reuse knob net: korina: Fix NAPI versus resources freeing net, sched: fix soft lockup in tc_classify net/mlx4_en: Fix user prio field in XDP forward tipc: don't send FIN message from connectionless socket ipvlan: fix multicast processing ipvlan: fix various issues in ipvlan_process_multicast()
2016-12-26net, sched: fix soft lockup in tc_classifyDaniel Borkmann
Shahar reported a soft lockup in tc_classify(), where we run into an endless loop when walking the classifier chain due to tp->next == tp which is a state we should never run into. The issue only seems to trigger under load in the tc control path. What happens is that in tc_ctl_tfilter(), thread A allocates a new tp, initializes it, sets tp_created to 1, and calls into tp->ops->change() with it. In that classifier callback we had to unlock/lock the rtnl mutex and returned with -EAGAIN. One reason why we need to drop there is, for example, that we need to request an action module to be loaded. This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning after we loaded and found the requested action, we need to redo the whole request so we don't race against others. While we had to unlock rtnl in that time, thread B's request was processed next on that CPU. Thread B added a new tp instance successfully to the classifier chain. When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN and destroying its tp instance which never got linked, we goto replay and redo A's request. This time when walking the classifier chain in tc_ctl_tfilter() for checking for existing tp instances we had a priority match and found the tp instance that was created and linked by thread B. Now calling again into tp->ops->change() with that tp was successful and returned without error. tp_created was never cleared in the second round, thus kernel thinks that we need to link it into the classifier chain (once again). tp and *back point to the same object due to the match we had earlier on. Thus for thread B's already public tp, we reset tp->next to tp itself and link it into the chain, which eventually causes the mentioned endless loop in tc_classify() once a packet hits the data path. Fix is to clear tp_created at the beginning of each request, also when we replay it. On the paths that can cause -EAGAIN we already destroy the original tp instance we had and on replay we really need to start from scratch. It seems that this issue was first introduced in commit 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup"). Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup") Reported-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-25ktime: Cleanup ktime_set() usageThomas Gleixner
ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25ktime: Get rid of the unionThomas Gleixner
ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64 bit machine and in a endianess optimized timespec variant for 32bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but become completely pointless. Get rid of the union and just keep ktime_t as simple typedef of type s64. The conversion was done with coccinelle and some manual mopping up. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-23net/sched: cls_flower: Mandate mask when matching on flagsOr Gerlitz
When matching on flags, we should require the user to provide the mask and avoid using an all-ones mask. Not doing so causes matching on flags provided w.o mask to hit on the value being unset for all flags, which may not what the user wanted to happen. Fixes: faa3ffce7829 ('net/sched: cls_flower: Add support for matching on flags') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23net/sched: act_tunnel_key: Fix setting UDP dst port in metadata under IPv6Or Gerlitz
The UDP dst port was provided to the helper function which sets the IPv6 IP tunnel meta-data under a wrong param order, fix that. Fixes: 75bfbca01e48 ('net/sched: act_tunnel_key: Add UDP dst port option') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20net_sched: sch_netem: use rb_entry()Geliang Tang
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-20net_sched: sch_fq: use rb_entry()Geliang Tang
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17net/sched: cls_flower: Use masked key when calling HW offloadsPaul Blakey
Zero bits on the mask signify a "don't care" on the corresponding bits in key. Some HWs require those bits on the key to be zero. Since these bits are masked anyway, it's okay to provide the masked key to all drivers. Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support') Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>