summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
2016-04-25netfilter/ipvs: use nla_put_u64_64bit()Nicolas Dichtel
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25Merge tag 'ipvs-for-v4.7' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.7 please consider these enhancements to the IPVS. They allow SIP connections originating from real-servers to be load balanced by the SIP psersitence engine as is already implemented in the other direction. And for better one packet scheduling (OPS) performance. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: nf_ct_helper: disable automatic helper assignmentPablo Neira Ayuso
Four years ago we introduced a new sysctl knob to disable automatic helper assignment in 72110dfaa907 ("netfilter: nf_ct_helper: disable automatic helper assignment"). This knob kept this behaviour enabled by default to remain conservative. This measure was introduced to provide a secure way to configure iptables and connection tracking helpers through explicit rules. Give the time we have waited for this, let's turn off this by default now, worse case users still have a chance to recover the former behaviour by explicitly enabling this back through sysctl. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: nft_rbtree: allow adjacent intervals with dynamic updatesPablo Neira Ayuso
This patch fixes dynamic element updates for adjacent intervals in the rb-tree representation. Since elements are sorted in the rb-tree, in case of adjacent nodes with the same key, the assumption is that an interval end node must be placed before an interval opening. In tree lookup operations, the idea is to search for the closer element that is smaller than the one we're searching for. Given that we'll have two possible matchings, we have to take the opening interval in case of adjacent nodes. Range merges are not trivial with the current representation, specifically we have to check if node extensions are equal and make sure we keep the existing internal states around. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: nft_rbtree: introduce nft_rbtree_interval_end() helperPablo Neira Ayuso
Add this new nft_rbtree_interval_end() helper function to check in the end interval is set. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: nf_tables: parse element flags from nft_del_setelem()Pablo Neira Ayuso
Parse flags and pass them to the set via ->deactivate() to check if we remove the right element from the intervals. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: nf_tables: introduce nft_setelem_parse_flags() helperPablo Neira Ayuso
This function parses the set element flags, thus, we can reuse the same handling when deleting elements. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: conntrack: use get_random_once for conntrack hash seedFlorian Westphal
As earlier commit removed accessed to the hash from other files we can also make it static. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: conntrack: use get_random_once for nat and expectationsFlorian Westphal
Use a private seed and init it using get_random_once. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-25netfilter: conntrack: move generation seqcnt out of netns_ctFlorian Westphal
We only allow rehash in init namespace, so we only use init_ns.generation. And even if we would allow it, it makes no sense as the conntrack locks are global; any ongoing rehash prevents insert/ delete. So make this private to nf_conntrack_core instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, mostly from Florian Westphal to sort out the lack of sufficient validation in x_tables and connlabel preparation patches to add nf_tables support. They are: 1) Ensure we don't go over the ruleset blob boundaries in mark_source_chains(). 2) Validate that target jumps land on an existing xt_entry. This extra sanitization comes with a performance penalty when loading the ruleset. 3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables. 4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables. 5) Make sure the minimal possible target size in x_tables. 6) Similar to #3, add xt_compat_check_entry_offsets() for compat code. 7) Check that standard target size is valid. 8) More sanitization to ensure that the target_offset field is correct. 9) Add xt_check_entry_match() to validate that matches are well-formed. 10-12) Three patch to reduce the number of parameters in translate_compat_table() for {arp,ip,ip6}tables by using a container structure. 13) No need to return value from xt_compat_match_from_user(), so make it void. 14) Consolidate translate_table() so it can be used by compat code too. 15) Remove obsolete check for compat code, so we keep consistent with what was already removed in the native layout code (back in 2007). 16) Get rid of target jump validation from mark_source_chains(), obsoleted by #2. 17) Introduce xt_copy_counters_from_user() to consolidate counter copying, and use it from {arp,ip,ip6}tables. 18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump functions. 19) Move nf_connlabel_match() to xt_connlabel. 20) Skip event notification if connlabel did not change. 21) Update of nf_connlabels_get() to make the upcoming nft connlabel support easier. 23) Remove spinlock to read protocol state field in conntrack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23libnl: nla_put_be64(): align on a 64-bit areaNicolas Dichtel
nla_data() is now aligned on a 64-bit area. A temporary version (nla_put_be64_32bit()) is added for nla_put_net64(). This function is removed in the next patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-20ipvs: don't alter conntrack in OPS modeMarco Angaroni
When using OPS mode in conjunction with SIP persistent-engine, packets originating from the same ip-address/port could be balanced to different real servers, and (to properly handle SIP responses) OPS connections are created in the in-out direction too, where ip_vs_update_conntrack() is called to modify the reply tuple. As a result, there can be collision of conntrack tuples, causing random packet drops, as explained below: conntrack1: orig=CIP->VIP, reply=RIP1->CIP conntrack2: orig=RIP2->CIP, reply=CIP->VIP Tuple CIP->VIP is both in orig of conntrack1 and reply of conntrack2. The collision triggers packet drop inside nf_conntrack processing. In addition, the current implementation deletes the conntrack object at every expire of an OPS connection (once every forwarded packet), to have it recreated from scratch at next packet traversing IPVS. Since in OPS mode, by definition, we don't expect any associated response, the choices implemented in this patch are: a) don't call nf_conntrack_alter_reply() for OPS connections inside ip_vs_update_conntrack(). b) don't delete the conntrack object at OPS connection expire. The result is that created conntrack objects for each tuple CIP->VIP, RIP-N->CIP, etc. are left in UNREPLIED state and not modified by IPVS OPS connection management. This eliminates packet drops and leaves a single conntrack object for each tuple packets are sent from. Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-04-20ipvs: optimize release of connections in OPS modeMarco Angaroni
One-packet-scheduling is the most expensive mode in IPVS from performance point of view: for each packet to be processed a new connection data structure is created and, after packet is sent, deleted by starting a new timer set to expire immediately. SIP persistent-engine needs OPS mode to have Call-ID based load balancing, so OPS mode performance has negative impact in SIP protocol load balancing. This patch aims to improve performance of OPS mode by means of the following changes in the release mechanism of OPS connections: a) call expire callback ip_vs_conn_expire() directly instead of starting a timer programmed to fire immediately. b) avoid call_rcu() overhead inside expire callback, since OPS connection are not inserted in the hash-table and last just the time to process the packet, hence there is no concurrent access to such data structures. Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-04-20ipvs: handle connections started by real-serversMarco Angaroni
When using LVS-NAT and SIP persistence-egine over UDP, the following limitations are present with current implementation: 1) To actually have load-balancing based on Call-ID header, you need to use one-packet-scheduling mode. But with one-packet-scheduling the connection is deleted just after packet is forwarded, so SIP responses coming from real-servers do not match any connection and SNAT is not applied. 2) If you do not use "-o" option, IPVS behaves as normal UDP load balancer, so different SIP calls (each one identified by a different Call-ID) coming from the same ip-address/port go to the same real-server. So basically you don’t have load-balancing based on Call-ID as intended. 3) Call-ID is not learned when a new SIP call is started by a real-server (inside-to-outside direction), but only in the outside-to-inside direction. This would be a general problem for all SIP servers acting as Back2BackUserAgent. This patch aims to solve problems 1) and 3) while keeping OPS mode mandatory for SIP-UDP, so that 2) is not a problem anymore. The basic mechanism implemented is to make packets, that do not match any existent connection but come from real-servers, create new connections instead of let them pass without any effect. When such packets pass through ip_vs_out(), if their source ip address and source port match a configured real-server, a new connection is automatically created in the same way as it would have happened if the packet had come from outside-to-inside direction. A new connection template is created too if the virtual-service is persistent and there is no matching connection template found. The new connection automatically created, if the service had "-o" option, is an OPS connection that lasts only the time to forward the packet, just like it happens on the ingress side. The main part of this mechanism is implemented inside a persistent-engine specific callback (at the moment only SIP persistent engine exists) and is triggered only for UDP packets, since connection oriented protocols, by using different set of ports (typically ephemeral ports) to open new outgoing connections, should not need this feature. The following requisites are needed for automatic connection creation; if any is missing the packet simply goes the same way as before. a) virtual-service is not fwmark based (this is because fwmark services do not store address and port of the virtual-service, required to build the connection data). b) virtual-service and real-servers must not have been configured with omitted port (this is again to have all data to create the connection). Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2016-04-19netfilter: conntrack: don't acquire lock during seq_printfFlorian Westphal
read access doesn't need any lock here. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-18netfilter: ctnetlink: restore inlining for netlink message size calculationPablo Neira Ayuso
Calm down gcc warnings: net/netfilter/nf_conntrack_netlink.c:529:15: warning: 'ctnetlink_proto_size' defined but not used [-Wunused-function] static size_t ctnetlink_proto_size(const struct nf_conn *ct) ^ net/netfilter/nf_conntrack_netlink.c:546:15: warning: 'ctnetlink_acct_size' defined but not used [-Wunused-function] static size_t ctnetlink_acct_size(const struct nf_conn *ct) ^ net/netfilter/nf_conntrack_netlink.c:556:12: warning: 'ctnetlink_secctx_size' defined but not used [-Wunused-function] static int ctnetlink_secctx_size(const struct nf_conn *ct) ^ net/netfilter/nf_conntrack_netlink.c:572:15: warning: 'ctnetlink_timestamp_size' defined but not used [-Wunused-function] static size_t ctnetlink_timestamp_size(const struct nf_conn *ct) ^ So gcc compiles them out when CONFIG_NF_CONNTRACK_EVENTS and CONFIG_NETFILTER_NETLINK_GLUE_CT are not set. Fixes: 4054ff45454a9a4 ("netfilter: ctnetlink: remove unnecessary inlining") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Arnd Bergmann <arnd@arndb.de>
2016-04-18netfilter: connlabels: change nf_connlabels_get bit arg to 'highest used'Florian Westphal
nf_connlabel_set() takes the bit number that we would like to set. nf_connlabels_get() however took the number of bits that we want to support. So e.g. nf_connlabels_get(32) support bits 0 to 31, but not 32. This changes nf_connlabels_get() to take the highest bit that we want to set. Callers then don't have to cope with a potential integer wrap when using nf_connlabels_get(bit + 1) anymore. Current callers are fine, this change is only to make folloup nft ct label set support simpler. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-18netfilter: labels: don't emit ct event if labels were not changedFlorian Westphal
make the replace function only send a ctnetlink event if the contents of the new set is different. Otherwise 'ct label set ct label | bar' will cause netlink event storm since we "replace" labels for each packet. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-18netfilter: connlabels: move helpers to xt_connlabelFlorian Westphal
Currently labels can only be set either by iptables connlabel match or via ctnetlink. Before adding nftables set support, clean up the clabel core and move helpers that nft will not need after all to the xtables module. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-16ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skbAlexander Duyck
This patch updates the IP tunnel core function iptunnel_handle_offloads so that we return an int and do not free the skb inside the function. This actually allows us to clean up several paths in several tunnels so that we can free the skb at one point in the path without having to have a secondary path if we are supporting tunnel offloads. In addition it should resolve some double-free issues I have found in the tunnels paths as I believe it is possible for us to end up triggering such an event in the case of fou or gue. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14netfilter: ctnetlink: remove unnecessary inliningPablo Neira Ayuso
Many of these functions are called from control plane path. Move ctnetlink_nlmsg_size() under CONFIG_NF_CONNTRACK_EVENTS to avoid a compilation warning when CONFIG_NF_CONNTRACK_EVENTS=n. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: introduce and use xt_copy_counters_from_userFlorian Westphal
The three variants use same copy&pasted code, condense this into a helper and use that. Make sure info.name is 0-terminated. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: do compat validation via translate_tableFlorian Westphal
This looks like refactoring, but its also a bug fix. Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few sanity tests that are done in the normal path. For example, we do not check for underflows and the base chain policies. While its possible to also add such checks to the compat path, its more copy&pastry, for instance we cannot reuse check_underflow() helper as e->target_offset differs in the compat case. Other problem is that it makes auditing for validation errors harder; two places need to be checked and kept in sync. At a high level 32 bit compat works like this: 1- initial pass over blob: validate match/entry offsets, bounds checking lookup all matches and targets do bookkeeping wrt. size delta of 32/64bit structures assign match/target.u.kernel pointer (points at kernel implementation, needed to access ->compatsize etc.) 2- allocate memory according to the total bookkeeping size to contain the translated ruleset 3- second pass over original blob: for each entry, copy the 32bit representation to the newly allocated memory. This also does any special match translations (e.g. adjust 32bit to 64bit longs, etc). 4- check if ruleset is free of loops (chase all jumps) 5-first pass over translated blob: call the checkentry function of all matches and targets. The alternative implemented by this patch is to drop steps 3&4 from the compat process, the translation is changed into an intermediate step rather than a full 1:1 translate_table replacement. In the 2nd pass (step #3), change the 64bit ruleset back to a kernel representation, i.e. put() the kernel pointer and restore ->u.user.name . This gets us a 64bit ruleset that is in the format generated by a 64bit iptables userspace -- we can then use translate_table() to get the 'native' sanity checks. This has two drawbacks: 1. we re-validate all the match and target entry structure sizes even though compat translation is supposed to never generate bogus offsets. 2. we put and then re-lookup each match and target. THe upside is that we get all sanity tests and ruleset validations provided by the normal path and can remove some duplicated compat code. iptables-restore time of autogenerated ruleset with 300k chains of form -A CHAIN0001 -m limit --limit 1/s -j CHAIN0002 -A CHAIN0002 -m limit --limit 1/s -j CHAIN0003 shows no noticeable differences in restore times: old: 0m30.796s new: 0m31.521s 64bit: 0m25.674s Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: xt_compat_match_from_user doesn't need a retvalFlorian Westphal
Always returned 0. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: validate all offsets and sizes in a ruleFlorian Westphal
Validate that all matches (if any) add up to the beginning of the target and that each match covers at least the base structure size. The compat path should be able to safely re-use the function as the structures only differ in alignment; added a BUILD_BUG_ON just in case we have an arch that adds padding as well. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: check for bogus target offsetFlorian Westphal
We're currently asserting that targetoff + targetsize <= nextoff. Extend it to also check that targetoff is >= sizeof(xt_entry). Since this is generic code, add an argument pointing to the start of the match/target, we can then derive the base structure size from the delta. We also need the e->elems pointer in a followup change to validate matches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: check standard target size tooFlorian Westphal
We have targets and standard targets -- the latter carries a verdict. The ip/ip6tables validation functions will access t->verdict for the standard targets to fetch the jump offset or verdict for chainloop detection, but this happens before the targets get checked/validated. Thus we also need to check for verdict presence here, else t->verdict can point right after a blob. Spotted with UBSAN while testing malformed blobs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: add compat version of xt_check_entry_offsetsFlorian Westphal
32bit rulesets have different layout and alignment requirements, so once more integrity checks get added to xt_check_entry_offsets it will reject well-formed 32bit rulesets. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: assert minimum target sizeFlorian Westphal
The target size includes the size of the xt_entry_target struct. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-14netfilter: x_tables: add and use xt_check_entry_offsetsFlorian Westphal
Currently arp/ip and ip6tables each implement a short helper to check that the target offset is large enough to hold one xt_entry_target struct and that t->u.target_size fits within the current rule. Unfortunately these checks are not sufficient. To avoid adding new tests to all of ip/ip6/arptables move the current checks into a helper, then extend this helper in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains the first batch of Netfilter updates for your net-next tree. 1) Define pr_fmt() in nf_conntrack, from Weongyo Jeong. 2) Define and register netfilter's afinfo for the bridge family, this comes in preparation for native nfqueue's bridge for nft, from Stephane Bryant. 3) Add new attributes to store layer 2 and VLAN headers to nfqueue, also from Stephane Bryant. 4) Parse new NFQA_VLAN and NFQA_L2HDR nfqueue netlink attributes coming from userspace, from Stephane Bryant. 5) Use net->ipv6.devconf_all->hop_limit instead of hardcoded hop_limit in IPv6 SYNPROXY, from Liping Zhang. 6) Remove unnecessary check for dst == NULL in nf_reject_ipv6, from Haishuang Yan. 7) Deinline ctnetlink event report functions, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-12netfilter: conntrack: move expectation event helper to ecache.cFlorian Westphal
Not performance critical, it is only invoked when an expectation is added/destroyed. While at it, kill unused nf_ct_expect_event() wrapper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-12netfilter: conntrack: de-inline nf_conntrack_eventmask_reportFlorian Westphal
Way too large; move it to nf_conntrack_ecache.c. Reduces total object size by 1216 byte on my machine. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-08Merge tag 'mac80211-next-for-davem-2016-04-06' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== For the 4.7 cycle, we have a number of changes: * Bob's mesh mode rhashtable conversion, this includes the rhashtable API change for allocation flags * BSSID scan, connect() command reassoc support (Jouni) * fast (optimised data only) and support for RSS in mac80211 (myself) * various smaller changes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07netfilter: nf_conntrack_tcp: Fix stack out of bounds when parsing TCP optionsJozsef Kadlecsik
Baozeng Ding reported a KASAN stack out of bounds issue - it uncovered that the TCP option parsing routines in netfilter TCP connection tracking could read one byte out of the buffer of the TCP options. Therefore in the patch we check that the available data length is large enough to parse both TCP option code and size. Reported-by: Baozeng Ding <sploving1@gmail.com> Tested-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-04-05rhashtable: accept GFP flags in rhashtable_walk_initBob Copeland
In certain cases, the 802.11 mesh pathtable code wants to iterate over all of the entries in the forwarding table from the receive path, which is inside an RCU read-side critical section. Enable walks inside atomic sections by allowing GFP_ATOMIC allocations for the walker state. Change all existing callsites to pass in GFP_KERNEL. Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Bob Copeland <me@bobcopeland.com> [also adjust gfs2/glock.c and rhashtable tests] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-04tcp/dccp: do not touch listener sk_refcnt under synfloodEric Dumazet
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes. By letting listeners use SOCK_RCU_FREE infrastructure, we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets, only listeners are impacted by this change. Peak performance under SYNFLOOD is increased by ~33% : On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps Most consuming functions are now skb_set_owner_w() and sock_wfree() contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-29netfilter: bridge: nf queue verdict to use NFQA_VLAN and NFQA_L2HDRStephane Bryant
This makes nf queues use NFQA_VLAN and NFQA_L2HDR in verdict to modify the original skb Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-29netfilter: bridge: pass L2 header and VLAN as netlink attributes in queues ↵Stephane Bryant
to userspace - This creates 2 netlink attribute NFQA_VLAN and NFQA_L2HDR. - These are filled up for the PF_BRIDGE family on the way to userspace. - NFQA_VLAN is a nested attribute, with the NFQA_VLAN_PROTO and the NFQA_VLAN_TCI carrying the corresponding vlan_proto and vlan_tci fields from the skb using big endian ordering (and using the CFI bit as the VLAN_TAG_PRESENT flag in vlan_tci as in the skb) Signed-off-by: Stephane Bryant <stephane.ml.bryant@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast ↵Pablo Neira Ayuso
fails When netlink unicast fails to deliver the message to userspace, we should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject the packet back to the stack. I think the user expects no packet drops when this flag is set due to queueing to userspace errors, no matter if related to the internal queue or when sending the netlink message to userspace. The userspace application will still get the ENOBUFS error via recvmsg() so the user still knows that, with the current configuration that is in place, the userspace application is not consuming the messages at the pace that the kernel needs. Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
2016-03-28netfilter: ipset: fix race condition in ipset save, swap and deleteVishwanath Pai
This fix adds a new reference counter (ref_netlink) for the struct ip_set. The other reference counter (ref) can be swapped out by ip_set_swap and we need a separate counter to keep track of references for netlink events like dump. Using the same ref counter for dump causes a race condition which can be demonstrated by the following script: ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \ counters ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset save & ipset swap hash_ip3 hash_ip2 ipset destroy hash_ip3 /* will crash the machine */ Swap will exchange the values of ref so destroy will see ref = 0 instead of ref = 1. With this fix in place swap will not succeed because ipset save still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink). Both delete and swap will error out if ref_netlink != 0 on the set. Note: The changes to *_head functions is because previously we would increment ref whenever we called these functions, we don't do that anymore. Reviewed-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: nf_conntrack: Uses pr_fmt() for logging.Weongyo Jeong
Uses pr_fmt() macro for debugging messages of nf_conntrack module. Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-15netfilter: nf_conntrack: consolidate lock/unlock into unlock_waitNicholas Mc Guire
The spin_lock()/spin_unlock() is synchronizing on the nf_conntrack_locks_all_lock which is equivalent to spin_unlock_wait() but the later should be more efficient. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-12netfilter: x_tables: check for size overflowFlorian Westphal
Ben Hawkes says: integer overflow in xt_alloc_table_info, which on 32-bit systems can lead to small structure allocation and a copy_from_user based heap corruption. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-11netfilter: nft_compat: check match/targetinfo attr sizeFlorian Westphal
We copy according to ->target|matchsize, so check that the netlink attribute (which can include padding and might be larger) contains enough data. Reported-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-11Merge tag 'ipvs-fixes-for-v4.5' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== please consider these IPVS fixes for v4.5 or if it is too late please consider them for v4.6. * Arnd Bergman has corrected an error whereby the SIP persistence engine may incorrectly access protocol fields * Julian Anastasov has corrected a problem reported by Jiri Bohac with the connection rescheduling mechanism added in 3.10 when new SYNs in connection to dead real server can be redirected to another real server. * Marco Angaroni resolved a problem in the SIP persistence engine whereby the Call-ID could not be found if it was at the beginning of a SIP message. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-10Merge branch 'master' of git://blackhole.kfki.hu/nfPablo Neira Ayuso
Jozsef Kadlecsik says: ==================== Please apply the next patch against the nf tree: - Deniz Eren reported that parallel flush/dump of list:set type of sets can lead to kernel crash. The bug was due to non-RCU compatible flushing, listing in the set type, fixed by me. - Julia Lawall pointed out that IPSET_ATTR_ETHER netlink attribute length was not checked explicitly. The patch adds the missing checkings. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-08netfilter: ipset: Check IPSET_ATTR_ETHER netlink attribute lengthJozsef Kadlecsik
Julia Lawall pointed out that IPSET_ATTR_ETHER netlink attribute length was not checked explicitly, just for the maximum possible size. Malicious netlink clients could send shorter attribute and thus resulting a kernel read after the buffer. The patch adds the explicit length checkings. Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>