summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
2015-07-02netfilter: nfnetlink: keep going batch handling on missing modulesPablo Neira Ayuso
After a fresh boot with no modules in place at all and a large rulesets, the existing nfnetlink_rcv_batch() funcion can take long time to commit the ruleset due to the many abort path. This is specifically a problem for the existing client of this code, ie. nf_tables, since it results in several synchronize_rcu() call in a row. This patch changes the policy to keep full batch processing on missing modules errors so we abort only once. Reported-by: Eric Leblond <eric@regit.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-02netfilter: nf_queue: Don't recompute the hook_list headEric W. Biederman
If someone sends packets from one of the netdevice ingress hooks to the a userspace queue, and then userspace later accepts the packet, the netfilter code can enter an infinite loop as the list head will never be found. Pass in the saved list_head to avoid this. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-23netfilter: nf_qeueue: Drop queue entries on nf_unregister_hookEric W. Biederman
Add code to nf_unregister_hook to flush the nf_queue when a hook is unregistered. This guarantees that the pointer that the nf_queue code retains into the nf_hook list will remain valid while a packet is queued. I tested what would happen if we do not flush queued packets and was trivially able to obtain the oops below. All that was required was to stop the nf_queue listening process, to delete all of the nf_tables, and to awaken the nf_queue listening process. > BUG: unable to handle kernel paging request at 0000000100000001 > IP: [<0000000100000001>] 0x100000001 > PGD b9c35067 PUD 0 > Oops: 0010 [#1] SMP > Modules linked in: > CPU: 0 PID: 519 Comm: lt-nfqnl_test Not tainted > task: ffff8800b9c8c050 ti: ffff8800ba9d8000 task.ti: ffff8800ba9d8000 > RIP: 0010:[<0000000100000001>] [<0000000100000001>] 0x100000001 > RSP: 0018:ffff8800ba9dba40 EFLAGS: 00010a16 > RAX: ffff8800bab48a00 RBX: ffff8800ba9dba90 RCX: ffff8800ba9dba90 > RDX: ffff8800b9c10128 RSI: ffff8800ba940900 RDI: ffff8800bab48a00 > RBP: ffff8800b9c10128 R08: ffffffff82976660 R09: ffff8800ba9dbb28 > R10: dead000000100100 R11: dead000000200200 R12: ffff8800ba940900 > R13: ffffffff8313fd50 R14: ffff8800b9c95200 R15: 0000000000000000 > FS: 00007fb91fc34700(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000100000001 CR3: 00000000babfb000 CR4: 00000000000007f0 > Stack: > ffffffff8206ab0f ffffffff82982240 ffff8800bab48a00 ffff8800b9c100a8 > ffff8800b9c10100 0000000000000001 ffff8800ba940900 ffff8800b9c10128 > ffffffff8206bd65 ffff8800bfb0d5e0 ffff8800bab48a00 0000000000014dc0 > Call Trace: > [<ffffffff8206ab0f>] ? nf_iterate+0x4f/0xa0 > [<ffffffff8206bd65>] ? nf_reinject+0x125/0x190 > [<ffffffff8206dee5>] ? nfqnl_recv_verdict+0x255/0x360 > [<ffffffff81386290>] ? nla_parse+0x80/0xf0 > [<ffffffff8206c42c>] ? nfnetlink_rcv_msg+0x13c/0x240 > [<ffffffff811b2fec>] ? __memcg_kmem_get_cache+0x4c/0x150 > [<ffffffff8206c2f0>] ? nfnl_lock+0x20/0x20 > [<ffffffff82068159>] ? netlink_rcv_skb+0xa9/0xc0 > [<ffffffff820677bf>] ? netlink_unicast+0x12f/0x1c0 > [<ffffffff82067ade>] ? netlink_sendmsg+0x28e/0x650 > [<ffffffff81fdd814>] ? sock_sendmsg+0x44/0x50 > [<ffffffff81fde07b>] ? ___sys_sendmsg+0x2ab/0x2c0 > [<ffffffff810e8f73>] ? __wake_up+0x43/0x70 > [<ffffffff8141a134>] ? tty_write+0x1c4/0x2a0 > [<ffffffff81fde9f4>] ? __sys_sendmsg+0x44/0x80 > [<ffffffff823ff8d7>] ? system_call_fastpath+0x12/0x6a > Code: Bad RIP value. > RIP [<0000000100000001>] 0x100000001 > RSP <ffff8800ba9dba40> > CR2: 0000000100000001 > ---[ end trace 08eb65d42362793f ]--- Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-23netfilter: nftables: Do not run chains in the wrong network namespaceEric W. Biederman
Currenlty nf_tables chains added in one network namespace are being run in all network namespace. The issues are myriad with the simplest being an unprivileged user can cause any network packets to be dropped. Address this by simply not running nf_tables chains in the wrong network namespace. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-18netfilter: use forward declaration instead of including linux/proc_fs.hPablo Neira Ayuso
We don't need to pull the full definitions in that file, a simple forward declaration is enough. Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a compilation error due to missing declarations, ie. net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’: net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration] if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat, ^ Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2015-06-18netfilter: Kill unused copies of RCV_SKB_FAILEric W. Biederman
This appears to have been a dead macro in both nfnetlink_log.c and nfnetlink_queue_core.c since these pieces of code were added in 2005. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-18netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flagHarout Hedeshian
xt_socket is useful for matching sockets with IP_TRANSPARENT and taking some action on the matching packets. However, it lacks the ability to match only a small subset of transparent sockets. Suppose there are 2 applications, each with its own set of transparent sockets. The first application wants all matching packets dropped, while the second application wants them forwarded somewhere else. Add the ability to retore the skb->mark from the sk_mark. The mark is only restored if a matching socket is found and the transparent / nowildcard conditions are satisfied. Now the 2 hypothetical applications can differentiate their sockets based on a mark value set with SO_MARK. iptables -t mangle -I PREROUTING -m socket --transparent \ --restore-skmark -j action iptables -t mangle -A action -m mark --mark 10 -j action2 iptables -t mangle -A action -m mark --mark 11 -j action3 Signed-off-by: Harout Hedeshian <harouth@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-18netfilter: nfnetlink_queue: add security context informationRoman Kubiak
This patch adds an additional attribute when sending packet information via netlink in netfilter_queue module. It will send additional security context data, so that userspace applications can verify this context against their own security databases. Signed-off-by: Roman Kubiak <r.kubiak@samsung.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-15netfilter: nf_tables_netdev: unregister hooks on net_device removalPablo Neira Ayuso
In case the net_device is gone, we have to unregister the hooks and put back the reference on the net_device object. Once it comes back, register them again. This also covers the device rename case. This patch also adds a new flag to indicate that the basechain is disabled, so their hooks are not registered. This flag is used by the netdev family to handle the case where the net_device object is gone. Currently this flag is not exposed to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-15netfilter: nf_tables: add nft_register_basechain() and ↵Pablo Neira Ayuso
nft_unregister_basechain() This wrapper functions take care of hook registration for basechains. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-15netfilter: nf_tables: attach net_device to basechainPablo Neira Ayuso
The device is part of the hook configuration, so instead of a global configuration per table, set it to each of the basechain that we create. This patch reworks ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to net_device"). Note that this adds a dev_name field in the nft_base_chain structure which is required the netdev notification subscription that follows up in a patch to handle gone net_devices. Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-15netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference.Eric Dumazet
After Florian patches, there is no need for XT_TABLE_INFO_SZ anymore : Only one copy of table is kept, instead of one copy per cpu. We also can avoid a dereference if we put table data right after xt_table_info. It reduces register pressure and helps compiler. Then, we attempt a kmalloc() if total size is under order-3 allocation, to reduce TLB pressure, as in many cases, rules fit in 32 KB. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-15Merge branch 'master' of git://blackhole.kfki.hu/nf-nextPablo Neira Ayuso
Jozsef Kadlecsik says: ==================== ipset patches for nf-next Please consider to apply the next bunch of patches for ipset. First comes the small changes, then the bugfixes and at the end the RCU related patches. * Use MSEC_PER_SEC consistently instead of the number. * Use SET_WITH_*() helpers to test set extensions from Sergey Popovich. * Check extensions attributes before getting extensions from Sergey Popovich. * Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich. * Make sure we always return line number on batch in the case of error from Sergey Popovich. * Check CIDR value only when attribute is given from Sergey Popovich. * Fix cidr handling for hash:*net* types, reported by Jonathan Johnson. * Fix parallel resizing and listing of the same set so that the original set is kept for the whole dumping. * Make sure listing doesn't grab a set which is just being destroyed. * Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU. * Replace rwlock_t with spinlock_t in "struct ip_set", change the locking in the core and simplifications in the timeout routines. * Introduce RCU locking in bitmap:* types with a slight modification in the logic on how an element is added. * Introduce RCU locking in hash:* types. This is the most complex part of the changes. * Introduce RCU locking in list type where standard rculist is used. * Fix coding styles reported by checkpatch.pl. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-15netfilter: Kconfig: get rid of parens around depends onPablo Neira Ayuso
According to the reporter, they are not needed. Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-14netfilter: ipset: Fix coding styles reported by checkpatch.plJozsef Kadlecsik
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Introduce RCU locking in list typeJozsef Kadlecsik
Standard rculist is used. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Introduce RCU locking in hash:* typesJozsef Kadlecsik
Three types of data need to be protected in the case of the hash types: a. The hash buckets: standard rcu pointer operations are used. b. The element blobs in the hash buckets are stored in an array and a bitmap is used for book-keeping to tell which elements in the array are used or free. c. Networks per cidr values and the cidr values themselves are stored in fix sized arrays and need no protection. The values are modified in such an order that in the worst case an element testing is repeated once with the same cidr value. The ipset hash approach uses arrays instead of lists and therefore is incompatible with rhashtable. Performance is tested by Jesper Dangaard Brouer: Simple drop in FORWARD ~~~~~~~~~~~~~~~~~~~~~~ Dropping via simple iptables net-mask match:: iptables -t raw -N simple || iptables -t raw -F simple iptables -t raw -I simple -s 198.18.0.0/15 -j DROP iptables -t raw -D PREROUTING -j simple iptables -t raw -I PREROUTING -j simple Drop performance in "raw": 11.3Mpps Generator: sending 12.2Mpps (tx:12264083 pps) Drop via original ipset in RAW table ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Create a set with lots of elements:: sudo ./ipset destroy test echo "create test hash:ip hashsize 65536" > test.set for x in `seq 0 255`; do for y in `seq 0 255`; do echo "add test 198.18.$x.$y" >> test.set done done sudo ./ipset restore < test.set Dropping via ipset:: iptables -t raw -F iptables -t raw -N net198 || iptables -t raw -F net198 iptables -t raw -I net198 -m set --match-set test src -j DROP iptables -t raw -I PREROUTING -j net198 Drop performance in "raw" with ipset: 8Mpps Perf report numbers ipset drop in "raw":: + 24.65% ksoftirqd/1 [ip_set] [k] ip_set_test - 21.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_lock_bh - _raw_read_lock_bh + 99.88% ip_set_test - 19.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_unlock_bh - _raw_read_unlock_bh + 99.72% ip_set_test + 4.31% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_kadt + 2.27% ksoftirqd/1 [ixgbe] [k] ixgbe_fetch_rx_buffer + 2.18% ksoftirqd/1 [ip_tables] [k] ipt_do_table + 1.81% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_test + 1.61% ksoftirqd/1 [kernel.kallsyms] [k] __netif_receive_skb_core + 1.44% ksoftirqd/1 [kernel.kallsyms] [k] build_skb + 1.42% ksoftirqd/1 [kernel.kallsyms] [k] ip_rcv + 1.36% ksoftirqd/1 [kernel.kallsyms] [k] __local_bh_enable_ip + 1.16% ksoftirqd/1 [kernel.kallsyms] [k] dev_gro_receive + 1.09% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_unlock + 0.96% ksoftirqd/1 [ixgbe] [k] ixgbe_clean_rx_irq + 0.95% ksoftirqd/1 [kernel.kallsyms] [k] __netdev_alloc_frag + 0.88% ksoftirqd/1 [kernel.kallsyms] [k] kmem_cache_alloc + 0.87% ksoftirqd/1 [xt_set] [k] set_match_v3 + 0.85% ksoftirqd/1 [kernel.kallsyms] [k] inet_gro_receive + 0.83% ksoftirqd/1 [kernel.kallsyms] [k] nf_iterate + 0.76% ksoftirqd/1 [kernel.kallsyms] [k] put_compound_page + 0.75% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_lock Drop via ipset in RAW table with RCU-locking ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With RCU locking, the RW-lock is gone. Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Introduce RCU locking in bitmap:* typesJozsef Kadlecsik
There's nothing much required because the bitmap types use atomic bit operations. However the logic of adding elements slightly changed: first the MAC address updated (which is not atomic), then the element activated (added). The extensions may call kfree_rcu() therefore we call rcu_barrier() at module removal. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Prepare the ipset core to use RCU at set levelJozsef Kadlecsik
Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking accordingly. Convert the comment extension into an rcu-avare object. Also, simplify the timeout routines. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter:ipset Remove rbtree from hash:net,ifaceJozsef Kadlecsik
Remove rbtree in order to introduce RCU instead of rwlock in ipset Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Make sure listing doesn't grab a set which is just being ↵Jozsef Kadlecsik
destroyed. There was a small window when all sets are destroyed and a concurrent listing of all sets could grab a set which is just being destroyed. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Fix parallel resizing and listing of the same setJozsef Kadlecsik
When elements added to a hash:* type of set and resizing triggered, parallel listing could start to list the original set (before resizing) and "continue" with listing the new set. Fix it by references and using the original hash table for listing. Therefore the destroying of the original hash table may happen from the resizing or listing functions. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Fix cidr handling for hash:*net* typesJozsef Kadlecsik
Commit "Simplify cidr handling for hash:*net* types" broke the cidr handling for the hash:*net* types when the sets were used by the SET target: entries with invalid cidr values were added to the sets. Reported by Jonathan Johnson. Testsuite entry is added to verify the fix. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Check CIDR value only when attribute is givenSergey Popovich
There is no reason to check CIDR value regardless attribute specifying CIDR is given. Initialize cidr array in element structure on element structure declaration to let more freedom to the compiler to optimize initialization right before element structure is used. Remove local variables cidr and cidr2 for netnet and netportnet hashes as we do not use packed cidr value for such set types and can store value directly in e.cidr[]. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Make sure we always return line number on batchSergey Popovich
Even if we return with generic IPSET_ERR_PROTOCOL it is good idea to return line number if we called in batch mode. Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED or IPSET_ERR_INVALID_CIDR. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6Sergey Popovich
Permit userspace to supply CIDR length equal to the host address CIDR length in netlink message. Prohibit any other CIDR length for IPv6 variant of the set. Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of generic -IPSET_ERR_PROTOCOL in IPv6 variant of hash:ip,port,net when IPSET_ATTR_IP_TO attribute is given. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Check extensions attributes before getting extensions.Sergey Popovich
Make all extensions attributes checks within ip_set_get_extensions() and reduce number of duplicated code. Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-14netfilter: ipset: Use SET_WITH_*() helpers to test set extensionsSergey Popovich
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2015-06-12netfilter: xtables: avoid percpu ruleset duplicationFlorian Westphal
We store the rule blob per (possible) cpu. Unfortunately this means we can waste lot of memory on big smp machines. ipt_entry structure ('rule head') is 112 byte, so e.g. with maxcpu=64 one single rule eats close to 8k RAM. Since previous patch made counters percpu it appears there is nothing left in the rule blob that needs to be percpu. On my test system (144 possible cpus, 400k dummy rules) this change saves close to 9 Gigabyte of RAM. Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-06-12netfilter: conntrack: warn the user if there is a better helper to useMarcelo Ricardo Leitner
After db29a9508a92 ("netfilter: conntrack: disable generic tracking for known protocols"), if the specific helper is built but not loaded (a standard for most distributions) systems with a restrictive firewall but weak configuration regarding netfilter modules to load, will silently stop working. This patch then puts a warning message so the sysadmin knows where to start looking into. It's a pr_warn_once regardless of protocol itself but it should be enough to give a hint on where to look. Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, they are: 1) default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all options. 2) Allow to bind a table to net_device. This introduces the internal NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding. This is required by the next patch. 3) Add the 'netdev' table family, this new table allows you to create ingress filter basechains. This provides access to the existing nf_tables features from ingress. 4) Kill unused argument from compat_find_calc_{match,target} in ip_tables and ip6_tables, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26netfilter: nf_tables: add netdev table to filter from ingressPablo Neira Ayuso
This allows us to create netdev tables that contain ingress chains. Use skb_header_pointer() as we may see shared sk_buffs at this stage. This change provides access to the existing nf_tables features from the ingress hook. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-26netfilter: nf_tables: allow to bind table to net_devicePablo Neira Ayuso
This patch adds the internal NFT_AF_NEEDS_DEV flag to indicate that you must attach this table to a net_device. This change is required by the follow up patch that introduces the new netdev table. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-26netfilter: default CONFIG_NETFILTER_INGRESS to yPablo Neira Ayuso
Useful to compile-test all options. Suggested-by: Alexei Stavoroitov <ast@plumgrid.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-25ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flagsMartin KaFai Lau
The neighbor look-up used to depend on the rt6i_gateway (if there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone) as the nexthop address. Note that rt6i_dst is set to fl6->daddr for the RTF_CACHE clone where fl6->daddr is the one used to do the route look-up. Now, we only create RTF_CACHE clone after encountering exception. When doing the neighbor look-up with a route that is neither a gateway nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop. In some cases, the daddr in skb is not the one used to do the route look-up. One example is in ip_vs_dr_xmit_v6() where the real nexthop server address is different from the one in the skb. This patch is going to follow the IPv4 approach and ask the ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly. In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH and create a RTF_CACHE clone. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25ipv6: Add rt6_get_cookie() functionMartin KaFai Lau
Instead of doing the rt6->rt6i_node check whenever we need to get the route's cookie. Refactor it into rt6_get_cookie(). It is a prep work to handle FLOWI_FLAG_KNOWN_NH and also percpu rt6_info later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCASTMartin KaFai Lau
When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst. Also, rt6i_gateway is always set to the nexthop while the nexthop could be a gateway or the rt6i_dst.addr. After removing the rt6i_dst and rt6i_src dependency in the last patch, we also need to stop the caller from depending on rt6i_gateway and RTF_ANYCAST. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25ipv6: Remove external dependency on rt6i_dst and rt6i_srcMartin KaFai Lau
This patch removes the assumptions that the returned rt is always a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the destination and source address. The dst and src can be recovered from the calling site. We may consider to rename (rt6i_dst, rt6i_src) to (rt6i_key_dst, rt6i_key_src) later. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20netfilter: nfnetlink_{log,queue}: Register pernet in first placeFrancesco Ruggeri
nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event before registering the pernet_subsys, but the callback relies on data structures allocated by pernet init functions. When nfnetlink_{log,queue} is loaded, if a netlink message is received after the netlink callback is registered but before the pernet_subsys is registered, the kernel will panic in the sequence nfulnl_rcv_nl_event nfnl_log_pernet net_generic BUG_ON(id == 0) where id is nfnl_log_net_id. The panic can be easily reproduced in 4.0.3 by: while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done & while true ;do ip netns add dummy ; ip netns del dummy ; done & This patch moves register_pernet_subsys to earlier in nfnetlink_log_init. Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd308 ["netns: remove BUG_ONs from net_generic()"]. Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next. Briefly speaking, cleanups and minor fixes for ipset from Jozsef Kadlecsik and Serget Popovich, more incremental updates to make br_netfilter a better place from Florian Westphal, ARP support to the x_tables mark match / target from and context Zhang Chunyu and the addition of context to know that the x_tables runs through nft_compat. More specifically, they are: 1) Fix sparse warning in ipset/ip_set_hash_ipmark.c when fetching the IPSET_ATTR_MARK netlink attribute, from Jozsef Kadlecsik. 2) Rename STREQ macro to STRNCMP in ipset, also from Jozsef. 3) Use skb->network_header to calculate the transport offset in ip_set_get_ip{4,6}_port(). From Alexander Drozdov. 4) Reduce memory consumption per element due to size miscalculation, this patch and follow up patches from Sergey Popovich. 5) Expand nomatch field from 1 bit to 8 bits to allow to simplify mtype_data_reset_flags(), also from Sergey. 6) Small clean for ipset macro trickery. 7) Fix error reporting when both ip_set_get_hostipaddr4() and ip_set_get_extensions() from per-set uadt functions. 8) Simplify IPSET_ATTR_PORT netlink attribute validation. 9) Introduce HOST_MASK instead of hardcoded 32 in ipset. 10) Return true/false instead of 0/1 in functions that return boolean in the ipset code. 11) Validate maximum length of the IPSET_ATTR_COMMENT netlink attribute. 12) Allow to dereference from ext_*() ipset macros. 13) Get rid of incorrect definitions of HKEY_DATALEN. 14) Include linux/netfilter/ipset/ip_set.h in the x_tables set match. 15) Reduce nf_bridge_info size in br_netfilter, from Florian Westphal. 16) Release nf_bridge_info after POSTROUTING since this is only needed from the physdev match, also from Florian. 17) Reduce size of ipset code by deinlining ip_set_put_extensions(), from Denys Vlasenko. 18) Oneliner to add ARP support to the x_tables mark match/target, from Zhang Chunyu. 19) Add context to know if the x_tables extension runs from nft_compat, to address minor problems with three existing extensions. 20) Correct return value in several seqfile *_show() functions in the netfilter tree, from Joe Perches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17netfilter: Use correct return for seq_show functionsJoe Perches
Using seq_has_overflowed doesn't produce the right return value. Either 0 or -1 is, but 0 is much more common and works well when seq allocation retries. I believe this doesn't matter as the initial allocation is always sufficient, this is just a correctness patch. Miscellanea: o Don't use strlen, use *ptr to determine if a string should be emitted like all the other tests here o Delete unnecessary return statements Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15netfilter: nf_tables: fix bogus warning in nft_data_uninit()Mirek Kratochvil
The values 0x00000000-0xfffffeff are reserved for userspace datatype. When, deleting set elements with maps, a bogus warning is triggered. WARNING: CPU: 0 PID: 11133 at net/netfilter/nf_tables_api.c:4481 nft_data_uninit+0x35/0x40 [nf_tables]() This fixes the check accordingly to enum definition in include/linux/netfilter/nf_tables.h Fixes: https://bugzilla.netfilter.org/show_bug.cgi?id=1013 Signed-off-by: Mirek Kratochvil <exa.exa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15conntrack: RFC5961 challenge ACK confuse conntrack LAST-ACK transitionJesper Dangaard Brouer
In compliance with RFC5961, the network stack send challenge ACK in response to spurious SYN packets, since commit 0c228e833c88 ("tcp: Restore RFC5961-compliant behavior for SYN packets"). This pose a problem for netfilter conntrack in state LAST_ACK, because this challenge ACK is (falsely) seen as ACKing last FIN, causing a false state transition (into TIME_WAIT). The challenge ACK is hard to distinguish from real last ACK. Thus, solution introduce a flag that tracks the potential for seeing a challenge ACK, in case a SYN packet is let through and current state is LAST_ACK. When conntrack transition LAST_ACK to TIME_WAIT happens, this flag is used for determining if we are expecting a challenge ACK. Scapy based reproducer script avail here: https://github.com/netoptimizer/network-testing/blob/master/scapy/tcp_hacks_3WHS_LAST_ACK.py Fixes: 0c228e833c88 ("tcp: Restore RFC5961-compliant behavior for SYN packets") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15netfilter: avoid build error if TPROXY/SOCKET=y && NF_DEFRAG_IPV6=mFlorian Westphal
With TPROXY=y but DEFRAG_IPV6=m we get build failure: net/built-in.o: In function `tproxy_tg_init': net/netfilter/xt_TPROXY.c:588: undefined reference to `nf_defrag_ipv6_enable' If DEFRAG_IPV6 is modular, TPROXY must be too. (or both must be builtin). This enforces =m for both. Reported-and-tested-by: Liu Hua <liusdu@126.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-15netfilter: x_tables: add context to know if extension runs from nft_compatPablo Neira Ayuso
Currently, we have four xtables extensions that cannot be used from the xt over nft compat layer. The problem is that they need real access to the full blown xt_entry to validate that the rule comes with the right dependencies. This check was introduced to overcome the lack of sufficient userspace dependency validation in iptables. To resolve this problem, this patch introduces a new field to the xt_tgchk_param structure that tell us if the extension is run from nft_compat context. The three affected extensions are: 1) CLUSTERIP, this target has been superseded by xt_cluster. So just bail out by returning -EINVAL. 2) TCPMSS. Relax the checking when used from nft_compat. If used with the wrong configuration, it will corrupt !syn packets by adding TCP MSS option. 3) ebt_stp. Relax the check to make sure it uses the reserved destination MAC address for STP. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
2015-05-14netfilter: xt_MARK: Add ARP supportZhang Chunyu
Add arpt_MARK to xt_mark. The corresponding userspace update is available at: http://git.netfilter.org/arptables/commit/?id=4bb2f8340783fd3a3f70aa6f8807428a280f8474 Signed-off-by: Zhang Chunyu <zhangcy@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-14netfilter: ipset: deinline ip_set_put_extensions()Denys Vlasenko
On x86 allyesconfig build: The function compiles to 489 bytes of machine code. It has 25 callsites. text data bss dec hex filename 82441375 22255384 20627456 125324215 7784bb7 vmlinux.before 82434909 22255384 20627456 125317749 7783275 vmlinux Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> CC: Eric W. Biederman <ebiederm@xmission.com> CC: David S. Miller <davem@davemloft.net> CC: Jan Engelhardt <jengelh@medozas.de> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-14netfilter: add netfilter ingress hook after handle_ing() under unique static keyPablo Neira
This patch adds the Netfilter ingress hook just after the existing tc ingress hook, that seems to be the consensus solution for this. Note that the Netfilter hook resides under the global static key that enables ingress filtering. Nonetheless, Netfilter still also has its own static key for minimal impact on the existing handle_ing(). * Without this patch: Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags) 16086246pps 7721Mb/sec (7721398080bps) errors: 100000000 42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch: Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags) 16090536pps 7723Mb/sec (7723457280bps) errors: 100000000 41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb * Without this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags) 10788648pps 5178Mb/sec (5178551040bps) errors: 100000000 40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.77% kpktgend_0 [cls_u32] [k] u32_classify 5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker 3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify 2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags) 10743194pps 5156Mb/sec (5156733120bps) errors: 100000000 42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.70% kpktgend_0 [cls_u32] [k] u32_classify 5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker 2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify 1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk Note that the results are very similar before and after. I can see gcc gets the code under the ingress static key out of the hot path. Then, on that cold branch, it generates the code to accomodate the netfilter ingress static key. My explanation for this is that this reduces the pressure on the instruction cache for non-users as the new code is out of the hot path, and it comes with minimal impact for tc ingress users. Using gcc version 4.8.4 on: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 [...] L1d cache: 16K L1i cache: 64K L2 cache: 2048K L3 cache: 8192K Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14netfilter: add hook list to nf_hook_statePablo Neira
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>