summaryrefslogtreecommitdiff
path: root/net/netfilter/ipvs
AgeCommit message (Collapse)Author
2014-08-28ipvs: fix ipv6 hook registration for local repliesJulian Anastasov
commit fc604767613b6d2036cdc35b660bc39451040a47 ("ipvs: changes for local real server") from 2.6.37 introduced DNAT support to local real server but the IPv6 LOCAL_OUT handler ip_vs_local_reply6() is registered incorrectly as IPv4 hook causing any outgoing IPv4 traffic to be dropped depending on the IP header values. Chris tracked down the problem to CONFIG_IP_VS_IPV6=y Bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349768 Reported-by: Chris J Arges <chris.j.arges@canonical.com> Tested-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-08-27ipvs: properly declare tunnel encapsulationJulian Anastasov
The tunneling method should properly use tunnel encapsulation. Fixes problem with CHECKSUM_PARTIAL packets when TCP/UDP csum offload is supported. Thanks to Alex Gartrell for reporting the problem, providing solution and for all suggestions. Reported-by: Alex Gartrell <agartrell@fb.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Alex Gartrell <agartrell@fb.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-08-08netfilter: don't use mutex_lock_interruptible()Pablo Neira Ayuso
Eric Dumazet reports that getsockopt() or setsockopt() sometimes returns -EINTR instead of -ENOPROTOOPT, causing headaches to application developers. This patch replaces all the mutex_lock_interruptible() by mutex_lock() in the netfilter tree, as there is no reason we should sleep for a long time there. Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Julian Anastasov <ja@ssi.bg>
2014-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/Makefile net/ipv6/sysctl_net_ipv6.c Two ipv6_table_template[] additions overlap, so the index of the ipv6_table[x] assignments needed to be adjusted. In the drivers/net/Makefile case, we've gotten rid of the garbage whereby we had to list every single USB networking driver in the top-level Makefile, there is just one "USB_NETWORKING" that guards everything. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains updates for your net-next tree, they are: 1) Use kvfree() helper function from x_tables, from Eric Dumazet. 2) Remove extra timer from the conntrack ecache extension, use a workqueue instead to redeliver lost events to userspace instead, from Florian Westphal. 3) Removal of the ulog targets for ebtables and iptables. The nflog infrastructure superseded this almost 9 years ago, time to get rid of this code. 4) Replace the list of loggers by an array now that we can only have two possible non-overlapping logger flavours, ie. kernel ring buffer and netlink logging. 5) Move Eric Dumazet's log buffer code to nf_log to reuse it from all of the supported per-family loggers. 6) Consolidate nf_log_packet() as an unified interface for packet logging. After this patch, if the struct nf_loginfo is available, it explicitly selects the logger that is used. 7) Move ip and ip6 logging code from xt_LOG to the corresponding per-family loggers. Thus, x_tables and nf_tables share the same code for packet logging. 8) Add generic ARP packet logger, which is used by nf_tables. The format aims to be consistent with the output of xt_LOG. 9) Add generic bridge packet logger. Again, this is used by nf_tables and it routes the packets to the real family loggers. As a result, we get consistent logging format for the bridge family. The ebt_log logging code has been intentionally left in place not to break backward compatibility since the logging output differs from xt_LOG. 10) Update nft_log to explicitly request the required family logger when needed. 11) Finish nft_log so it supports arp, ip, ip6, bridge and inet families. Allowing selection between netlink and kernel buffer ring logging. 12) Several fixes coming after the netfilter core logging changes spotted by robots. 13) Use IS_ENABLED() macros whenever possible in the netfilter tree, from Duan Jiong. 14) Removal of a couple of unnecessary branch before kfree, from Fabian Frederick. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-17ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwardingAlex Gartrell
Previously, only the four high bits of the tclass were maintained in the ipv6 case. This matches the behavior of ipv4, though whether or not we should reflect ECN bits may be up for debate. Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-07-16ipvs: Remove dead debug codeYannick Brosseau
This code section cannot compile as it refer to non existing variable It also pre-date git history. Signed-off-by: Yannick Brosseau <scientist@fb.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-07-16ipvs: remove null test before kfreeFabian Frederick
Fix checkpatch warning: WARNING: kfree(NULL) is safe this check is probably not required Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-07-16ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrackJulian Anastasov
commit 8f4e0a18682d91 ("IPVS netns exit causes crash in conntrack") added second ip_vs_conn_drop_conntrack call instead of just adding the needed check. As result, the first call still can cause crash on netns exit. Remove it. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-06-16Merge branch 'ipvs'Pablo Neira Ayuso
Simon Horman says: ==================== Fix for panic due use of tot_stats estimator outside of CONFIG_SYSCTL It has been present since v3.6.39. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-06-13ipvs: stop tot_stats estimator only under CONFIG_SYSCTLJulian Anastasov
The tot_stats estimator is started only when CONFIG_SYSCTL is defined. But it is stopped without checking CONFIG_SYSCTL. Fix the crash by moving ip_vs_stop_estimator into ip_vs_control_net_cleanup_sysctl. The change is needed after commit 14e405461e664b ("IPVS: Add __ip_vs_control_{init,cleanup}_sysctl()") from 2.6.39. Reported-by: Jet Chen <jet.chen@intel.com> Tested-by: Jet Chen <jet.chen@intel.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: include/net/inetpeer.h net/ipv6/output_core.c Changes in net were fixing bugs in code removed in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02inetpeer: get rid of ip_id_countEric Dumazet
Ideally, we would need to generate IP ID using a per destination IP generator. linux kernels used inet_peer cache for this purpose, but this had a huge cost on servers disabling MTU discovery. 1) each inet_peer struct consumes 192 bytes 2) inetpeer cache uses a binary tree of inet_peer structs, with a nominal size of ~66000 elements under load. 3) lookups in this tree are hitting a lot of cache lines, as tree depth is about 20. 4) If server deals with many tcp flows, we have a high probability of not finding the inet_peer, allocating a fresh one, inserting it in the tree with same initial ip_id_count, (cf secure_ip_id()) 5) We garbage collect inet_peer aggressively. IP ID generation do not have to be 'perfect' Goal is trying to avoid duplicates in a short period of time, so that reassembly units have a chance to complete reassembly of fragments belonging to one message before receiving other fragments with a recycled ID. We simply use an array of generators, and a Jenkin hash using the dst IP as a key. ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it belongs (it is only used from this file) secure_ip_id() and secure_ipv6_id() no longer are needed. Rename ip_select_ident_more() to ip_select_ident_segs() to avoid unnecessary decrement/increment of the number of segments. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next This small patchset contains three accumulated Netfilter/IPVS updates, they are: 1) Refactorize common NAT code by encapsulating it into a helper function, similarly to what we do in other conntrack extensions, from Florian Westphal. 2) A minor format string mismatch fix for IPVS, from Masanari Iida. 3) Add quota support to the netfilter accounting infrastructure, now you can add quotas to accounting objects via the nfnetlink interface and use them from iptables. You can also listen to quota notifications from userspace. This enhancement from Mathieu Poirier. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-26ipvs: Fix panic due to non-linear skbPeter Christensen
Receiving a ICMP response to an IPIP packet in a non-linear skb could cause a kernel panic in __skb_pull. The problem was introduced in commit f2edb9f7706dcb2c0d9a362b2ba849efe3a97f5e ("ipvs: implement passive PMTUD for IPIP packets"). Signed-off-by: Peter Christensen <pch@ordbogen.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-05-12net: rename local_df to ignore_dfWANG Cong
As suggested by several people, rename local_df to ignore_df, since it means "ignore df bit if it is set". Cc: Maciej Żenczykowski <maze@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-23netfilter: Fix format string mismatch in ip_vs_proto_name()Masanari Iida
Fix string mismatch in ip_vs_proto_name() Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-17Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next, most relevantly they are: * cleanup to remove double semicolon from stephen hemminger. * calm down sparse warning in xt_ipcomp, from Fan Du. * nf_ct_labels support for nf_tables, from Florian Westphal. * new macros to simplify rcu dereferences in the scope of nfnetlink and nf_tables, from Patrick McHardy. * Accept queue and drop (including reason for drop) to verdict parsing in nf_tables, also from Patrick. * Remove unused random seed initialization in nfnetlink_log, from Florian Westphal. * Allow to attach user-specific information to nf_tables rules, useful to attach user comments to rule, from me. * Return errors in ipset according to the manpage documentation, from Jozsef Kadlecsik. * Fix coccinelle warnings related to incorrect bool type usage for ipset, from Fengguang Wu. * Add hash:ip,mark set type to ipset, from Vytas Dauksa. * Fix message for each spotted by ipset for each netns that is created, from Ilia Mirkin. * Add forceadd option to ipset, which evicts a random entry from the set if it becomes full, from Josh Hunt. * Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu. * Improve conntrack scalability by removing a central spinlock, original work from Eric Dumazet. Jesper Dangaard Brouer took them over to address remaining issues. Several patches to prepare this change come in first place. * Rework nft_hash to resolve bugs (leaking chain, missing rcu synchronization on element removal, etc. from Patrick McHardy. * Restore context in the rule deletion path, as we now release rule objects synchronously, from Patrick McHardy. This gets back event notification for anonymous sets. * Fix NAT family validation in nft_nat, also from Patrick. * Improve scalability of xt_connlimit by using an array of spinlocks and by introducing a rb-tree of hashtables for faster lookup of accounted objects per network. This patch was preceded by several patches and refactorizations to accomodate this change including the use of kmem_cache, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irqEric W. Biederman
Replace the bh safe variant with the hard irq safe variant. We need a hard irq safe variant to deal with netpoll transmitting packets from hard irq context, and we need it in most if not all of the places using the bh safe variant. Except on 32bit uni-processor the code is exactly the same so don't bother with a bh variant, just have a hard irq safe variant that everyone can use. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-07ipvs: Reduce checkpatch noise in ip_vs_lblc.cTingwei Liu
Add whitespace after operator and put open brace { on the previous line Cc: Tingwei Liu <liutingwei@hisense.com> Cc: lvs-devel@vger.kernel.org Signed-off-by: Tingwei Liu <tingw.liu@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-03-07sections, ipvs: Remove useless __read_mostly for ipvs genl_opsAndi Kleen
const __read_mostly does not make any sense, because const data is already read-only. Remove the __read_mostly for the ipvs genl_ops. This avoids a LTO section conflict compile problem. Cc: Wensong Zhang <wensong@linux-vs.org> Cc: Simon Horman <horms@verge.net.au> Cc: Patrick McHardy <kaber@trash.net> Cc: lvs-devel@vger.kernel.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-02-04ipvs: fix AF assignment in ip_vs_conn_new()Michal Kubecek
If a fwmark is passed to ip_vs_conn_new(), it is passed in vaddr, not daddr. Therefore we should set AF to AF_UNSPEC in vaddr assignment (like we do in ip_vs_ct_in_get()), otherwise we may copy only first 4 bytes of an IPv6 address into cp->daddr. Signed-off-by: Bogdano Arendartchuk <barendartchuk@suse.com> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2014-01-14net: replace macros net_random and net_srandom with direct calls to prandomAruna-Hewapathirane
This patch removes the net_random and net_srandom macros and replaces them with direct calls to the prandom ones. As new commits only seem to use prandom_u32 there is no use to keep them around. This change makes it easier to grep for users of prandom_u32. Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-27ipvs: correct usage/allocation of seqadj ext in ipvsJesper Dangaard Brouer
The IPVS FTP helper ip_vs_ftp could trigger an OOPS in nf_ct_seqadj_set, after commit 41d73ec053d2 (netfilter: nf_conntrack: make sequence number adjustments usuable without NAT). This is because, the seqadj ext is now allocated dynamically, and the IPVS code didn't handle this situation. Fix this in the IPVS nfct code by invoking the alloc function nfct_seqadj_ext_add(). Fixes: 41d73ec053d2 (netfilter: nf_conntrack: make sequence number adjustments usuable without NAT) Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-12-27ipvs: Remove unused variable ret from sync_thread_master()Geert Uytterhoeven
net/netfilter/ipvs/ip_vs_sync.c: In function 'sync_thread_master': net/netfilter/ipvs/ip_vs_sync.c:1640:8: warning: unused variable 'ret' [-Wunused-variable] Commit 35a2af94c7ce7130ca292c68b1d27fcfdb648f6b ("sched/wait: Make the __wait_event*() interface more friendly") changed how the interruption state is returned. However, sync_thread_master() ignores this state, now causing a compile warning. According to Julian Anastasov <ja@ssi.bg>, this behavior is OK: "Yes, your patch looks ok to me. In the past we used ssleep() but IPVS users were confused why IPVS threads increase the load average. So, we switched to _interruptible calls and later the socket polling was added." Document this, as requested by Peter Zijlstra, to avoid precious developers disappearing in this pitfall in the future. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-12-06netfilter: Fix FSF address in file headersJeff Kirsher
Several files refer to an old address for the Free Software Foundation in the file header comment. Resolve by replacing the address with the URL <http://www.gnu.org/licenses/> so that we do not have to keep updating the header comments anytime the address changes. CC: netfilter@vger.kernel.org CC: Pablo Neira Ayuso <pablo@netfilter.org> CC: Patrick McHardy <kaber@trash.net> CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Mostly these are fixes for fallout due to merge window changes, as well as cures for problems that have been with us for a much longer period of time" 1) Johannes Berg noticed two major deficiencies in our genetlink registration. Some genetlink protocols we passing in constant counts for their ops array rather than something like ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were using fixed IDs for their multicast groups. We have to retain these fixed IDs to keep existing userland tools working, but reserve them so that other multicast groups used by other protocols can not possibly conflict. In dealing with these two problems, we actually now use less state management for genetlink operations and multicast groups. 2) When configuring interface hardware timestamping, fix several drivers that simply do not validate that the hwtstamp_config value is one the driver actually supports. From Ben Hutchings. 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar. 4) In dev_forward_skb(), set the skb->protocol in the right order relative to skb_scrub_packet(). From Alexei Starovoitov. 5) Bridge erroneously fails to use the proper wrapper functions to make calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki Makita. 6) When detaching a bridge port, make sure to flush all VLAN IDs to prevent them from leaking, also from Toshiaki Makita. 7) Put in a compromise for TCP Small Queues so that deep queued devices that delay TX reclaim non-trivially don't have such a performance decrease. One particularly problematic area is 802.11 AMPDU in wireless. From Eric Dumazet. 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts here. Fix from Eric Dumzaet, reported by Dave Jones. 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn. 10) When computing mergeable buffer sizes, virtio-net fails to take the virtio-net header into account. From Michael Dalton. 11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic bumping, this one has been with us for a while. From Eric Dumazet. 12) Fix NULL deref in the new TIPC fragmentation handling, from Erik Hugne. 13) 6lowpan bit used for traffic classification was wrong, from Jukka Rissanen. 14) macvlan has the same issue as normal vlans did wrt. propagating LRO disabling down to the real device, fix it the same way. From Michal Kubecek. 15) CPSW driver needs to soft reset all slaves during suspend, from Daniel Mack. 16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet. 17) The xen-netfront RX buffer refill timer isn't properly scheduled on partial RX allocation success, from Ma JieYue. 18) When ipv6 ping protocol support was added, the AF_INET6 protocol initialization cleanup path on failure was borked a little. Fix from Vlad Yasevich. 19) If a socket disconnects during a read/recvmsg/recvfrom/etc that blocks we can do the wrong thing with the msg_name we write back to userspace. From Hannes Frederic Sowa. There is another fix in the works from Hannes which will prevent future problems of this nature. 20) Fix route leak in VTI tunnel transmit, from Fan Du. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) genetlink: make multicast groups const, prevent abuse genetlink: pass family to functions using groups genetlink: add and use genl_set_err() genetlink: remove family pointer from genl_multicast_group genetlink: remove genl_unregister_mc_group() hsr: don't call genl_unregister_mc_group() quota/genetlink: use proper genetlink multicast APIs drop_monitor/genetlink: use proper genetlink multicast APIs genetlink: only pass array to genl_register_family_with_ops() tcp: don't update snd_nxt, when a socket is switched from repair mode atm: idt77252: fix dev refcnt leak xfrm: Release dst if this dst is improper for vti tunnel netlink: fix documentation typo in netlink_set_err() be2net: Delete secondary unicast MAC addresses during be_close be2net: Fix unconditional enabling of Rx interface options net, virtio_net: replace the magic value ping: prevent NULL pointer dereference on write to msg_name bnx2x: Prevent "timeout waiting for state X" bnx2x: prevent CFC attention bnx2x: Prevent panic during DMAE timeout ...
2013-11-19genetlink: only pass array to genl_register_family_with_ops()Johannes Berg
As suggested by David Miller, make genl_register_family_with_ops() a macro and pass only the array, evaluating ARRAY_SIZE() in the macro, this is a little safer. The openvswitch has some indirection, assing ops/n_ops directly in that code. This might ultimately just assign the pointers in the family initializations, saving the struct genl_family_and_ops and code (once mcast groups are handled differently.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14genetlink: make all genl_ops users constJohannes Berg
Now that genl_ops are no longer modified in place when registering, they can be made const. This patch was done mostly with spatch: @@ identifier ops; @@ +const struct genl_ops ops[] = { ... }; (except the struct thing in net/openvswitch/datapath.c) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking changes from Ingo Molnar: "The biggest changes: - add lockdep support for seqcount/seqlocks structures, this unearthed both bugs and required extra annotation. - move the various kernel locking primitives to the new kernel/locking/ directory" * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) block: Use u64_stats_init() to initialize seqcounts locking/lockdep: Mark __lockdep_count_forward_deps() as static lockdep/proc: Fix lock-time avg computation locking/doc: Update references to kernel/mutex.c ipv6: Fix possible ipv6 seqlock deadlock cpuset: Fix potential deadlock w/ set_mems_allowed seqcount: Add lockdep functionality to seqcount/seqlock structures net: Explicitly initialize u64_stats_sync structures for lockdep locking: Move the percpu-rwsem code to kernel/locking/ locking: Move the lglocks code to kernel/locking/ locking: Move the rwsem code to kernel/locking/ locking: Move the rtmutex code to kernel/locking/ locking: Move the semaphore core to kernel/locking/ locking: Move the spinlock code to kernel/locking/ locking: Move the lockdep code to kernel/locking/ locking: Move the mutex code to kernel/locking/ hung_task debugging: Add tracepoint to report the hang x86/locking/kconfig: Update paravirt spinlock Kconfig description lockstat: Report avg wait and hold times lockdep, x86/alternatives: Drop ancient lockdep fixup message ...
2013-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) The addition of nftables. No longer will we need protocol aware firewall filtering modules, it can all live in userspace. At the core of nftables is a, for lack of a better term, virtual machine that executes byte codes to inspect packet or metadata (arriving interface index, etc.) and make verdict decisions. Besides support for loading packet contents and comparing them, the interpreter supports lookups in various datastructures as fundamental operations. For example sets are supports, and therefore one could create a set of whitelist IP address entries which have ACCEPT verdicts attached to them, and use the appropriate byte codes to do such lookups. Since the interpreted code is composed in userspace, userspace can do things like optimize things before giving it to the kernel. Another major improvement is the capability of atomically updating portions of the ruleset. In the existing netfilter implementation, one has to update the entire rule set in order to make a change and this is very expensive. Userspace tools exist to create nftables rules using existing netfilter rule sets, but both kernel implementations will need to co-exist for quite some time as we transition from the old to the new stuff. Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have worked so hard on this. 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements to our pseudo-random number generator, mostly used for things like UDP port randomization and netfitler, amongst other things. In particular the taus88 generater is updated to taus113, and test cases are added. 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet and Yang Yingliang. 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin Sujir. 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet, Neal Cardwell, and Yuchung Cheng. 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary control message data, much like other socket option attributes. From Francesco Fusco. 7) Allow applications to specify a cap on the rate computed automatically by the kernel for pacing flows, via a new SO_MAX_PACING_RATE socket option. From Eric Dumazet. 8) Make the initial autotuned send buffer sizing in TCP more closely reflect actual needs, from Eric Dumazet. 9) Currently early socket demux only happens for TCP sockets, but we can do it for connected UDP sockets too. Implementation from Shawn Bohrer. 10) Refactor inet socket demux with the goal of improving hash demux performance for listening sockets. With the main goals being able to use RCU lookups on even request sockets, and eliminating the listening lock contention. From Eric Dumazet. 11) The bonding layer has many demuxes in it's fast path, and an RCU conversion was started back in 3.11, several changes here extend the RCU usage to even more locations. From Ding Tianhong and Wang Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav Falico. 12) Allow stackability of segmentation offloads to, in particular, allow segmentation offloading over tunnels. From Eric Dumazet. 13) Significantly improve the handling of secret keys we input into the various hash functions in the inet hashtables, TCP fast open, as well as syncookies. From Hannes Frederic Sowa. The key fundamental operation is "net_get_random_once()" which uses static keys. Hannes even extended this to ipv4/ipv6 fragmentation handling and our generic flow dissector. 14) The generic driver layer takes care now to set the driver data to NULL on device removal, so it's no longer necessary for drivers to explicitly set it to NULL any more. Many drivers have been cleaned up in this way, from Jingoo Han. 15) Add a BPF based packet scheduler classifier, from Daniel Borkmann. 16) Improve CRC32 interfaces and generic SKB checksum iterators so that SCTP's checksumming can more cleanly be handled. Also from Daniel Borkmann. 17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces using the interface MTU value. This helps avoid PMTU attacks, particularly on DNS servers. From Hannes Frederic Sowa. 18) Use generic XPS for transmit queue steering rather than internal (re-)implementation in virtio-net. From Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) random32: add test cases for taus113 implementation random32: upgrade taus88 generator to taus113 from errata paper random32: move rnd_state to linux/random.h random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized random32: add periodic reseeding random32: fix off-by-one in seeding requirement PHY: Add RTL8201CP phy_driver to realtek xtsonic: add missing platform_set_drvdata() in xtsonic_probe() macmace: add missing platform_set_drvdata() in mace_probe() ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe() ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh vlan: Implement vlan_dev_get_egress_qos_mask as an inline. ixgbe: add warning when max_vfs is out of range. igb: Update link modes display in ethtool netfilter: push reasm skb through instead of original frag skbs ip6_output: fragment outgoing reassembled skb properly MAINTAINERS: mv643xx_eth: take over maintainership from Lennart net_sched: tbf: support of 64bit rates ixgbe: deleting dfwd stations out of order can cause null ptr deref ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS ...
2013-11-11netfilter: push reasm skb through instead of original frag skbsJiri Pirko
Pushing original fragments through causes several problems. For example for matching, frags may not be matched correctly. Take following example: <example> On HOSTA do: ip6tables -I INPUT -p icmpv6 -j DROP ip6tables -I INPUT -p icmpv6 -m icmp6 --icmpv6-type 128 -j ACCEPT and on HOSTB you do: ping6 HOSTA -s2000 (MTU is 1500) Incoming echo requests will be filtered out on HOSTA. This issue does not occur with smaller packets than MTU (where fragmentation does not happen) </example> As was discussed previously, the only correct solution seems to be to use reassembled skb instead of separete frags. Doing this has positive side effects in reducing sk_buff by one pointer (nfct_reasm) and also the reams dances in ipvs and conntrack can be removed. Future plan is to remove net/ipv6/netfilter/nf_conntrack_reasm.c entirely and use code in net/ipv6/reassembly.c instead. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-06net: Explicitly initialize u64_stats_sync structures for lockdepJohn Stultz
In order to enable lockdep on seqcount/seqlock structures, we must explicitly initialize any locks. The u64_stats_sync structure, uses a seqcount, and thus we need to introduce a u64_stats_init() function and use it to initialize the structure. This unfortunately adds a lot of fairly trivial initialization code to a number of drivers. But the benefit of ensuring correctness makes this worth while. Because these changes are required for lockdep to be enabled, and the changes are quite trivial, I've not yet split this patch out into 30-some separate patches, as I figured it would be better to get the various maintainers thoughts on how to best merge this change along with the seqcount lockdep enablement. Feedback would be appreciated! Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Jesse Gross <jesse@nicira.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Roger Luethi <rl@hellgate.ch> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Simon Horman <horms@verge.net.au> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Wensong Zhang <wensong@linux-vs.org> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== This is another batch containing Netfilter/IPVS updates for your net-next tree, they are: * Six patches to make the ipt_CLUSTERIP target support netnamespace, from Gao feng. * Two cleanups for the nf_conntrack_acct infrastructure, introducing a new structure to encapsulate conntrack counters, from Holger Eitzenberger. * Fix missing verdict in SCTP support for IPVS, from Daniel Borkmann. * Skip checksum recalculation in SCTP support for IPVS, also from Daniel Borkmann. * Fix behavioural change in xt_socket after IP early demux, from Florian Westphal. * Fix bogus large memory allocation in the bitmap port set type in ipset, from Jozsef Kadlecsik. * Fix possible compilation issues in the hash netnet set type in ipset, also from Jozsef Kadlecsik. * Define constants to identify netlink callback data in ipset dumps, again from Jozsef Kadlecsik. * Use sock_gen_put() in xt_socket to replace xt_socket_put_sk, from Eric Dumazet. * Improvements for the SH scheduler in IPVS, from Alexander Frolkin. * Remove extra delay due to unneeded rcu barrier in IPVS net namespace cleanup path, from Julian Anastasov. * Save some cycles in ip6t_REJECT by skipping checksum validation in packets leaving from our stack, from Stanislav Fomichev. * Fix IPVS_CMD_ATTR_MAX definition in IPVS, larger that required, from Julian Anastasov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-30net: ipvs: sctp: do not recalc sctp csum when ports didn't changeDaniel Borkmann
Unlike UDP or TCP, we do not take the pseudo-header into account in SCTP checksums. So in case port mapping is the very same, we do not need to recalculate the whole SCTP checksum in software, which is very expensive. Also, similarly as in TCP, take into account when a private helper mangled the packet. In that case, we also need to recalculate the checksum even if ports might be same. Thanks for feedback regarding skb->ip_summed checks from Julian Anastasov; here's a discussion on these checks for snat and dnat: * For snat_handler(), we can see CHECKSUM_PARTIAL from virtual devices, and from LOCAL_OUT, otherwise it should be CHECKSUM_UNNECESSARY. In general, in snat it is more complex. skb contains the original route and ip_vs_route_me_harder() can change the route after snat_handler. So, for locally generated replies from local server we can not preserve the CHECKSUM_PARTIAL mode. It is an chicken or egg dilemma: snat_handler needs the device after rerouting (to check for NETIF_F_SCTP_CSUM), while ip_route_me_harder() wants the snat_handler() to put the new saddr for proper rerouting. * For dnat_handler(), we should not see CHECKSUM_COMPLETE for SCTP, in fact the small set of drivers that support SCTP offloading return CHECKSUM_UNNECESSARY on correctly received SCTP csum. We can see CHECKSUM_PARTIAL from local stack or received from virtual drivers. The idea is that SCTP decides to avoid csum calculation if hardware supports offloading. IPVS can change the device after rerouting to real server but we can preserve the CHECKSUM_PARTIAL mode if the new device supports offloading too. This works because skb dst is changed before dnat_handler and we see the new device. So, checks in the 'if' part will decide whether it is ok to keep CHECKSUM_PARTIAL for the output. If the packet was with CHECKSUM_NONE, hence we deal with unknown checksum. As we recalculate the sum for IP header in all cases, it should be safe to use CHECKSUM_UNNECESSARY. We can forward wrong checksum in this case (without cp->app). In case of CHECKSUM_UNNECESSARY, the csum was valid on receive. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-28net: ipvs: sctp: add missing verdict assignments in sctp_conn_scheduleDaniel Borkmann
If skb_header_pointer() fails, we need to assign a verdict, that is NF_DROP in this case, otherwise, we would leave the verdict from conn_schedule() uninitialized when returning. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15ipvs: improved SH fallback strategyAlexander Frolkin
Improve the SH fallback realserver selection strategy. With sh and sh-fallback, if a realserver is down, this attempts to distribute the traffic that would have gone to that server evenly among the remaining servers. Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-15ipvs: avoid rcu_barrier during netns cleanupJulian Anastasov
commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added rcu_barrier() on cleanup to wait dest users and schedulers like LBLC and LBLCR to put their last dest reference. Using rcu_barrier with many namespaces is problematic. Trying to fix it by freeing dest with kfree_rcu is not a solution, RCU callbacks can run in parallel and execution order is random. Fix it by creating new function ip_vs_dest_put_and_free() which is heavier than ip_vs_dest_put(). We will use it just for schedulers like LBLC, LBLCR that can delay their dest release. By default, dests reference is above 0 if they are present in service and it is 0 when deleted but still in trash list. Change the dest trash code to use ip_vs_dest_put_and_free(), so that refcnt -1 can be used for freeing. As result, such checks remain in slow path and the rcu_barrier() from netns cleanup can be removed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-10-14netfilter: pass hook ops to hookfnPatrick McHardy
Pass the hook ops to the hookfn to allow for generic hook functions. This change is required by nf_tables. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-10-09Merge tag 'v3.12-rc4' into sched/coreIngo Molnar
Merge Linux v3.12-rc4 to fix a conflict and also to refresh the tree before applying more scheduler patches. Conflicts: arch/avr32/include/asm/Kbuild Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-04sched/wait: Make the __wait_event*() interface more friendlyPeter Zijlstra
Change all __wait_event*() implementations to match the corresponding wait_event*() signature for convenience. In particular this does away with the weird 'ret' logic. Since there are __wait_event*() users this requires we update them too. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131002092529.042563462@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-01Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== The following patchset contains Netfilter/IPVS fixes for your net tree, they are: * Fix BUG_ON splat due to malformed TCP packets seen by synproxy, from Patrick McHardy. * Fix possible weight overflow in lblc and lblcr schedulers due to 32-bits arithmetics, from Simon Kirby. * Fix possible memory access race in the lblc and lblcr schedulers, introduced when it was converted to use RCU, two patches from Julian Anastasov. * Fix hard dependency on CPU 0 when reading per-cpu stats in the rate estimator, from Julian Anastasov. * Fix race that may lead to object use after release, when invoking ipvsadm -C && ipvsadm -R, introduced when adding RCU, from Julian Anastasov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-19ip: generate unique IP identificator if local fragmentation is allowedAnsis Atteka
If local fragmentation is allowed, then ip_select_ident() and ip_select_ident_more() need to generate unique IDs to ensure correct defragmentation on the peer. For example, if IPsec (tunnel mode) has to encrypt large skbs that have local_df bit set, then all IP fragments that belonged to different ESP datagrams would have used the same identificator. If one of these IP fragments would get lost or reordered, then peer could possibly stitch together wrong IP fragments that did not belong to the same datagram. This would lead to a packet loss or data corruption. Signed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-18ipvs: stats should not depend on CPU 0Julian Anastasov
When reading percpu stats we need to properly reset the sum when CPU 0 is not present in the possible mask. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-09-18ipvs: do not use dest after ip_vs_dest_put in LBLCRJulian Anastasov
commit c5549571f975ab ("ipvs: convert lblcr scheduler to rcu") allows RCU readers to use dest after calling ip_vs_dest_put(). In the corner case it can race with ip_vs_dest_trash_expire() which can release the dest while it is being returned to the RCU readers as scheduling result. To fix the problem do not allow e->dest to be replaced and defer the ip_vs_dest_put() call by using RCU callback. Now e->dest does not need to be RCU pointer. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-09-18ipvs: do not use dest after ip_vs_dest_put in LBLCJulian Anastasov
commit c2a4ffb70eef39 ("ipvs: convert lblc scheduler to rcu") allows RCU readers to use dest after calling ip_vs_dest_put(). In the corner case it can race with ip_vs_dest_trash_expire() which can release the dest while it is being returned to the RCU readers as scheduling result. To fix the problem do not allow en->dest to be replaced and defer the ip_vs_dest_put() call by using RCU callback. Now en->dest does not need to be RCU pointer. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-09-18ipvs: make the service replacement more robustJulian Anastasov
commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added IP_VS_DEST_STATE_REMOVING flag and RCU callback named ip_vs_dest_wait_readers() to keep dests and services after removal for at least a RCU grace period. But we have the following corner cases: - we can not reuse the same dest if its service is removed while IP_VS_DEST_STATE_REMOVING is still set because another dest removal in the first grace period can not extend this period. It can happen when ipvsadm -C && ipvsadm -R is used. - dest->svc can be replaced but ip_vs_in_stats() and ip_vs_out_stats() have no explicit read memory barriers when accessing dest->svc. It can happen that dest->svc was just freed (replaced) while we use it to update the stats. We solve the problems as follows: - IP_VS_DEST_STATE_REMOVING is removed and we ensure a fixed idle period for the dest (IP_VS_DEST_TRASH_PERIOD). idle_start will remember when for first time after deletion we noticed dest->refcnt=0. Later, the connections can grab a reference while in RCU grace period but if refcnt becomes 0 we can safely free the dest and its svc. - dest->svc becomes RCU pointer. As result, we add explicit RCU locking in ip_vs_in_stats() and ip_vs_out_stats(). - __ip_vs_unbind_svc is renamed to __ip_vs_svc_put(), it now can free the service immediately or after a RCU grace period. dest->svc is not set to NULL anymore. As result, unlinked dests and their services are freed always after IP_VS_DEST_TRASH_PERIOD period, unused services are freed after a RCU grace period. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2013-09-18ipvs: fix overflow on dest weight multiplySimon Kirby
Schedulers such as lblc and lblcr require the weight to be as high as the maximum number of active connections. In commit b552f7e3a9524abcbcdf ("ipvs: unify the formula to estimate the overhead of processing connections"), the consideration of inactconns and activeconns was cleaned up to always count activeconns as 256 times more important than inactconns. In cases where 3000 or more connections are expected, a weight of 3000 * 256 * 3000 connections overflows the 32-bit signed result used to determine if rescheduling is required. On amd64, this merely changes the multiply and comparison instructions to 64-bit. On x86, a 64-bit result is already present from imull, so only a few more comparison instructions are emitted. Signed-off-by: Simon Kirby <sim@hostway.ca> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>