summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-05-08IB/core: Add raw packet QP typeOr Gerlitz
IB_QPT_RAW_PACKET allows applications to build a complete packet, including L2 headers, when sending; on the receive side, the HW will not strip any headers. This QP type is designed for userspace direct access to Ethernet; for example by applications that do TCP/IP themselves. Only processes with the NET_RAW capability are allowed to create raw packet QPs (the name "raw packet QP" is supposed to suggest an analogy to AF_PACKET / SOL_RAW sockets). Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Fix build with IPV6=nRoland Dreier
When IPV6 is not enabled: ERROR: "register_inet6addr_notifier" [drivers/infiniband/hw/ocrdma/ocrdma.ko] undefined! ERROR: "unregister_inet6addr_notifier" [drivers/infiniband/hw/ocrdma/ocrdma.ko] undefined! Fix this by wrapping the inet6 calls in #ifdef IPV6. Also make the ocrdma module depend on (IPV6 || IPV6=n) to forbid the case of modular ipv6 but built-in ocrdma (which can't work, because ocrdma calls ipv6 functions). Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Tiny locking cleanupDan Carpenter
We only need to disable the IRQs one time. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> [ Rename "wq_flags" to more conventional "flags." - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Fix check for NULL instead of IS_ERRDan Carpenter
The ocrdma_alloc_lkey() function never returns NULL pointers -- it returns ERR_PTRs. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Don't sleep in atomic notifier handlerSasha Levin
Events sent to ocrdma_inet6addr_event() are sent from an atomic context, therefore we can't try to lock a mutex within the notifier callback. We could just switch the mutex to a spinlock since all it does it protect a list, but I've gone ahead and switched the list to use RCU instead. I couldn't fully test it since I don't have IB hardware, so if it doesn't fully work for some reason let me know and I'll switch it back to using a spinlock. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> [ Fixed locking in ocrdma_add(). - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Remove write-only variablesRoland Dreier
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Set event's device member in ocrdma_dispatch_ibevent()Roland Dreier
We need to set ib_evt.device, or else ib_dispatch_event() will crash when we call it for unaffiliated events (and consumers may get confused in their QP/CQ/SRQ event handler for affiliated events). Also fix sparse warning: drivers/infiniband/hw/ocrdma/ocrdma_hw.c:678:36: warning: Using plain integer as NULL pointer There's no need to clear ib_evt, since every member is initialized. Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Make needlessly global functions/structs staticRoland Dreier
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Fix warnings about uninitialized variablesRoland Dreier
First, fix drivers/infiniband/hw/ocrdma/ocrdma_verbs.c: In function 'ocrdma_alloc_pd': drivers/infiniband/hw/ocrdma/ocrdma_verbs.c:371:17: warning: 'dpp_page_addr' may be used uninitialized in this function [-Wuninitialized] drivers/infiniband/hw/ocrdma/ocrdma_verbs.c:337:6: note: 'dpp_page_addr' was declared here which seems that it may border on a bug (the call to ocrdma_del_mmap() might conceivably do bad things if pd->dpp_enabled is not set and dpp_page_addr ends up with just the wrong value). Also take care of: drivers/infiniband/hw/ocrdma/ocrdma_hw.c: In function 'ocrdma_init_hw': drivers/infiniband/hw/ocrdma/ocrdma_hw.c:2587:5: warning: 'status' may be used uninitialized in this function [-Wuninitialized] drivers/infiniband/hw/ocrdma/ocrdma_hw.c:2549:17: note: 'status' was declared here which is only real if num_eq == 0, which should be impossible. Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/ocrdma: Add driver for Emulex OneConnect IBoE RDMA adapterParav Pandit
Signed-off-by: Parav Pandit <parav.pandit@emulex.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08be2net: Add functionality to support RoCE driverParav Pandit
- Increase MSI-X vectors by 5 for RoCE traffic. - Add macro to check roce support on a device. - Add device-specific doorbell and MSI-X vector fields shared with NIC functionality. - Provide RoCE driver registration and deregistration functions. - Add support functions which will be invoked on adapter add/remove and port up/down events. - Traverse through the list of adapters to invoke callback functions. Signed-off-by: Parav Pandit <parav.pandit@emulex.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08be2net: Add function to issue mailbox cmd on MQParav Pandit
- Add generic function to issue mailbox cmd on MQ as export function. - RoCE driver will use this before it setups its own MQ. Signed-off-by: Parav Pandit <parav.pandit@emulex.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/cma: Fix lockdep false positive recursive lockingSean Hefty
The following lockdep problem was reported by Or Gerlitz <ogerlitz@mellanox.com>: [ INFO: possible recursive locking detected ] 3.3.0-32035-g1b2649e-dirty #4 Not tainted --------------------------------------------- kworker/5:1/418 is trying to acquire lock: (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0138a41>] rdma_destroy_i d+0x33/0x1f0 [rdma_cm] but task is already holding lock: (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0135130>] cma_disable_ca llback+0x24/0x45 [rdma_cm] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&id_priv->handler_mutex); lock(&id_priv->handler_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/5:1/418: #0: (ib_cm){.+.+.+}, at: [<ffffffff81042ac1>] process_one_work+0x210/0x4a 6 #1: ((&(&work->work)->work)){+.+.+.}, at: [<ffffffff81042ac1>] process_on e_work+0x210/0x4a6 #2: (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0135130>] cma_disab le_callback+0x24/0x45 [rdma_cm] stack backtrace: Pid: 418, comm: kworker/5:1 Not tainted 3.3.0-32035-g1b2649e-dirty #4 Call Trace: [<ffffffff8102b0fb>] ? console_unlock+0x1f4/0x204 [<ffffffff81068771>] __lock_acquire+0x16b5/0x174e [<ffffffff8106461f>] ? save_trace+0x3f/0xb3 [<ffffffff810688fa>] lock_acquire+0xf0/0x116 [<ffffffffa0138a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm] [<ffffffff81364351>] mutex_lock_nested+0x64/0x2ce [<ffffffffa0138a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm] [<ffffffff81065a78>] ? trace_hardirqs_on_caller+0x11e/0x155 [<ffffffff81065abc>] ? trace_hardirqs_on+0xd/0xf [<ffffffffa0138a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm] [<ffffffffa0139c02>] cma_req_handler+0x418/0x644 [rdma_cm] [<ffffffffa012ee88>] cm_process_work+0x32/0x119 [ib_cm] [<ffffffffa0130299>] cm_req_handler+0x928/0x982 [ib_cm] [<ffffffffa01302f3>] ? cm_req_handler+0x982/0x982 [ib_cm] [<ffffffffa0130326>] cm_work_handler+0x33/0xfe5 [ib_cm] [<ffffffff81065a78>] ? trace_hardirqs_on_caller+0x11e/0x155 [<ffffffffa01302f3>] ? cm_req_handler+0x982/0x982 [ib_cm] [<ffffffff81042b6e>] process_one_work+0x2bd/0x4a6 [<ffffffff81042ac1>] ? process_one_work+0x210/0x4a6 [<ffffffff813669f3>] ? _raw_spin_unlock_irq+0x2b/0x40 [<ffffffff8104316e>] worker_thread+0x1d6/0x350 [<ffffffff81042f98>] ? rescuer_thread+0x241/0x241 [<ffffffff81046a32>] kthread+0x84/0x8c [<ffffffff8136e854>] kernel_thread_helper+0x4/0x10 [<ffffffff81366d59>] ? retint_restore_args+0xe/0xe [<ffffffff810469ae>] ? __init_kthread_worker+0x56/0x56 [<ffffffff8136e850>] ? gs_change+0xb/0xb The actual locking is fine, since we're dealing with different locks, but from the same lock class. cma_disable_callback() acquires the listening id mutex, whereas rdma_destroy_id() acquires the mutex for the new connection id. To fix this, delay the call to rdma_destroy_id() until we've released the listening id mutex. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08IB/uverbs: Lock SRQ / CQ / PD objects in a consistent orderRoland Dreier
Since XRC support was added, the uverbs code has locked SRQ, CQ and PD objects needed during QP and SRQ creation in different orders depending on the the code path. This leads to the (at least theoretical) possibility of deadlock, and triggers the lockdep splat below. Fix this by making sure we always lock the SRQ first, then CQs and finally the PD. ====================================================== [ INFO: possible circular locking dependency detected ] 3.4.0-rc5+ #34 Not tainted ------------------------------------------------------- ibv_srq_pingpon/2484 is trying to acquire lock: (SRQ-uobj){+++++.}, at: [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] but task is already holding lock: (CQ-uobj){+++++.}, at: [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (CQ-uobj){+++++.}: [<ffffffff81070fd0>] lock_acquire+0xbf/0xfe [<ffffffff81384f28>] down_read+0x34/0x43 [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffffa00af542>] idr_read_obj+0x9/0x19 [ib_uverbs] [<ffffffffa00b16c3>] ib_uverbs_create_qp+0x180/0x684 [ib_uverbs] [<ffffffffa00ae3dd>] ib_uverbs_write+0xb7/0xc2 [ib_uverbs] [<ffffffff810fe47f>] vfs_write+0xa7/0xee [<ffffffff810fe65f>] sys_write+0x45/0x69 [<ffffffff8138cdf9>] system_call_fastpath+0x16/0x1b -> #1 (PD-uobj){++++++}: [<ffffffff81070fd0>] lock_acquire+0xbf/0xfe [<ffffffff81384f28>] down_read+0x34/0x43 [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffffa00af542>] idr_read_obj+0x9/0x19 [ib_uverbs] [<ffffffffa00af8ad>] __uverbs_create_xsrq+0x96/0x386 [ib_uverbs] [<ffffffffa00b31b9>] ib_uverbs_detach_mcast+0x1cd/0x1e6 [ib_uverbs] [<ffffffffa00ae3dd>] ib_uverbs_write+0xb7/0xc2 [ib_uverbs] [<ffffffff810fe47f>] vfs_write+0xa7/0xee [<ffffffff810fe65f>] sys_write+0x45/0x69 [<ffffffff8138cdf9>] system_call_fastpath+0x16/0x1b -> #0 (SRQ-uobj){+++++.}: [<ffffffff81070898>] __lock_acquire+0xa29/0xd06 [<ffffffff81070fd0>] lock_acquire+0xbf/0xfe [<ffffffff81384f28>] down_read+0x34/0x43 [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffffa00af542>] idr_read_obj+0x9/0x19 [ib_uverbs] [<ffffffffa00b1728>] ib_uverbs_create_qp+0x1e5/0x684 [ib_uverbs] [<ffffffffa00ae3dd>] ib_uverbs_write+0xb7/0xc2 [ib_uverbs] [<ffffffff810fe47f>] vfs_write+0xa7/0xee [<ffffffff810fe65f>] sys_write+0x45/0x69 [<ffffffff8138cdf9>] system_call_fastpath+0x16/0x1b other info that might help us debug this: Chain exists of: SRQ-uobj --> PD-uobj --> CQ-uobj Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(CQ-uobj); lock(PD-uobj); lock(CQ-uobj); lock(SRQ-uobj); *** DEADLOCK *** 3 locks held by ibv_srq_pingpon/2484: #0: (QP-uobj){+.+...}, at: [<ffffffffa00b162c>] ib_uverbs_create_qp+0xe9/0x684 [ib_uverbs] #1: (PD-uobj){++++++}, at: [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] #2: (CQ-uobj){+++++.}, at: [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] stack backtrace: Pid: 2484, comm: ibv_srq_pingpon Not tainted 3.4.0-rc5+ #34 Call Trace: [<ffffffff8137eff0>] print_circular_bug+0x1f8/0x209 [<ffffffff81070898>] __lock_acquire+0xa29/0xd06 [<ffffffffa00af37c>] ? __idr_get_uobj+0x20/0x5e [ib_uverbs] [<ffffffffa00af51b>] ? idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffff81070fd0>] lock_acquire+0xbf/0xfe [<ffffffffa00af51b>] ? idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffff81070eee>] ? lock_release+0x166/0x189 [<ffffffff81384f28>] down_read+0x34/0x43 [<ffffffffa00af51b>] ? idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffffa00af51b>] idr_read_uobj+0x2f/0x4d [ib_uverbs] [<ffffffffa00af542>] idr_read_obj+0x9/0x19 [ib_uverbs] [<ffffffffa00b1728>] ib_uverbs_create_qp+0x1e5/0x684 [ib_uverbs] [<ffffffff81070fec>] ? lock_acquire+0xdb/0xfe [<ffffffff81070c09>] ? lock_release_non_nested+0x94/0x213 [<ffffffff810d470f>] ? might_fault+0x40/0x90 [<ffffffff810d470f>] ? might_fault+0x40/0x90 [<ffffffffa00ae3dd>] ib_uverbs_write+0xb7/0xc2 [ib_uverbs] [<ffffffff810fe47f>] vfs_write+0xa7/0xee [<ffffffff810ff736>] ? fget_light+0x3b/0x99 [<ffffffff810fe65f>] sys_write+0x45/0x69 [<ffffffff8138cdf9>] system_call_fastpath+0x16/0x1b Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08IB/uverbs: Make lockdep output more readableRoland Dreier
Add names for our lockdep classes, so instead of having to decipher lockdep output with mysterious names: Chain exists of: key#14 --> key#11 --> key#13 lockdep will give us something nicer: Chain exists of: SRQ-uobj --> PD-uobj --> CQ-uobj Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08IB/ipath: Replace open-coded ARRAY_SIZE with macroMike Marciniszyn
Change sizeof(array)/sizeof(array[0]) to ARRAY_SIZE. Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08IB/ipath: Replace open-coded ARRAY_SIZE with macroJim Cromie
Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08RDMA/cxgb4: Use dst parameter in import_ep()Steve Wise
Function import_ep() is incorrectly using ep->dst instead of the dst ptr passed in. This causes a crash when accepting new rdma connections becase ep->dst is not initialized yet. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Cc: <stable@vger.kernel.org> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08IB/core: Use qp->usecnt to track multicast attach/detachOr Gerlitz
Just as we don't allow PDs, CQs, etc. to be destroyed if there are QPs that are attached to them, don't let a QP be destroyed if there are multicast group(s) attached to it. Use the existing usecnt field of struct ib_qp which was added by commit 0e0ec7e ("RDMA/core: Export ib_open_qp() to share XRC TGT QPs") to track this. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-08Merge tag 'stable/for-linus-3.4-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull xen fixes from Konrad Rzeszutek Wilk: - fix to Kconfig to make it fit within 80 line characters, - two bootup fixes (AMD 8-core and with PCI BIOS), - cleanup code in a Xen PV fb driver, - and a crash fix when trying to see non-existent PTE's * tag 'stable/for-linus-3.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/Kconfig: fix Kconfig layout xen/pci: don't use PCI BIOS service for configuration space accesses xen/pte: Fix crashes when trying to see non-existent PGD/PMD/PUD/PTEs xen/apic: Return the APIC ID (and version) for CPU 0. drivers/video/xen-fbfront.c: add missing cleanup code
2012-05-08Merge branch 'for-3.4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull two percpu fixes from Tejun Heo: "One adds missing KERN_CONT on split printk()s and the other makes the percpu allocator avoid using PMD_SIZE as atom_size on x86_32. Using PMD_SIZE led to vmalloc area exhaustion on certain configurations (x86_32 android) and the only cost of using PAGE_SIZE instead is static percpu area not being aligned to large page mapping." * 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bit percpu: use KERN_CONT in pcpu_dump_alloc_info()
2012-05-08Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-armLinus Torvalds
Pull ARM fixes from Russell King: "This is mainly audit fixes, found by folks who happened to enable this feature and then found it broke their user applications." * 'fixes' of git://git.linaro.org/people/rmk/linux-arm: ARM: 7414/1: SMP: prevent use of the console when using idmap_pgd ARM: 7412/1: audit: use only AUDIT_ARCH_ARM regardless of endianness ARM: 7411/1: audit: fix treatment of saved ip register during syscall tracing ARM: 7410/1: Add extra clobber registers for assembly in kernel_execve
2012-05-08netfilter: nf_conntrack: fix explicit helper attachment and NATPablo Neira Ayuso
Explicit helper attachment via the CT target is broken with NAT if non-standard ports are used. This problem was hidden behind the automatic helper assignment routine. Thus, it becomes more noticeable now that we can disable the automatic helper assignment with Eric Leblond's: 9e8ac5a netfilter: nf_ct_helper: allow to disable automatic helper assignment Basically, nf_conntrack_alter_reply asks for looking up the helper up if NAT is enabled. Unfortunately, we don't have the conntrack template at that point anymore. Since we don't want to rely on the automatic helper assignment, we can skip the second look-up and stick to the helper that was attached by iptables. With the CT target, the user is in full control of helper attachment, thus, the policy is to trust what the user explicitly configures via iptables (no automatic magic anymore). Interestingly, this bug was hidden by the automatic helper look-up code. But it can be easily trigger if you attach the helper in a non-standard port, eg. iptables -I PREROUTING -t raw -p tcp --dport 8888 \ -j CT --helper ftp And you disabled the automatic helper assignment. I added the IPS_HELPER_BIT that allows us to differenciate between a helper that has been explicitly attached and those that have been automatically assigned. I didn't come up with a better solution (having backward compatibility in mind). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08regulator: wm8994: Use main I2C device as struct deviceMark Brown
This makes logging a bit clearer as it gives the actual bus location and makes things like board hookup a bit smoother. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-05-08netfilter: nf_ct_expect: partially implement ctnetlink_change_expectKelvie Wong
This refreshes the "timeout" attribute in existing expectations if one is given. The use case for this would be for userspace helpers to extend the lifetime of the expectation when requested, as this is not possible right now without deleting/recreating the expectation. I use this specifically for forwarding DCERPC traffic through: DCERPC has a port mapper daemon that chooses a (seemingly) random port for future traffic to go to. We expect this traffic (with a reasonable timeout), but sometimes the port mapper will tell the client to continue using the same port. This allows us to extend the expectation accordingly. Signed-off-by: Kelvie Wong <kelvie@ieee.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08net: export sysctl_[r|w]mem_max symbols needed by ip_vs_syncHans Schillstrom
To build ip_vs as a module sysctl_rmem_max and sysctl_wmem_max needs to be exported. The dependency was added by "ipvs: wakeup master thread" patch. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Signed-off-by: Simon Horman <horms@verge.net.au> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08ipvs: ip_vs_proto: local functions should not be exposed globallyH Hartley Sweeten
Functions not referenced outside of a source file should be marked static to prevent it from being exposed globally. This quiets the sparse warnings: warning: symbol '__ipvs_proto_data_get' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: ip_vs_ftp: local functions should not be exposed globallyH Hartley Sweeten
Functions not referenced outside of a source file should be marked static to prevent it from being exposed globally. This quiets the sparse warnings: warning: symbol 'ip_vs_ftp_init' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: optimize the use of flags in ip_vs_bind_destPablo Neira Ayuso
cp->flags is marked volatile but ip_vs_bind_dest can safely modify the flags, so save some CPU cycles by using temp variable. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: add support for sync threadsPablo Neira Ayuso
Allow master and backup servers to use many threads for sync traffic. Add sysctl var "sync_ports" to define the number of threads. Every thread will use single UDP port, thread 0 will use the default port 8848 while last thread will use port 8848+sync_ports-1. The sync traffic for connections is scheduled to many master threads based on the cp address but one connection is always assigned to same thread to avoid reordering of the sync messages. Remove ip_vs_sync_switch_mode because this check for sync mode change is still risky. Instead, check for mode change under sync_buff_lock. Make sure the backup socks do not block on reading. Special thanks to Aleksey Chudov for helping in all tests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: reduce sync rate with time thresholdsJulian Anastasov
Add two new sysctl vars to control the sync rate with the main idea to reduce the rate for connection templates because currently it depends on the packet rate for controlled connections. This mechanism should be useful also for normal connections with high traffic. sync_refresh_period: in seconds, difference in reported connection timer that triggers new sync message. It can be used to avoid sync messages for the specified period (or half of the connection timeout if it is lower) if connection state is not changed from last sync. sync_retries: integer, 0..3, defines sync retries with period of sync_refresh_period/8. Useful to protect against loss of sync messages. Allow sysctl_sync_threshold to be used with sysctl_sync_period=0, so that only single sync message is sent if sync_refresh_period is also 0. Add new field "sync_endtime" in connection structure to hold the reported time when connection expires. The 2 lowest bits will represent the retry count. As the sysctl_sync_period now can be 0 use ACCESS_ONCE to avoid division by zero. Special thanks to Aleksey Chudov for being patient with me, for his extensive reports and helping in all tests. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: wakeup master threadPablo Neira Ayuso
High rate of sync messages in master can lead to overflowing the socket buffer and dropping the messages. Fixed sleep of 1 second without wakeup events is not suitable for loaded masters, Use delayed_work to schedule sending for queued messages and limit the delay to IPVS_SYNC_SEND_DELAY (20ms). This will reduce the rate of wakeups but to avoid sending long bursts we wakeup the master thread after IPVS_SYNC_WAKEUP_RATE (8) messages. Add hard limit for the queued messages before sending by using "sync_qlen_max" sysctl var. It defaults to 1/32 of the memory pages but actually represents number of messages. It will protect us from allocating large parts of memory when the sending rate is lower than the queuing rate. As suggested by Pablo, add new sysctl var "sync_sock_size" to configure the SNDBUF (master) or RCVBUF (slave) socket limit. Default value is 0 (preserve system defaults). Change the master thread to detect and block on SNDBUF overflow, so that we do not drop messages when the socket limit is low but the sync_qlen_max limit is not reached. On ENOBUFS or other errors just drop the messages. Change master thread to enter TASK_INTERRUPTIBLE state early, so that we do not miss wakeups due to messages or kthread_should_stop event. Thanks to Pablo Neira Ayuso for his valuable feedback! Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: always update some of the flags bits in backupJulian Anastasov
As the goal is to mirror the inactconns/activeconns counters in the backup server, make sure the cp->flags are updated even if cp is still not bound to dest. If cp->flags are not updated ip_vs_bind_dest will rely only on the initial flags when updating the counters. To avoid mistakes and complicated checks for protocol state rely only on the IP_VS_CONN_F_INACTIVE bit when updating the counters. Signed-off-by: Julian Anastasov <ja@ssi.bg> Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: fix ip_vs_try_bind_dest to rebind app and transmitterJulian Anastasov
Initially, when the synced connection is created we use the forwarding method provided by master but once we bind to destination it can be changed. As result, we must update the application and the transmitter. As ip_vs_try_bind_dest is called always for connections that require dest binding, there is no need to validate the cp and dest pointers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: remove check for IP_VS_CONN_F_SYNC from ip_vs_bind_destJulian Anastasov
As the IP_VS_CONN_F_INACTIVE bit is properly set in cp->flags for all kind of connections we do not need to add special checks for synced connections when updating the activeconns/inactconns counters for first time. Now logic will look just like in ip_vs_unbind_dest. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: ignore IP_VS_CONN_F_NOOUTPUT in backup serverJulian Anastasov
As IP_VS_CONN_F_NOOUTPUT is derived from the forwarding method we should get it from conn_flags just like we do it for IP_VS_CONN_F_FWD_MASK bits when binding to real server. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: use GFP_KERNEL allocation where possibleSasha Levin
Use GFP_KERNEL instead of GFP_ATOMIC when registering an ipvs protocol. This is safe since it will always run from a process context. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08ipvs: SH scheduler does not need GFP_ATOMIC allocationJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: LBLCR scheduler does not need GFP_ATOMIC allocation on initJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: WRR scheduler does not need GFP_ATOMIC allocationJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: DH scheduler does not need GFP_ATOMIC allocationJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: LBLC scheduler does not need GFP_ATOMIC allocation on initJulian Anastasov
Schedulers are initialized and bound to services only on commands. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08ipvs: timeout tables do not need GFP_ATOMIC allocationJulian Anastasov
They are called only on initialization. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Hans Schillstrom <hans@schillstrom.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2012-05-08netfilter: bridge: optionally set indev to vlanPablo Neira Ayuso
if net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled, bridge netfilter removes the vlan header temporarily and then feeds the packet to ip(6)tables. When the new "bridge-nf-pass-vlan-input-device" sysctl is on (default off), then bridge netfilter will also set the in-interface to the vlan interface; if such an interface exists. This is needed to make iptables REDIRECT target work with "vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to match the vlan device name. Also update Documentation with current brnf default settings. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Bart De Schuymer <bdschuym@pandora.be> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: nf_conntrack: use this_cpu_inc()Eric Dumazet
this_cpu_inc() is IRQ safe and faster than local_bh_disable()/__this_cpu_inc()/local_bh_enable(), at least on x86. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08netfilter: nf_ct_helper: allow to disable automatic helper assignmentEric Leblond
This patch allows you to disable automatic conntrack helper lookup based on TCP/UDP ports, eg. echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper [ Note: flows that already got a helper will keep using it even if automatic helper assignment has been disabled ] Once this behaviour has been disabled, you have to explicitly use the iptables CT target to attach helper to flows. There are good reasons to stop supporting automatic helper assignment, for further information, please read: http://www.netfilter.org/news.html#2012-04-03 This patch also adds one message to inform that automatic helper assignment is deprecated and it will be removed soon (this is spotted only once, with the first flow that gets a helper attached to make it as less annoying as possible). Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08sfc: Fix division by zero when using one RX channel and no SR-IOVBen Hutchings
If RSS is disabled on the PF (efx->n_rx_channels == 1) we try to set up the indirection table so that VFs can use it, setting efx->rss_spread = efx_vf_size(efx). But if SR-IOV was disabled at compile time, this evaluates to 0 and we end up dividing by zero when initialising the table. I considered changing the fallback definition of efx_vf_size() to return 1, but its value is really meaningless if we are not going to enable VFs. Therefore add a condition of efx_sriov_wanted(efx) in efx_probe_interrupts(). Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2012-05-08regmap: Implement dev_get_regmap()Mark Brown
Use devres to implement dev_get_regmap(). This should mean that in almost all cases devices wishing to take advantage of framework features based on regmap shouldn't need to explicitly pass the regmap into the framework. This simplifies device setup a bit. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-05-08netfilter: nf_ct_ecache: refactor notifier registrationTony Zelenoff
* ret variable initialization removed as useless * similar code strings concatenated and functions code flow became more plain Signed-off-by: Tony Zelenoff <antonz@parallels.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-05-08etherdev.h: Convert int is_<foo>_ether_addr to boolJoe Perches
Make the return value explicitly true or false. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>