summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-05-09Merge branch 'netns-scalability'David S. Miller
Nicolas Dichtel says: ==================== netns: ease netlink use with a lot of netns This idea was informally discussed in Ottawa / netdev0.1. The goal is to ease the use/scalability of netns, from a userland point of view. Today, users need to open one netlink socket per family and per netns. Thus, when the number of netns inscreases (for example 5K or more), the number of sockets needed to manage them grows a lot. The goal of this series is to be able to monitor netlink events, for a specified family, for a set of netns, with only one netlink socket. For this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this socket will receive netlink notifications from all netns that have a nsid assigned into the netns where the socket has been opened. The nsid is sent to userland via an anscillary data. Here is an example with a patched iproute2. vxlan10 is created in the current netns (netns0, nsid 0) and then moved to another netns (netns1, nsid 1): $ ip netns exec netns0 ip monitor all-nsid label [nsid 0][NSID]nsid 1 (iproute2 netns name: netns1) [nsid 0][NEIGH]??? lladdr 00:00:00:00:00:00 REACHABLE,PERMANENT [nsid 0][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff [nsid 0][LINK]Deleted 5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff [nsid 1][NSID]nsid 0 (iproute2 netns name: netns0) [nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0 [nsid 1][ADDR]5: vxlan10 inet 192.168.0.249/24 brd 192.168.0.255 scope global vxlan10 valid_lft forever preferred_lft forever [nsid 1][ROUTE]local 192.168.0.249 dev vxlan10 table local proto kernel scope host src 192.168.0.249 [nsid 1][ROUTE]ff00::/8 dev vxlan10 table local metric 256 pref medium [nsid 1][ROUTE]2001:123::/64 dev vxlan10 proto kernel metric 256 pref medium [nsid 1][LINK]5: vxlan10@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default link/ether 92:33:17:e6:e7:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0 [nsid 1][ROUTE]broadcast 192.168.0.255 dev vxlan10 table local proto kernel scope link src 192.168.0.249 [nsid 1][ROUTE]192.168.0.0/24 dev vxlan10 proto kernel scope link src 192.168.0.249 [nsid 1][ROUTE]broadcast 192.168.0.0 dev vxlan10 table local proto kernel scope link src 192.168.0.249 [nsid 1][ROUTE]fe80::/64 dev vxlan10 proto kernel metric 256 pref medium ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netlink: allow to listen "all" netnsNicolas Dichtel
More accurately, listen all netns that have a nsid assigned into the netns where the netlink socket is opened. For this purpose, a netlink socket option is added: NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this socket will receive netlink notifications from all netns that have a nsid assigned into the netns where the socket has been opened. The nsid is sent to userland via an anscillary data. With this patch, a daemon needs only one socket to listen many netns. This is useful when the number of netns is high. Because 0 is a valid value for a nsid, the field nsid_is_set indicates if the field nsid is valid or not. skb->cb is initialized to 0 on skb allocation, thus we are sure that we will never send a nsid 0 by error to the userland. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netlink: rename private flags and statesNicolas Dichtel
These flags and states have the same prefix (NETLINK_) that netlink socket options. To avoid confusion and to be able to name a flag like a socket option, let's use an other prefix: NETLINK_[S|F]_. Note: a comment has been fixed, it was talking about NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: use a spin_lock to protect nsid managementNicolas Dichtel
Before this patch, nsid were protected by the rtnl lock. The goal of this patch is to be able to find a nsid without needing to hold the rtnl lock. The next patch will introduce a netlink socket option to listen to all netns that have a nsid assigned into the netns where the socket is opened. Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to avoid a recursive lock (nsid are notified via rtnl). This was the main reason of the previous patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: notify new nsid outside __peernet2id()Nicolas Dichtel
There is no functional change with this patch. It will ease the refactoring of the locking system that protects nsids and the support of the netlink socket option NETLINK_LISTEN_ALL_NSID. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: rename peernet2id() to peernet2id_alloc()Nicolas Dichtel
In a following commit, a new function will be introduced to only lookup for a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the existing function is renamed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: always provide the id to rtnl_net_fill()Nicolas Dichtel
The goal of this commit is to prepare the rework of the locking of nsnid protection. After this patch, rtnl_net_notifyid() will not call anymore __peernet2id(), ie no idr_* operation into this function. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netns: returns always an id in __peernet2id()Nicolas Dichtel
All callers of this function expect a nsid, not an error. Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge tag 'linux-can-next-for-4.2-20150506' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2015-05-06 this is a pull request of a seven patches for net-next/master. Andreas Gröger contributes two patches for the janz-ican3 driver. In the first patch, the documentation for already existing sysfs entries is added, the second patch adds support for another module/firmware variant. A patch by Shawn Landden makes the padding in the struct can_frame explicit. The next 4 patches target the flexcan driver, the first one is by David Jander adding some documentation, the reaming three by me add more documentation and two small code cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Arnd Bergmann: "A few patches have come up since the merge window. The largest one is a rewrite of the PXA lubbock/mainstone IRQ handling. This was already broken in 2011 by a change to the GPIO code and only noticed now. The other changes contained here are: MAINTAINERS file updates: - Ray Jui and Scott Branden are now co-maintainers for some of the mach-bcm chips, while Christian Daudt and Marc Carino have stepped down. - Andrew Victor is no longer maintaining at91. Instead, Alexandre Belloni now becomes an official maintainer, after having done a bulk of the work for a while. - Baruch Siach, who added the mach-digicolor platform in 4.1 is now listed as maintainer - The git URL for mach-socfpga has changed Bug fixes: - Three bug fixes for new rockchip rk3288 code - A regression fix to make SD card support work on certain ux500 boards - multiple smaller dts fixes for imx, omap, mvebu, and shmobile - a regression fiix for omap3 power consumption - a fix for regression in the ARM CCI bus driver Configuration changes: - more imx platforms are now enabled in multi_v7_defconfig" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (39 commits) MAINTAINERS: add Conexant Digicolor machines entry MAINTAINERS: socfpga: update the git repo for SoCFPGA ARM: multi_v7_defconfig: Select more FSL SoCs MAINTAINERS: replace an AT91 maintainer drivers: CCI: fix used_mask init in validate_group() bus: omap_l3_noc: Fix master id address decoding for OMAP5 bus: omap_l3_noc: Fix offset for DRA7 CLK1_HOST_CLK1_2 instance ARM: dts: dra7: Fix efuse register size for ABB ARM: dts: am57xx-beagle-x15: Switch GPIO fan number ARM: dts: am57xx-beagle-x15: Switch UART mux pins ARM: dts: am437x-sk: reduce col-scan-delay-us ARM: dts: am437x-sk: fix for new newhaven display module revision ARM: dts: am57xx-beagle-x15: Fix RTC aliases ARM: dts: am57xx-beagle-x15: Fix IRQ type for mcp7941x ARM: dts: omap3: Add #iommu-cells to isp and iva iommu ARM: omap2plus_defconfig: Enable EXTCON_USB_GPIO ARM: dts: OMAP3-N900: Add microphone bias voltages ARM: OMAP2+: Fix omap off idle power consumption creeping up MAINTAINERS: Update brcmstb entry MAINTAINERS: Remove Christian Daudt for mach-bcm ...
2015-05-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user-namespace fix from Eric Biederman: "Eric Windish recently reported a really bug that allows mounting fresh copies of proc and sysfs when it really should not be allowed. The code attempted to verify that proc and sysfs were fully visible but there is a test missing to ensure that the root of the filesystem is visible. Doh! The following patch fixes that. This fixes a containment issue that the docker folks are seeing" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mnt: Fix fs_fully_visible to verify the root directory is visible
2015-05-09Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Two patches from the irq departement: - a simple fix to make dummy_irq_chip usable for wakeup scenarios - removal of the gic arch_extn hackery. Now that all users are converted we really want to get rid of the interface so people wont come up with new use cases" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: gic: Drop support for gic_arch_extn genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip
2015-05-09Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A simple fix to actually shut down a detached device instead of keeping it active" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: Shutdown detached clockevent device
2015-05-09net: macb: Add change_mtu callback with jumbo supportHarini Katakam
Add macb_change_mtu callback; if jumbo frame support is present allow mtu size changes upto (jumbo max length allowed - headers). Signed-off-by: Harini Katakam <harinik@xilinx.com> Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: macb: Add support for jumbo framesHarini Katakam
Enable jumbo frame support for Zynq Ultrascale+ MPSoC. Update the NWCFG register and descriptor length masks accordingly. Jumbo max length register should be set according to support in SoC; it is set to 10240 for Zynq Ultrascale+ MPSoC. Signed-off-by: Harini Katakam <harinik@xilinx.com> Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: macb: Add compatible string for Zynq Ultrascale+ MPSoCHarini Katakam
Add compatible string and config structure for Zynq Ultrascale+ MPSoC Signed-off-by: Harini Katakam <harinik@xilinx.com> Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09devicetree: Add compatible string for Zynq Ultrascale+ MPSoCHarini Katakam
Add "cdns,zynqmp-gem" to be used for Zynq Ultrascale+ MPSoC. Signed-off-by: Harini Katakam <harinik@xilinx.com> Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: set SOCK_NOSPACE under memory pressureJason Baron
Under tcp memory pressure, calling epoll_wait() in edge triggered mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), even when there is sufficient memory available to continue making progress. The problem is that when __sk_mem_schedule() returns 0 under memory pressure, we do not set the SOCK_NOSPACE flag in the tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since SOCK_NOSPACE is used to trigger wakeups when incoming acks create sufficient new space in the write queue, all outstanding packets are acked, but we never wake up with the the EPOLLOUT that we are expecting from epoll_wait(). This issue is currently limited to epoll() when used in edge trigger mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. This is sufficient for poll()/select() and epoll() in level trigger mode. However, in edge trigger mode, epoll() is relying on the write path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we can only call epoll_wait() after read/write return -EAGAIN. Thus, in the case of the socket write, we are relying on the fact that tcp_sendmsg()/network write paths are going to issue a wakeup for us at some point in the future when we get -EAGAIN. Normally, epoll() edge trigger works fine when we've exceeded the sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when we return -EAGAIN from the write path b/c we are over the tcp memory limits and not b/c we are over the sndbuf, we are never going to get another wakeup. I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will return 0, or failure more readily with SO_SNDBUF: 1) create socket and set SO_SNDBUF to N 2) add socket as edge trigger 3) write to socket and block in epoll on -EAGAIN 4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when the socket is non-blocking. Note that SOCK_NOSPACE, in addition to waking up outstanding waiters is also used to expand the size of the sk->sndbuf. However, we will not expand it by setting it in this case because tcp_should_expand_sndbuf(), ensures that no expansion occurs when we are under tcp memory pressure. Note that we could still hang if sk->sk_wmem_queue is 0, when we get the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are waiting for and event that will never happen. I believe that this case is harder to hit (and did not hit in my testing), in that over the tcp 'soft' memory limits, we continue to guarantee a minimum write buffer size. Perhaps, we could return -ENOSPC in this case, or maybe we simply issue a wakeup in this case, such that we keep retrying the write. Note that this case is not specific to epoll() ET, but rather would affect blocking sockets as well. So I view this patch as bringing epoll() edge-trigger into sync with the current poll()/select()/epoll() level trigger and blocking sockets behavior. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09gianfar: Enable changing mac addr when if upClaudiu Manoil
Use device flag IFF_LIVE_ADDR_CHANGE to signal that the device supports changing the hardware address when the device is running. This allows eth_mac_addr() to change the mac address also when the network device's interface is open. This capability is required by certain applications, like bonding mode 6 (Adaptive Load Balancing). Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09gianfar: Move TxFIFO underrun handling to reset pathClaudiu Manoil
Handle TxFIFO underrun exceptions outside the fast path. A controller reset is more reliable in this exceptional case, as opposed to re-enabling on-the-fly the Tx DMA. As the controller reset is handled outside the fast path by the reset_gfar() workqueue handler, the locking scheme on the Tx path is significantly simplified. Because the Tx processing (xmit queues and tx napi) is disabled during controller reset, tstat access from xmit does not require locking. So the scope of the txlock on the processing path is now reduced to num_txbdfree, which is shared only between process context (xmit) and softirq (clean_tx_ring). As a result, the txlock must not guard against interrupt context, and the spin_lock_irqsave() from xmit can be replaced by spin_lock_bh(). Likewise, the locking has been downgraded for clean_tx_ring(). Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge branch 'bpf_seccomp'David S. Miller
Daniel Borkmann says: ==================== BPF updates This set gets rid of BPF special handling in seccomp filter preparation and provides generic infrastructure from BPF side, which eventually also allows for classic BPF JITs to add support for seccomp filters. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09seccomp, filter: add and use bpf_prog_create_from_user from seccompDaniel Borkmann
Seccomp has always been a special candidate when it comes to preparation of its filters in seccomp_prepare_filter(). Due to the extra checks and filter rewrite it partially duplicates code and has BPF internals exposed. This patch adds a generic API inside the BPF code code that seccomp can use and thus keep it's filter preparation code minimal and better maintainable. The other side-effect is that now classic JITs can add seccomp support as well by only providing a BPF_LDX | BPF_W | BPF_ABS translation. Tested with seccomp and BPF test suites. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nicolas Schichan <nschichan@freebox.fr> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: filter: add __GFP_NOWARN flag for larger kmem allocsDaniel Borkmann
When seccomp BPF was added, it was discussed to add __GFP_NOWARN flag for their configuration path as f.e. up to 32K allocations are more prone to fail under stress. As we're going to reuse BPF API, add __GFP_NOWARN flags where larger kmalloc() and friends allocations could fail. It doesn't make much sense to pass around __GFP_NOWARN everywhere as an extra argument only for seccomp while we just as well could run into similar issues for socket filters, where it's not desired to have a user application throw a WARN() due to allocation failure. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Nicolas Schichan <nschichan@freebox.fr> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filterNicolas Schichan
Remove the calls to bpf_check_classic(), bpf_convert_filter() and bpf_migrate_runtime() and let bpf_prepare_filter() take care of that instead. seccomp_check_filter() is passed to bpf_prepare_filter() so that it gets called from there, after bpf_check_classic(). We can now remove exposure of two internal classic BPF functions previously used by seccomp. The export of bpf_check_classic() symbol, previously known as sk_chk_filter(), was there since pre git times, and no in-tree module was using it, therefore remove it. Joint work with Daniel Borkmann. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: filter: add a callback to allow classic post-verifier transformationsNicolas Schichan
This is in preparation for use by the seccomp code, the rationale is not to duplicate additional code within the seccomp layer, but instead, have it abstracted and hidden within the classic BPF API. As an interim step, this now also makes bpf_prepare_filter() visible (not as exported symbol though), so that seccomp can reuse that code path instead of reimplementing it. Joint work with Daniel Borkmann. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge tag 'mac80211-next-for-davem-2015-05-06' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Lots of updates for net-next for this cycle. As usual, we have a lot of small fixes and cleanups, the bigger items are: * proper mac80211 rate control locking, to fix some random crashes (this required changing other locking as well) * mac80211 "fast-xmit", a mechanism to reduce, in most cases, the amount of code we execute while going from ndo_start_xmit() to the driver * this also clears the way for properly supporting S/G and checksum and segmentation offloads ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09usbnet: avoid integer overflow in start_xmitJason A. Donenfeld
transfer_buffer_length is of type u32. It's therefore wrong to assign it to a signed integer. This patch avoids the overflow. It's worth noting that entry->length here is a long; perhaps it would be beneficial at somepoint to change this to be unsigned as well, if nothing else relies on its signedness for error conditions or the like. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)Tony Camuso
This patch should have been part of the previous patch having the same summary. See http://marc.info/?l=linux-kernel&m=143039470103795&w=2 Unfortunately, I didn't check to see where else this lock was used before submitting that patch. This should take care of it for netxen_nic, as I did a thorough search this time. To recap from the original patch; although testing this driver with DEBUG_LOCKDEP and DEBUG_SPINLOCK enabled did not produce any traces, it would be more prudent in the case of tx_clean_lock to use _bh versions of spin_[un]lock, since this lock is manipulated in both the process and softirq contexts. This patch was tested for functionality and regressions with netperf and DEBUG_LOCKDEP and DEBUG_SPINLOCK enabled. Signed-off-by: Tony Camuso <tcamuso@redhat.com> Acked-By: Neil Horman <nhorman@tuxdriver.com> Acked-By: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge branch 'tcp-more-reliable-window-probes'David S. Miller
Eric Dumazet says: ==================== tcp: more reliable window probes This series address a problem caused by small rto_min timers in DC, leading to either timer storms or early flow terminations. We also add two new SNMP counters for proper monitoring : TCPWinProbe and TCPKeepAlive v2: added TCPKeepAlive counter, as suggested by Yuchung & Neal ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: add TCPWinProbe and TCPKeepAlive SNMP countersEric Dumazet
Diagnosing problems related to Window Probes has been hard because we lack a counter. TCPWinProbe counts the number of ACK packets a sender has to send at regular intervals to make sure a reverse ACK packet opening back a window had not been lost. TCPKeepAlive counts the number of ACK packets sent to keep TCP flows alive (SO_KEEPALIVE) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tcp: adjust window probe timers to safer valuesEric Dumazet
With the advent of small rto timers in datacenter TCP, (ip route ... rto_min x), the following can happen : 1) Qdisc is full, transmit fails. TCP sets a timer based on icsk_rto to retry the transmit, without exponential backoff. With low icsk_rto, and lot of sockets, all cpus are servicing timer interrupts like crazy. Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL) 2) Receivers can send zero windows if they don't drain their receive queue. TCP sends zero window probes, based on icsk_rto current value, with exponential backoff. With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in some cases), sender can abort in less than one or two minutes ! If receiver stops the sender, it obviously doesn't care of very tight rto. Probability of dropping the ACK reopening the window is not worth the risk. Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these events (but not normal RTO based retransmits) A followup patch adds a new SNMP counter, as it would have helped a lot diagnosing this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tipc: send explicit not supported error in nl compatRichard Alpe
The legacy netlink API treated EPERM (permission denied) as "operation not supported". Reported-by: Tomi Ollila <tomi.ollila@iki.fi> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tipc: add broadcast link window set/get to nl apiRichard Alpe
Add the ability to get or set the broadcast link window through the new netlink API. The functionality was unintentionally missing from the new netlink API. Adding this means that we also fix the breakage in the old API when coming through the compat layer. Fixes: 37e2d4843f9e (tipc: convert legacy nl link prop set to nl compat) Reported-by: Tomi Ollila <tomi.ollila@iki.fi> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09tipc: fix default link prop regression in nl compatRichard Alpe
Default link properties can be set for media or bearer. This functionality was missed when introducing the NL compatibility layer. This patch implements this functionality in the compat netlink layer. It works the same way as it did in the old API. We search for media and bearers matching the "link name". If we find a matching media or bearer the link tolerance, priority or window is used as default for new links on that media or bearer. Fixes: 37e2d4843f9e (tipc: convert legacy nl link prop set to nl compat) Reported-by: Tomi Ollila <tomi.ollila@iki.fi> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09cxgb4: Initialize RSS mode for all PortsHariprasad Shenai
Implements t4_init_rss_mode() to initialize the rss_mode for all the ports. If Tunnel All Lookup isn't specified in the global RSS Configuration, then we need to specify a default Ingress Queue for any ingress packets which aren't hashed. We'll use our first ingress queue. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge branch 'be2net'David S. Miller
Sathya Perla says: ==================== be2net: patch-set The following patch-set has two new feature additions, and a few minor fixes and cleanups. Pls consider applying to the net-next tree. Thanks. v2 changes: a) dropped the "don't enable pause by default" patch b) described how the "spoof check" works in patch 1's commit log c) I had to update our email addresses from "@emulex" to "@avagotech". I'll send a separate patch updating the maintainers Patch 1 adds support for the "spoofchk" knob for VFs. When it is enabled, "spoof checking" is done for both MAC-address and VLAN. For each VF, the HW ensures that the source MAC address (or vlan) of every outgoing packet of the VF exists in the MAC-list (or vlan-list) configured for RX filtering for that VF. If not, the packet is dropped and an error is reported to the driver in the TX completion. Patch 2 improves interrupt moderation on Skyhawk-R chip by using the EQ-DB mechanism to set a "re-arm to interrupt" delay. Currently interrupt moderation is adjusted by calculating and configuring an EQ-delay every second. This is done via a FW-cmd. This patch uses the EQ_DB facility to calculate and set the interrupt delay every 1ms. This helps moderating interrupts better when the traffic is bursty. Patch 3 adds L3/L4 error accounting to BE3 VFs, by passing L3/4 error packets to the network stack. Patch 4 adds an extra FW-cmd error value check in the driver to identify an "out of vlan filters" scenario. Patch 5 stops enabling pause by default as this setting fails in some HW-configs where priority pause is enabled in FW. If the user tries to do the same, an appropriate error is returned via ethtool. Patch 5 posts the full RXQ in be_open() to prevent packet drops due to bursty traffic when the interface is enabled. Patch 6 refactors the be_check_ufi_compatibility() routine, that checks to see if a UFI file meant for a lower rev of a chip is being flashed on a higher rev, to make it simpler. Patch 7 replaces the usage of !be_physfn() macro with be_virtfn() that is already avialble in the driver. Patch 8 updates the year in the copyright text to 2015. Path 9 bumps up the driver version to 10.6.02. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: update the driver version to 10.6.0.2Sathya Perla
Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: update copyright year to 2015Vasundhara Volam
Signed-off-by: Vasundhara Volam <vasundhara.volam@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: use be_virtfn() instead of !be_physfn()Kalesh AP
Use be_virtfn() to determine a VF instead of !be_physfn() for better readability. Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: simplify UFI compatibility checkingVasundhara Volam
The code in be_check_ufi_compatibility() checks to see if a UFI file meant for a lower rev of a chip is being flashed on a higher rev, which is disallowed. This patch re-writes the code needed for this check in a much simpler manner. Signed-off-by: Vasundhara Volam <vasundhara.volam@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: post full RXQ on interface enableSuresh Reddy
When an RXQ is created in be_open(), the driver currently posts only 64 buffers. This sometimes results in packet drops when there is a traffic burst as soon as the interface is enabled. This patch fixes this problem by posting the full RXQ on interface enable. Signed-off-by: Suresh Reddy <Suresh.Reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: check for INSUFFICIENT_VLANS errorKalesh AP
When the FW runs out of vlan filters it can either return an INSUFFICIENT_RESOURCES error or an INSUFFICIENT_VLANS error. The driver currently checks only for the former error value. This patch adds a check for the latter value too. Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: receive pkts with L3, L4 errors on VFsSomnath Kotur
Currently pkts with L3 or L4 errors received on PFs are not dropped by the adapter, but instead sent to the stack. This helps the network stack to better reflect error statistics. This was not being done on BE3 VFs. This patch fixes this for BE3 VFs. Signed-off-by: Somnath Kotur <somnath.kotur@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: set interrupt moderation for Skyhawk-R using EQ-DBPadmanabh Ratnakar
Currently adaptive interrupt moderation is set by calculating and configuring an EQ-delay every second. This is done via a FW-cmd. But, on Skyhawk-R a "re-arm to interrupt" delay can be set while ringing the EQ-DB. This patch uses this facility to calculate and set the interrupt delay every 1ms. This helps moderating interrupts better when the traffic is bursty. Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09be2net: add support for spoofchk settingKalesh AP
This patch adds support for spoofchk configuration for VFs. When it is enabled, "spoof checking" is done for both MAC-address and VLAN. For each VF, the HW ensures that the source MAC address (or vlan) of every outgoing packet exists in the MAC-list (or vlan-list) configured for RX filtering for that VF. If not, the packet is dropped and an error is reported to the driver in the TX completion; this is reflected in the "tx_spoof_check_err" ethtool counter. This feature is supported in Skyhawk FW version 10.6.31.0 and above. Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: xgene_enet: Set hardware dependencyJean Delvare
The xgene_enet driver is only useful on X-Gene SoC. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Iyappan Subramanian <isubramanian@apm.com> Cc: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09net: amd-xgbe: Add hardware dependencyJean Delvare
The amd-xgbe driver currently only works with the Seattle SoC, which is ARM64 architecture, so there is no point in building this driver on other architectures except for build testing purpose. The dependency list can be updated later if the driver ever supports other architectures. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09Merge branch 'sfc-next'David S. Miller
Shradha Shah says: ==================== sfc: Enabling EF10 Vf's, set up vswitching and bind the SFC driver to the VF's This set of patches makes way for the implementation of EF10 SR-IOV driver starting with some cleanup code. NIC specific SR-IOV functions are moved to their own header and netdev_ops are made generic instead of being NIC specific Next in line comes the patch to enable VF's using sriov_configure. VEB vswitching hierarchy is set up next followed by patches to prepare sfc driver to bind to enabled VF's This is followed by patch to support use of shared RSS contexts which makes VF's use shared RSS contexts in all cases. Patch series ends with a patch to bind the sfc driver to the enabled VF's which creates network interfaces corresponding to the VF's. Coming up soon are the patches to set_vf_mac, set_vf_config, set_vf_vlan, vf_spoofcheck, etc. These patches have been tested with and without CONFIG_SFC_SRIOV. In the case of CONFIG_SFC_SRIOV=y enabling of VF's using sriov_configure is also tested. The enabled VF's bind to the installed sfc driver succesfully to create network interfaces. In the case of CONFIG_SFC_SRIOV=n enabling of VF's using sriov_configure returns the correct error message: "Function not implemented". ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09sfc: Bind the sfc driver to any available VF'sShradha Shah
Add the device ID of the VF to the PCI device ID table. Added a boolean flag is_vf in efx_nic_type to differentiate between a VF and PF at probe time. This flag is useful in later patches while setting MAC address specially in the PCI-passthrough case. Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09sfc: Add use of shared RSS contexts.Jon Cooper
Allow PFs to allocate shared RSS contexts if we exhaust our exclusive RSS contexts. Make VFs use shared RSS contexts in all cases. Spruce up error handling so that the shadow copy of the RSS table is updated after successful update, rather than in all cases, so that we report the actual contents of the RSS table after a failure to set it, rather than what we'd like it to be. Populate context_size parameter when vacuously allocating RSS context of size 1. Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>