summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-01-04soreuseport: fast reuseport UDP socket selectionCraig Gallek
Include a struct sock_reuseport instance when a UDP socket binds to a specific address for the first time with the reuseport flag set. When selecting a socket for an incoming UDP packet, use the information available in sock_reuseport if present. This required adding an additional field to the UDP source address equality function to differentiate between exact and wildcard matches. The original use case allowed wildcard matches when checking for existing port uses during bind. The new use case of adding a socket to a reuseport group requires exact address matching. Performance test (using a machine with 2 CPU sockets and a total of 48 cores): Create reuseport groups of varying size. Use one socket from this group per user thread (pinning each thread to a different core) calling recvmmsg in a tight loop. Record number of messages received per second while saturating a 10G link. 10 sockets: 18% increase (~2.8M -> 3.3M pkts/s) 20 sockets: 14% increase (~2.9M -> 3.3M pkts/s) 40 sockets: 13% increase (~3.0M -> 3.4M pkts/s) This work is based off a similar implementation written by Ying Cai <ycai@google.com> for implementing policy-based reuseport selection. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04soreuseport: define reuseport groupsCraig Gallek
struct sock_reuseport is an optional shared structure referenced by each socket belonging to a reuseport group. When a socket is bound to an address/port not yet in use and the reuseport flag has been set, the structure will be allocated and attached to the newly bound socket. When subsequent calls to bind are made for the same address/port, the shared structure will be updated to include the new socket and the newly bound socket will reference the group structure. Usually, when an incoming packet was destined for a reuseport group, all sockets in the same group needed to be considered before a dispatching decision was made. With this structure, an appropriate socket can be found after looking up just one socket in the group. This shared structure will also allow for more complicated decisions to be made when selecting a socket (eg a BPF filter). This work is based off a similar implementation written by Ying Cai <ycai@google.com> for implementing policy-based reuseport selection. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04Merge branch 'mlxsw-fixes'David S. Miller
Jiri Pirko says: ==================== mlxsw: couple of fixes Couple of fixes from Ido. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04mlxsw: spectrum: Change bridge port attributes only when bridgedIdo Schimmel
Bridge port attributes are offloaded to hardware when invoked with SELF flag set, but it really makes no sense to reflect them when port is not bridged. Allow a user to change these attribute only when port is bridged and initialize them correctly when joining or leaving a bridge. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04mlxsw: spectrum: Set bridge status in appropriate functionsIdo Schimmel
Set the bridge status of physical ports in the appropriate functions, to be consistent with LAG join/leave and vPorts joining/leaving bridge. Also, remove the error messages in these two functions, as we already emit errors in both the single functions they call. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04mlxsw: spectrum: Return NOTIFY_BAD on bridge failureIdo Schimmel
It is possible for us to fail when joining or leaving a bridge, so let the user know about that by returning NOTIFY_BAD, as already done for LAG join/leave and 802.1D bridges. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04mlxsw: spectrum: Initialize PVID only onceIdo Schimmel
We set PVID to 1 in mlxsw_sp_port_vlan_init(), so we can remove this statement. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04r8152: add reset_resume functionhayeswang
When the reset_resume() is called, the flag of SELECTIVE_SUSPEND should be cleared and reinitialize the device, whether the SELECTIVE_SUSPEND is set or not. If reset_resume() is called, it means the power supply is cut or the device is reset. That is, the device wouldn't be in runtime suspend state and the reinitialization is necessary. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04chelsio: constify cphy_ops structuresJulia Lawall
The cphy_ops structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04fsl/fman: allow modular buildArnd Bergmann
ARM allmodconfig fails because of the addition of the FMAN driver: drivers/built-in.o: In function `dtsec_restart_autoneg': binder.c:(.text+0x173328): undefined reference to `mdiobus_read' binder.c:(.text+0x173348): undefined reference to `mdiobus_write' drivers/built-in.o: In function `dtsec_config': binder.c:(.text+0x173d24): undefined reference to `of_phy_find_device' drivers/built-in.o: In function `init_phy': binder.c:(.text+0x1763b0): undefined reference to `of_phy_connect' drivers/built-in.o: In function `stop': binder.c:(.text+0x176014): undefined reference to `phy_stop' drivers/built-in.o: In function `start': binder.c:(.text+0x176078): undefined reference to `phy_start' The reason is that the driver uses PHYLIB, but that is a loadable module here, and fman itself is built-in. This patch makes it possible to configure fman as a module as well so we don't change the status of PHYLIB in an allmodconfig kernel, and it adds a 'select PHYLIB' statement to ensure that phylib is always built-in when fman is. The driver uses "builtin_platform_driver(fman_driver);", which means it cannot be unloaded, but it's still possible to have it as a loadable module that gets loaded once and never removed. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5adae51a64b8 ("fsl/fman: Add FMan MURAM support") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04net: make ip6tunnel_xmit definition conditionalArnd Bergmann
Moving the caller of iptunnel_xmit_stats causes a build error in randconfig builds that disable CONFIG_INET: In file included from ../net/xfrm/xfrm_input.c:17:0: ../include/net/ip6_tunnel.h: In function 'ip6tunnel_xmit': ../include/net/ip6_tunnel.h:93:2: error: implicit declaration of function 'iptunnel_xmit_stats' [-Werror=implicit-function-declaration] iptunnel_xmit_stats(dev, pkt_len); The reason is that the iptunnel_xmit_stats definition is hidden inside #ifdef CONFIG_INET but the caller is not. We can change one or the other to fix it, and this patch adds a second #ifdef around ip6tunnel_xmit() to avoid seeing the invalid call. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04Merge tag 'nfc-next-4.5-1' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz says: ==================== NFC 4.5 pull request This is the first NFC pull request for 4.5 and it brings: - A new driver for the STMicroelectronics ST95HF NFC chipset. The ST95HF is an NFC digital transceiver with an embedded analog front-end and as such relies on the Linux NFC digital implementation. This is the 3rd user of the NFC digital stack. - ACPI support for the ST st-nci and st21nfca drivers. - A small improvement for the nfcsim driver, as we can now tune the Rx delay through sysfs. - A bunch of minor cleanups and small fixes from Christophe Ricard, for a few drivers and the NFC core code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04connector: bump skb->users before callback invocationFlorian Westphal
Dmitry reports memleak with syskaller program. Problem is that connector bumps skb usecount but might not invoke callback. So move skb_get to where we invoke the callback. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05i2c: ibm_iic: rename i2c_timings struct due to clash with generic versionStephen Rothwell
Fixes: e1dba01ca620 ("i2c: add generic routine to parse DT for timing information") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2016-01-05cpufreq-dt: fix handling regulator_get_voltage() resultAndrzej Hajda
The function can return negative values so it should be assigned to signed type. The problem has been detected using proposed semantic patch scripts/coccinelle/tests/unsigned_lesser_than_zero.cocci. Link: http://permalink.gmane.org/gmane.linux.kernel/2038576 Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-05cpufreq: governor: Fix negative idle_time when configured with ↵Chen Yu
CONFIG_HZ_PERIODIC It is reported that, with CONFIG_HZ_PERIODIC=y cpu stays at the lowest frequency even if the usage goes to 100%, neither ondemand nor conservative governor works, however performance and userspace work as expected. If set with CONFIG_NO_HZ_FULL=y, everything goes well. This problem is caused by improper calculation of the idle_time when the load is extremely high(near 100%). Firstly, cpufreq_governor uses get_cpu_idle_time to get the total idle time for specific cpu, then: 1.If the system is configured with CONFIG_NO_HZ_FULL, the idle time is returned by ktime_get, which is always increasing, it's OK. 2.However, if the system is configured with CONFIG_HZ_PERIODIC, get_cpu_idle_time might not guarantee to be always increasing, because it will leverage get_cpu_idle_time_jiffy to calculate the idle_time, consider the following scenario: At T1: idle_tick_1 = total_tick_1 - user_tick_1 sample period(80ms)... At T2: ( T2 = T1 + 80ms): idle_tick_2 = total_tick_2 - user_tick_2 Currently the algorithm is using (idle_tick_2 - idle_tick_1) to get the delta idle_time during the past sample period, however it CAN NOT guarantee that idle_tick_2 >= idle_tick_1, especially when cpu load is high. (Yes, total_tick_2 >= total_tick_1, and user_tick_2 >= user_tick_1, but how about idle_tick_2 and idle_tick_1? No guarantee.) So governor might get a negative value of idle_time during the past sample period, which might mislead the system that the idle time is very big(converted to unsigned int), and the busy time is nearly zero, which causes the governor to always choose the lowest cpufreq, then cause this problem. In theory there are two solutions: 1.The logic should not rely on the idle tick during every sample period, but be based on the busy tick directly, as this is how 'top' is implemented. 2.Or the logic must make sure that the idle_time is strictly increasing during each sample period, then there would be no negative idle_time anymore. This solution requires minimum modification to current code and this patch uses method 2. Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821 Reported-by: Jan Fikar <j.fikar@gmail.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-04cciss: print max outstanding commands as a hex valueColin Ian King
The max outstanding commands is being printed with a 0x prefix to suggest it is a hex value, when in fact the integer decimal %d format specifier is being used and this is a bit confusing. Use %x instead to match the proceeding 0x prefix. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-01-04x86: ftrace: Fix the comments for ftrace_modify_code_direct()Li Bin
There is no need to worry about module and __init text disappearing case, because that ftrace has a module notifier that is called when a module is being unloaded and before the text goes away and this code grabs the ftrace_lock mutex and removes the module functions from the ftrace list, such that it will no longer do any modifications to that module's text, the update to make functions be traced or not is done under the ftrace_lock mutex as well. And by now, __init section codes should not been modified by ftrace, because it is black listed in recordmcount.c and ignored by ftrace. Link: http://lkml.kernel.org/r/1449367378-29430-6-git-send-email-huawei.libin@huawei.com Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-01-04udp: properly support MSG_PEEK with truncated buffersEric Dumazet
Backport of this upstream commit into stable kernels : 89c22d8c3b27 ("net: Fix skb csum races when peeking") exposed a bug in udp stack vs MSG_PEEK support, when user provides a buffer smaller than skb payload. In this case, skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov); returns -EFAULT. This bug does not happen in upstream kernels since Al Viro did a great job to replace this into : skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg); This variant is safe vs short buffers. For the time being, instead reverting Herbert Xu patch and add back skb->ip_summed invalid changes, simply store the result of udp_lib_checksum_complete() so that we avoid computing the checksum a second time, and avoid the problematic skb_copy_and_csum_datagram_iovec() call. This patch can be applied on recent kernels as it avoids a double checksumming, then backported to stable kernels as a bug fix. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04cxgb4: correctly handling failed allocationInsu Yun
Since t4_alloc_mem can be failed in memory pressure, if not properly handled, NULL dereference could be happened. Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04qlcnic: correctly handle qlcnic_alloc_mbx_argsInsu Yun
Since qlcnic_alloc_mbx_args can be failed, return value should be checked. Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05drm/nouveau/gr/nv40: fix oops in interrupt handlerBen Skeggs
fdo#93557 Signed-off-by: Ben Skeggs <bskeggs@redhat.com> Cc: stable@vger.kernel.org
2016-01-04Merge branch 'r8169-hw-programming-typo-fixes'David S. Miller
Chunhao Lin says: ==================== Fix some typos in setting hardware parameter The typos are in setting RTL8168DP, RTL8168EP and RTL8168H hardware parameters. This series of patch fix these typos. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04r8169:Correct the way of setting RTL8168DP ephyChun-Hao Lin
The original way is wrong, it always writes ephy reg 0x03. Signed-off-by: Chunhao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04r8169:Fix typo in setting RTL8168H PHY PFM mode.Chun-Hao Lin
The PHY PFM register is in PHY page 0x0a44 register 0x11, not 0x14. Signed-off-by: Chunhao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04r8169:Fix typo in setting RTL8168EP and RTL8168H D3cold PFM modeChun-Hao Lin
The register for setting D3code PFM mode is MISC_1, not DLLPR. Signed-off-by: Chunhao Lin <hau@realtek.com> Reviewed-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04l2tp: rely on ppp layer for skb scrubbingGuillaume Nault
Since 79c441ae505c ("ppp: implement x-netns support"), the PPP layer calls skb_scrub_packet() whenever the skb is received on the PPP device. Manually resetting packet meta-data in the L2TP layer is thus redundant. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04PM / sleep: Add support for read-only sysfs attributesRafael J. Wysocki
Some sysfs attributes in /sys/power/ should really be read-only, so add support for that, convert those attributes to read-only and drop the stub .show() routines from them. Original-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-04ACPI: Fix white space in a structure definitionLukas Wunner
Add a missing space in the definition of struct acpi_device_bus_id. Signed-off-by: Lukas Wunner <lukas@wunner.de> [ rjw: Subject and changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-04ACPI / SBS: fix inconsistent indenting inside if statementColin Ian King
The indenting in acpi_battery_set_alarm is inconsistent and has been so since 2007; commit 94f6c0860139da9219255b8ff45ad42117dda859 ("ACPI: SBS: Add support for power_supply class (and sysfs)"). Minor fix for this, no code functionality change. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-04PNP: respect PNP_DRIVER_RES_DO_NOT_CHANGE when detachingHeiner Kallweit
I have a device (Nuvoton 6779D Super-IO IR RC with nuvoton-cir driver) which works after initial boot but not any longer if I unload and re-load the driver module. Digging into the issue I found that unloading the driver calls pnp_disable_dev although the driver has flag PNP_DRIVER_RES_DO_NOT_CHANGE set. IMHO this is not right. Let's have a look at the call chain when probing a device: pnp_device_probe 1. attaches the device 2. if it's not active and PNP_DRIVER_RES_DO_NOT_CHANGE is not set it gets activated 3. probes driver I think pnp_device_remove should do it in reverse order and also respect PNP_DRIVER_RES_DO_NOT_CHANGE. Therefore: 1. call drivers remove callback 2. if device is active and PNP_DRIVER_RES_DO_NOT_CHANGE is not set disable it 3. detach device The change works for me and sounds logical to me. However I don't know the pnp driver in detail so I might be wrong. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-04Merge branch 'sh_eth-remove-BE-desc-support'David S. Miller
Sergei Shtylyov says: ==================== sh_eth: remove unused BE descriptor support Here's a set of 2 patches against DaveM's 'net-next.git' repo plus the recently merged to 'net.git' repo fix for the 16-bit descriptor endianness. We get rid of ~30 LoCs and ~300 bytes of code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04sh_eth: get rid of {cpu|edmac}_to_{edmac|cpu}()Sergei Shtylyov
Now that {cpu|edmac}_to_{edmac|cpu}() functions boiled down to the mere {cpu|le32}_to_{le32|cpu}() calls, there's no need for these functions anymore, so just get rid of them. Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04sh_eth: remove EDMAC_BIG_ENDIANSergei Shtylyov
Commit 71557a37adb5 ("[netdrvr] sh_eth: Add SH7619 support") added support for the big-endian EDMAC descriptors. However, it was never used and never worked right until the recent driver fixes. I think we now can just remove this support, it was only burdening the driver from the start. It should be easy to do without disturbing the SH platform code, at least for now... Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04ACPI / PNP: constify device IDsMathias Krause
Instead of re-creating the array on the stack each time is_cmos_rtc_device() gets called, make the array 'static const'. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-01-05Merge branch 'xfs-dax-fixes-for-4.5' into for-nextDave Chinner
2016-01-05Merge branch 'xfs-misc-fixes-for-4.5' into for-nextDave Chinner
2016-01-04ACPI / PCI: Simplify acpi_penalize_isa_irq()Rafael J. Wysocki
acpi_penalize_isa_irq() can be written in fewer lines of code, so do that. No functional change. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Works-for: Andy Shevchenko <andy.shevchenko@gmail.com>
2016-01-04tilepro: use to_delayed_workGeliang Tang
Use to_delayed_work() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04ACPICA: Drop Linux-specific waking vector functionsRafael J. Wysocki
Commit f06147f9fbf1 (ACPICA: Hardware: Enable firmware waking vector for both 32-bit and 64-bit FACS) added three functions that aren't present in upstream ACPICA, acpi_hw_set_firmware_waking_vectors(), acpi_set_firmware_waking_vectors() and acpi_set_firmware_waking_vector64(), to allow Linux to use the previously existing API for setting the platform firmware waking vector. However, that wasn't necessary, since the ACPI sleep support code in Linux can be modified to use the upstream ACPICA's API easily and the additional functions may be dropped which reduces the code size and puts the kernel's ACPICA code more in line with the upstream. Make the changes as per the above. While at it, make the relevant function desctiption comments reflect the upstream ACPICA's ones. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Lv Zheng <lv.zheng@intel.com>
2016-01-04Merge branch 'bnxt_en-combined-rx-tx-channels'David S. Miller
Michael Chan says: ==================== bnxt_en: Support combined and rx/tx channels. The bnxt hardware uses a completion ring for rx and tx events. The driver has to process the completion ring entries sequentially for the events. The current code only supports an rx/tx ring pair for each completion ring. This patch series add support for using a dedicated completion ring for rx only or tx only as an option configuarble using ethtool -L. The benefits for using dedicated completion rings are: 1. A burst of rx packets can cause delay in processing tx events if the completion ring is shared. If tx queue is stopped by BQL, this can cause delay in re-starting the tx queue. 2. A completion ring is sized according to the rx and tx ring size rounded up to the nearest power of 2. When the completion ring is shared, it is sized by adding the rx and tx ring sizes and then rounded to the next power of 2, often with a lot of wasted space. 3. Using dedicated completion ring, we can adjust the tx and rx coalescing parameters independently for rx and tx. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Modify ethtool -l|-L to support combined or rx/tx rings.Michael Chan
The driver can support either all combined or all rx/tx rings. The default is combined, but the user can now select rx/tx rings. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Modify init sequence to support shared or non shared rings.Michael Chan
Modify ring memory allocation and MSIX setup to support shared or non shared rings and do the proper mapping. Default is still to use shared rings. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Modify bnxt_get_max_rings() to support shared or non shared rings.Michael Chan
Add logic to calculate how many shared or non shared rings can be supported. Default is to use shared rings. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Re-structure ring indexing and mapping.Michael Chan
In order to support dedicated or shared completion rings, the ring indexing and mapping are re-structured as below: 1. bp->grp_info[] array index is 1:1 with bp->bnapi[] array index and completion ring index. 2. rx rings 0 to n will be mapped to completion rings 0 to n. 3. If tx and rx rings share completion rings, then tx rings 0 to m will be mapped to completion rings 0 to m. 4. If tx and rx rings use dedicated completion rings, then tx rings 0 to m will be mapped to completion rings n + 1 to n + m. 5. Each tx or rx ring will use the corresponding completion ring index for doorbell mapping and MSIX mapping. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Check for NULL rx or tx ring.Michael Chan
Each bnxt_napi structure may no longer be having both an rx ring and a tx ring. Check for a valid ring before using it. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Separate bnxt_{rx|tx}_ring_info structs from bnxt_napi struct.Michael Chan
Currently, an rx and a tx ring are always paired with a completion ring. We want to restructure it so that it is possible to have a dedicated completion ring for tx or rx only. The bnxt hardware uses a completion ring for rx and tx events. The driver has to process the completion ring entries sequentially for the rx and tx events. Using a dedicated completion ring for rx only or tx only has these benefits: 1. A burst of rx packets can cause delay in processing tx events if the completion ring is shared. If tx queue is stopped by BQL, this can cause delay in re-starting the tx queue. 2. A completion ring is sized according to the rx and tx ring size rounded up to the nearest power of 2. When the completion ring is shared, it is sized by adding the rx and tx ring sizes and then rounded to the next power of 2, often with a lot of wasted space. 3. Using dedicated completion ring, we can adjust the tx and rx coalescing parameters independently for rx and tx. The first step is to separate the rx and tx ring structures from the bnxt_napi struct. In this patch, an rx ring and a tx ring will point to the same bnxt_napi struct to share the same completion ring. No change in ring assignment and mapping yet. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04bnxt_en: Refactor bnxt_dbg_dump_states().Michael Chan
By adding 3 separate functions to dump the different ring states. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05xfs: debug mode log record crc error injectionBrian Foster
XFS now uses CRC verification over a limited section of the log to detect torn writes prior to a crash. This is difficult to test directly due to the timing and hardware requirements to cause a short write. Add a mechanism to inject CRC errors into log records to facilitate testing torn write detection during log recovery. This mechanism is dangerous and can result in filesystem corruption. Thus, it is only available in DEBUG mode for testing/development purposes. Set a non-zero value to the following sysfs entry to enable error injection: /sys/fs/xfs/<dev>/log/log_badcrc_factor Once enabled, XFS intentionally writes an invalid CRC to a log record at some random point in the future based on the provided frequency. The filesystem immediately shuts down once the record has been written to the physical log to prevent metadata writeback (e.g., AIL insertion) once the log write completes. This helps reasonably simulate a torn write to the log as the affected record must be safe to discard. The next mount after the intentional shutdown requires log recovery and should detect and recover from the torn write. Note again that this _will_ result in data loss or worse. For testing and development purposes only! Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-05xfs: detect and trim torn writes during log recoveryBrian Foster
Certain types of storage, such as persistent memory, do not provide sector atomicity for writes. This means that if a crash occurs while XFS is writing log records, only part of those records might make it to the storage. This is problematic because log recovery uses the cycle value packed at the top of each log block to locate the head/tail of the log. This can lead to CRC verification failures during log recovery and an unmountable fs for a filesystem that is otherwise consistent. Update log recovery to incorporate log record CRC verification as part of the head/tail discovery process. Once the head is located via the traditional algorithm, run a CRC-only pass over the records up to the head of the log. If CRC verification fails, assume that the records are torn as a matter of policy and trim the head block back to the start of the first bad record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>