summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-06-11powerpc: Don't use -mno-strict-align on clangAnton Blanchard
We added -mno-strict-align in commit f036b3681962 (powerpc: Work around little endian gcc bug) to fix gcc bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57134 Clang doesn't understand it. We need to use a conditional because we can't use the simpler call cc-option here. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain ↵Anton Blanchard
supports it These options are not recognised on LLVM, so use call cc-option to check for support. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc: Only use -mabi=altivec if toolchain supports itAnton Blanchard
The -mabi=altivec option is not recognised on LLVM, so use call cc-option to check for support. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc: Fix duplicate const clang warning in user access codeAnton Blanchard
We see a large number of duplicate const errors in the user access code when building with llvm/clang: include/linux/pagemap.h:576:8: warning: duplicate 'const' declaration specifier [-Wduplicate-decl-specifier] ret = __get_user(c, uaddr); The problem is we are doing const __typeof__(*(ptr)), which will hit the warning if ptr is marked const. Removing const does not seem to have any effect on GCC code generation. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11Merge branch 'broadcom-MDIO-turn-around'David S. Miller
Florian Fainelli says: ==================== net: broadcom MDIO support for broken turn-around These two patches update the GENET and UniMAC MDIO controllers to deal with PHYs that are known to have a broken turn-around bug (e.g: BCM53125 and others) This utilizes the infrastructure that code recently added to do that in 'net-next'. Note that the changes look nearly identical and I will try to address the MDIO code duplication between GENET and UniMAC in a future patch series. Changes in v2: - remove brcmphy.h include in mdio-bcm-unimac.c - use the same comment as with GENET's MDIO read function ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11net: phy: mdio-bcm-unimac: handle broken turn-around for specific PHYsFlorian Fainelli
Some Ethernet PHYs/switches such as Broadcom's BCM53125 have a hardware bug which makes them not release the MDIO line during turn-around time. This gets flagged by the UniMAC MDIO controller as a read failure, and we fail the read transaction. Check the MDIO bus phy_ignore_ta_mask bitmask for the PHY we are reading from and if it is listed in this bitmask, ignore the read failure and proceed with returning the data we read out of the controller. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11net: bcmgenet: handle broken turn-around for specific PHYsFlorian Fainelli
Some Ethernet PHYs/switches such as Broadcom's BCM53125 have a hardware bug which makes them not release the MDIO line during turn-around time. This gets flagged by the GENET MDIO controller as a read failure, and we fail the read transaction. Check the MDIO bus phy_ignore_ta_mask bitmask for the PHY we are reading from and if it is listed in this bitmask, ignore the read failure and proceed with returning the data we read out of the controller. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11net: phy: davicom: add IDs for DM9161B and C variantsGustavo Zacarias
Add PHY IDs for Davicom DM9161B and DM9161C variants. Tested with a DM9161C on a custom Atmel-based SAM9X25 board in RMII mode. The DM9161B uses the same model id with just the LSB bit of the version id changing (which is masked out). For all intents and purposes they're the same as the DM9161A with an added GPSI mode and better fabrication process. Signed-off-by: Gustavo Zacarias <gustavo@zacarias.com.ar> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11Renesas Ethernet AVB PTP clock driverSergei Shtylyov
Ethernet AVB device includes the gPTP timer, so we can implement a PTP clock driver. We're doing that in a separate file, with the main Ethernet driver calling the PTP driver's [de]initialization and interrupt handler functions. Unfortunately, the clock seems tightly coupled with the AVB-DMAC, so when that one leaves the operation mode, we have to unregister the PTP clock... :-( Based on the original patches by Masaru Nagai. Signed-off-by: Masaru Nagai <masaru.nagai.vx@renesas.com> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11Renesas Ethernet AVB driver properSergei Shtylyov
Ethernet AVB includes an Gigabit Ethernet controller (E-MAC) that is basically compatible with SuperH Gigabit Ethernet E-MAC. Ethernet AVB has a dedicated direct memory access controller (AVB-DMAC) that is a new design compared to the SuperH E-DMAC. The AVB-DMAC is compliant with 3 standards formulated for IEEE 802.1BA: IEEE 802.1AS timing and synchronization protocol, IEEE 802.1Qav real- time transfer, and the IEEE 802.1Qat stream reservation protocol. The driver only supports device tree probing, so the binding document is included in this patch. Based on the original patches by Mitsuhiro Kimura. Signed-off-by: Mitsuhiro Kimura <mitsuhiro.kimura.kc@renesas.com> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: add CDG congestion controlKenneth Klette Jonassen
CAIA Delay-Gradient (CDG) is a TCP congestion control that modifies the TCP sender in order to [1]: o Use the delay gradient as a congestion signal. o Back off with an average probability that is independent of the RTT. o Coexist with flows that use loss-based congestion control, i.e., flows that are unresponsive to the delay signal. o Tolerate packet loss unrelated to congestion. (Disabled by default.) Its FreeBSD implementation was presented for the ICCRG in July 2012; slides are available at http://www.ietf.org/proceedings/84/iccrg.html Running the experiment scenarios in [1] suggests that our implementation achieves more goodput compared with FreeBSD 10.0 senders, although it also causes more queueing delay for a given backoff factor. The loss tolerance heuristic is disabled by default due to safety concerns for its use in the Internet [2, p. 45-46]. We use a variant of the Hybrid Slow start algorithm in tcp_cubic to reduce the probability of slow start overshoot. [1] D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using delay gradients." In Networking 2011, pages 328-341. Springer, 2011. [2] K.K. Jonassen. "Implementing CAIA Delay-Gradient in Linux." MSc thesis. Department of Informatics, University of Oslo, 2015. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11tcp: export tcp_enter_cwr()Kenneth Klette Jonassen
Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this function so that CDG can be compiled as a module. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11iommu: Introduce iommu_request_dm_for_dev()Joerg Roedel
This function can be called by an IOMMU driver to request that a device's default domain is direct mapped. Signed-off-by: Joerg Roedel <jroedel@suse.de>
2015-06-10switchdev: fix handling for drivers not supporting IPv4 fib add/del opsScott Feldman
If CONFIG_NET_SWITCHDEV is enabled, but port driver does not implement support for IPv4 FIB add/del ops, don't fail route add/del offload operations. Route adds will not be marked as OFFLOAD. Routes will be installed in the kernel FIB, as usual. This was report/fixed by Florian when testing DSA driver with net-next on devices with L2 offload support but no L3 offload support. What he reported was an initial route installed from DHCP client would fail (route not installed to kernel FIB). This was triggering the setting of ipv4.fib_offload_disabled, which would disable route offloading after the first failure. So subsequent attempts to install the route would succeed. There is follow-on work/discussion to address the handling of route install failures, but for now, let's differentiate between no support and failed support. Reported-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10enic: fix memory leak in rq_cleanGovindarajulu Varadarajan
When incoming packet qualifies for rx_copybreak, we copy the data to newly allocated skb. We do not free/unmap the original buffer. At this point driver assumes this buffer is unallocated. When enic_rq_alloc_buf() is called for buffer allocation, it checks if buf->os_buf is NULL. If its not NULL that means buffer can be re-used. When vnic_rq_clean() is called for freeing all rq buffers, and if the rx_copybreak reused buffer falls outside the used desc, we do not free the buffer. The following trace is observer when dma-debug is enabled. Fix is to walk through complete ring and clean if buffer is present. [ 40.555386] ------------[ cut here ]------------ [ 40.555396] WARNING: CPU: 0 PID: 491 at lib/dma-debug.c:971 dma_debug_device_change+0x188/0x1f0() [ 40.555400] pci 0000:06:00.0: DMA-API: device driver has pending DMA allocations while released from device [count=4] One of leaked entries details: [device address=0x00000000ff4cc040] [size=9018 bytes] [mapped with DMA_FROM_DEVICE] [mapped as single] [ 40.555402] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver coretemp intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw joydev mousedev gf128mul hid_generic glue_helper mgag200 usbhid ttm hid drm_kms_helper drm ablk_helper syscopyarea sysfillrect sysimgblt i2c_algo_bit i2c_core iTCO_wdt cryptd mac_hid evdev pcspkr sb_edac edac_core tpm_tis iTCO_vendor_support ipmi_si wmi tpm ipmi_msghandler shpchp lpc_ich processor acpi_power_meter hwmon button ac sch_fq_codel nfs lockd grace sunrpc fscache sd_mod ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod usb_common enic(-) crc32c_generic crc32c_intel btrfs xor raid6_pq ext4 crc16 mbcache jbd2 [ 40.555467] CPU: 0 PID: 491 Comm: rmmod Not tainted 4.1.0-rc7-ARCH-01305-gf59b71f #118 [ 40.555469] Hardware name: Cisco Systems Inc UCSB-B200-M4/UCSB-B200-M4, BIOS B200M4.2.2.2.23.061220140128 06/12/2014 [ 40.555471] 0000000000000000 00000000e2f8a5b7 ffff880275f8bc48 ffffffff8158d6f0 [ 40.555474] 0000000000000000 ffff880275f8bca0 ffff880275f8bc88 ffffffff8107b04a [ 40.555477] ffff8802734e0000 0000000000000004 ffff8804763fb3c0 ffff88027600b650 [ 40.555480] Call Trace: [ 40.555488] [<ffffffff8158d6f0>] dump_stack+0x4f/0x7b [ 40.555492] [<ffffffff8107b04a>] warn_slowpath_common+0x8a/0xc0 [ 40.555494] [<ffffffff8107b0d5>] warn_slowpath_fmt+0x55/0x70 [ 40.555498] [<ffffffff812fa408>] dma_debug_device_change+0x188/0x1f0 [ 40.555503] [<ffffffff8109aaef>] notifier_call_chain+0x4f/0x80 [ 40.555506] [<ffffffff8109aecb>] __blocking_notifier_call_chain+0x4b/0x70 [ 40.555510] [<ffffffff8109af06>] blocking_notifier_call_chain+0x16/0x20 [ 40.555514] [<ffffffff813f8066>] __device_release_driver+0xf6/0x120 [ 40.555518] [<ffffffff813f8b08>] driver_detach+0xc8/0xd0 [ 40.555523] [<ffffffff813f7c59>] bus_remove_driver+0x59/0xe0 [ 40.555527] [<ffffffff813f93a0>] driver_unregister+0x30/0x70 [ 40.555534] [<ffffffff8131532d>] pci_unregister_driver+0x2d/0xa0 [ 40.555542] [<ffffffffa0200ec2>] enic_cleanup_module+0x10/0x14e [enic] [ 40.555547] [<ffffffff8110158f>] SyS_delete_module+0x1cf/0x280 [ 40.555551] [<ffffffff811e284e>] ? ____fput+0xe/0x10 [ 40.555554] [<ffffffff810980ec>] ? task_work_run+0xbc/0xf0 [ 40.555558] [<ffffffff815930ee>] system_call_fastpath+0x12/0x71 [ 40.555561] ---[ end trace 4988cadc77c2b236 ]--- [ 40.555562] Mapped at: [ 40.555563] [<ffffffff812fa865>] debug_dma_map_page+0x95/0x150 [ 40.555566] [<ffffffffa01f4a88>] enic_rq_alloc_buf+0x1b8/0x360 [enic] [ 40.555570] [<ffffffffa01f7658>] enic_open+0xf8/0x820 [enic] [ 40.555574] [<ffffffff8148d50e>] __dev_open+0xce/0x150 [ 40.555579] [<ffffffff8148d851>] __dev_change_flags+0xa1/0x170 Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10enic: check return value for stat dumpGovindarajulu Varadarajan
We do not check the return value of enic_dev_stats_dump(). If allocation fails, we will hit NULL pointer reference. Return only if memory allocation fails. For other failures, we return the previously recorded values. Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10enic: unlock napi busy poll before unmasking intrGovindarajulu Varadarajan
There is a small window between vnic_intr_unmask() and enic_poll_unlock_napi(). In this window if an irq occurs and napi is scheduled on different cpu, it tries to acquire enic_poll_lock_napi() and hits the following WARN_ON message. Fix is to unlock napi_poll before unmasking the interrupt. [ 781.121746] ------------[ cut here ]------------ [ 781.121789] WARNING: CPU: 1 PID: 0 at drivers/net/ethernet/cisco/enic/vnic_rq.h:228 enic_poll_msix_rq+0x36a/0x3c0 [enic]() [ 781.121834] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 dns_resolver coretemp intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel mgag200 ttm drm_kms_helper joydev aes_x86_64 lrw drm gf128mul mousedev glue_helper sb_edac ablk_helper iTCO_wdt iTCO_vendor_support evdev ipmi_si syscopyarea sysfillrect sysimgblt i2c_algo_bit i2c_core edac_core lpc_ich mac_hid cryptd pcspkr ipmi_msghandler shpchp tpm_tis acpi_power_meter tpm wmi processor hwmon button ac sch_fq_codel nfs lockd grace sunrpc fscache hid_generic usbhid hid ehci_pci ehci_hcd sd_mod megaraid_sas usbcore scsi_mod usb_common enic crc32c_generic crc32c_intel btrfs xor raid6_pq ext4 crc16 mbcache jbd2 [ 781.122176] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.1.0-rc6-ARCH-00040-gc46a024-dirty #106 [ 781.122210] Hardware name: Cisco Systems Inc UCSB-B200-M4/UCSB-B200-M4, BIOS B200M4.2.2.2.23.061220140128 06/12/2014 [ 781.122252] 0000000000000000 bddbbc9d655ec96e ffff880277e43da8 ffffffff81583fe8 [ 781.122286] 0000000000000000 0000000000000000 ffff880277e43de8 ffffffff8107acfa [ 781.122319] ffff880272c01000 ffff880273f18000 ffff880273f1a100 0000000000000000 [ 781.122352] Call Trace: [ 781.122364] <IRQ> [<ffffffff81583fe8>] dump_stack+0x4f/0x7b [ 781.122399] [<ffffffff8107acfa>] warn_slowpath_common+0x8a/0xc0 [ 781.122425] [<ffffffff8107ae2a>] warn_slowpath_null+0x1a/0x20 [ 781.122455] [<ffffffffa01fa9ca>] enic_poll_msix_rq+0x36a/0x3c0 [enic] [ 781.122487] [<ffffffff8148525a>] net_rx_action+0x22a/0x370 [ 781.122512] [<ffffffff8107ed3d>] __do_softirq+0xed/0x2d0 [ 781.122537] [<ffffffff8107f06e>] irq_exit+0x7e/0xa0 [ 781.122560] [<ffffffff8158c424>] do_IRQ+0x64/0x100 [ 781.122582] [<ffffffff8158a42e>] common_interrupt+0x6e/0x6e [ 781.122605] <EOI> [<ffffffff810bd331>] ? cpu_startup_entry+0x121/0x480 [ 781.122638] [<ffffffff810bd2fc>] ? cpu_startup_entry+0xec/0x480 [ 781.122667] [<ffffffff810f2ed3>] ? clockevents_register_device+0x113/0x1f0 [ 781.122698] [<ffffffff81050ab6>] start_secondary+0x196/0x1e0 [ 781.122723] ---[ end trace cec2e9dd3af7b9db ]--- Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10Merge branch 'brcm-pseudo-phy-addr'David S. Miller
Florian Fainelli says: ==================== net: phy: broadcom: define pseudo-PHY address This patch series converts existing in-tree users of the Broadcom pseudo-PHY address (30) used to configure MDIO-connected switches to share a constant in a shared header files. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10net: dsa: bcm_sf2: Utilize BRCM_PSEUDO_PHY_ADDRFlorian Fainelli
Utilize the newly introduced BRCM_PSEUDO_PHY_ADDR constant from brcmphy.h instead of open-coding the Broadcom Ethernet switches pseudo-PHY address (30). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10bgmac: Utilize BRCM_PSEUDO_PHY_ADDRFlorian Fainelli
What BGMAC defines as BGMAC_PHY_NOREGS is in fact the Broadcom Ethernet switches' pseudo-PHY address (30), utilize the newly introduced constant from brcmphy.h Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10b44: Utilize BRCM_PSEUDO_PHY_ADDRFlorian Fainelli
What B44 has been locally using as B44_PHY_ADDR_NO_LOCAL_PHY is in fact the Broadcom Ethernet switches pseudo-PHY address (30). Update the header to use the newly introduced constant and update comments so they are within 80 columns and consistent. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10net: phy: broadcom: define Broadcom pseudo-PHY address in brcmphy.hFlorian Fainelli
Define the pseudo-PHY address (30) which is used by all Broadcom Ethernet switches in a shared header file. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10net: phy: broadcom: include phy.h for brcmphy.hFlorian Fainelli
We utilize inline functions from the PHY library, make sure that we do include phy.h in brcmphy.h in order for the code including brcmphy.h not to have to resolve this inclusion dependency. Fixes: 705314797b8b ("net: phy: broadcom: move shadow 0x1C register accessors to brcmphy.h") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11x86/crash: Allocate enough low memory when crashkernel=highJoerg Roedel
When the crash kernel is loaded above 4GiB in memory, the first kernel allocates only 72MiB of low-memory for the DMA requirements of the second kernel. On systems with many devices this is not enough and causes device driver initialization errors and failed crash dumps. Testing by SUSE and Redhat has shown that 256MiB is a good default value for now and the discussion has lead to this value as well. So set this default value to 256MiB to make sure there is enough memory available for DMA. Signed-off-by: Joerg Roedel <jroedel@suse.de> [ Reflow comment. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-11x86/swiotlb: Try coherent allocations with __GFP_NOWARNJoerg Roedel
When we boot a kdump kernel in high memory, there is by default only 72MB of low memory available. The swiotlb code takes 64MB of it (by default) so that there are only 8MB left to allocate from. On systems with many devices this causes page allocator warnings from dma_generic_alloc_coherent(): systemd-udevd: page allocation failure: order:0, mode:0x280d4 CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G W 3.12.28-4-default #1 Hardware name: HP ProLiant DL980 G7, BIOS P66 07/30/2012 ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90 0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000 0000000000000000 0000000000000000 0000000000000000 0000000000000001 Call Trace: dump_trace+0x7d/0x2d0 show_stack_log_lvl+0x94/0x170 show_stack+0x21/0x50 dump_stack+0x41/0x51 warn_alloc_failed+0xf0/0x160 __alloc_pages_slowpath+0x72f/0x796 __alloc_pages_nodemask+0x1ea/0x210 dma_generic_alloc_coherent+0x96/0x140 x86_swiotlb_alloc_coherent+0x1c/0x50 ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm] ttm_dma_populate+0x3ce/0x640 [ttm] ttm_tt_bind+0x36/0x60 [ttm] ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm] ttm_bo_move_buffer+0x105/0x130 [ttm] ttm_bo_validate+0xc1/0x130 [ttm] ttm_bo_init+0x24b/0x400 [ttm] radeon_bo_create+0x16c/0x200 [radeon] radeon_ring_init+0x11e/0x2b0 [radeon] r100_cp_init+0x123/0x5b0 [radeon] r100_startup+0x194/0x230 [radeon] r100_init+0x223/0x410 [radeon] radeon_device_init+0x6af/0x830 [radeon] radeon_driver_load_kms+0x89/0x180 [radeon] drm_get_pci_dev+0x121/0x2f0 [drm] local_pci_probe+0x39/0x60 pci_device_probe+0xa9/0x120 driver_probe_device+0x9d/0x3d0 __driver_attach+0x8b/0x90 bus_for_each_dev+0x5b/0x90 bus_add_driver+0x1f8/0x2c0 driver_register+0x5b/0xe0 do_one_initcall+0xf2/0x1a0 load_module+0x1207/0x1c70 SYSC_finit_module+0x75/0xa0 system_call_fastpath+0x16/0x1b 0x7fac533d2788 After these warnings the code enters a fall-back path and allocated directly from the swiotlb aperture in the end. So remove these warnings as this is not a fatal error. Signed-off-by: Joerg Roedel <jroedel@suse.de> [ Simplify, reflow comment. ] Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1433500202-25531-3-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-11swiotlb: Warn on allocation failure in swiotlb_alloc_coherent()Joerg Roedel
Print a warning when all allocation tries have been failed and the function is about to return NULL. This prepares for calling the function with __GFP_NOWARN to suppress allocation failure warnings before all fall-backs have failed - which we'll do to improve kdump behavior. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1433500202-25531-2-git-send-email-joro@8bytes.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-10net: tcp: dctcp_update_alpha() fixes.Eric Dumazet
dctcp_alpha can be read by from dctcp_get_info() without synchro, so use WRITE_ONCE() to prevent compiler from using dctcp_alpha as a temporary variable. Also, playing with small dctcp_shift_g (like 1), can expose an overflow with 32bit values shifted 9 times before divide. Use an u64 field to avoid this problem, and perform the divide only if acked_bytes_ecn is not zero. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10net, swap: Remove a warning and clarify why sk_mem_reclaim is required when ↵Mel Gorman
deactivating swap Jeff Layton reported the following; [ 74.232485] ------------[ cut here ]------------ [ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80() [ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio [ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5 [ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d [ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa [ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8 [ 74.249483] Call Trace: [ 74.249872] [<ffffffff8179263d>] dump_stack+0x45/0x57 [ 74.250703] [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0 [ 74.251655] [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20 [ 74.252585] [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80 [ 74.253519] [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc] [ 74.254537] [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc] [ 74.255610] [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs] [ 74.256582] [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80 [ 74.257496] [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0 [ 74.258318] [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140 [ 74.259253] [<ffffffff81798dae>] system_call_fastpath+0x12/0x71 [ 74.260158] ---[ end trace 2530722966429f10 ]--- The warning in question was unnecessary but with Jeff's series the rules are also clearer. This patch removes the warning and updates the comment to explain why sk_mem_reclaim() may still be called. [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon points out that it's not needed.] Cc: Leon Romanovsky <leon@leon.nu> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10Merge tag 'mac80211-next-for-davem-2015-06-10' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== For this round we mostly have fixes: * mesh fixes from Alexis Green and Chun-Yeow Yeoh, * a documentation fix from Jakub Kicinski, * a missing channel release (from Michal Kazior), * a fix for a signal strength reporting bug (from Sara Sharon), * handle deauth while associating (myself), * don't report mangled TX SKB back to userspace for status (myself), * handle aggregation session timeouts properly in fast-xmit (myself) However, there are also a few cleanups and one big change that affects all drivers (and that required me to pull in your tree) to change the mac80211 HW flags to use an unsigned long bitmap so that we can extend them more easily - we're running out of flags even with a cleanup to remove the two unused ones. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10net/unix: support SCM_SECURITY for stream socketsStephen Smalley
SCM_SECURITY was originally only implemented for datagram sockets, not for stream sockets. However, SCM_CREDENTIALS is supported on Unix stream sockets. For consistency, implement Unix stream support for SCM_SECURITY as well. Also clean up the existing code and get rid of the superfluous UNIXSID macro. Motivated by https://bugzilla.redhat.com/show_bug.cgi?id=1224211, where systemd was using SCM_CREDENTIALS and assumed wrongly that SCM_SECURITY was also supported on Unix stream sockets. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10atm: idt77105: Use setup_timerVaishali Thakkar
Use the timer API function setup_timer instead of structure field assignments to initialize a timer. A simplified version of the Coccinelle semantic patch that performs this transformation is as follows: @change@ expression e1, e2, a; @@ -init_timer(&e1); +setup_timer(&e1, a, 0UL); ... when != a = e2 -e1.function = a; Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10Add support of Cavium Liquidio ethernet adaptersRaghu Vatsavayi
Following patch V8 adds support for Cavium Liquidio pci express based 10Gig ethernet adapters. 1) Consolidated all debug macros to either call dev_* or netdev_* macros directly, feedback from previous patch. 2) Changed soft commands to avoid crash when running in interrupt context. 3) Fixed link status not reflecting correct status when NetworkManager is running. Added MODULE_FIRMWARE declarations. Following were the previous patches. Patch V7: 1) Minor comments from v6 release regarding debug statements. 2) Fix for large multicast lists. 3) Fixed lockup issue if port initialization fails. 4) Enabled MSI by default. https://patchwork.ozlabs.org/patch/464441/ Patch V6: 1) Addressed the uint64 vs u64 issue, feedback from previous patch. 2) Consolidated some receive processing routines. 3) Removed link status polling method. https://patchwork.ozlabs.org/patch/459514/ Patch V5: Based on the feedback from earlier patches with regards to consolidation of common functions like device init, register programming for cn66xx and cn68xx devices. https://patchwork.ozlabs.org/patch/438979/ Patch V4: Following were the changes based on the feedback from earlier patch: 1) Added mmiowb while synchronizing queue updates and other hw interactions. 2) Statistics will now be incremented non-atomically per each ring. liquidio_get_stats will add stats of each ring while reporting the total statistics counts. 3) Modified liquidio_ioctl to return proper return codes. 4) Modified device naming to use standard Ethernet naming. 5) Global function names in the driver will have lio_/liquidio_/octeon_ prefix. 6) Ethtool related changes for: Removed redundant stats and jiffies. Use default ethtool handler of link status. Speed setting will make use of ethtool_cmd_speed_set. 7) Added checks for pci_map_* return codes. 8) Check for signals while waiting in interruptible mode https://patchwork.ozlabs.org/patch/435073/ Patch v3: Implemented feedback from previous patch like: Removed NAPI Config and DEBUG config options, added BQL and xmit_more support. https://patchwork.ozlabs.org/patch/422749/ Patch V2: Implemented feedback from previous patch. https://patchwork.ozlabs.org/patch/413539/ First Patch: https://patchwork.ozlabs.org/patch/412946/ Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Robert Richter <Robert.Richter@caviumnetworks.com> Signed-off-by: Aleksey Makarov <Aleksey.Makarov@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-11vfio: powerpc/spapr: Support Dynamic DMA windowsAlexey Kardashevskiy
This adds create/remove window ioctls to create and remove DMA windows. sPAPR defines a Dynamic DMA windows capability which allows para-virtualized guests to create additional DMA windows on a PCI bus. The existing linux kernels use this new window to map the entire guest memory and switch to the direct DMA operations saving time on map/unmap requests which would normally happen in a big amounts. This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows. Up to 2 windows are supported now by the hardware and by this driver. This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional information such as a number of supported windows and maximum number levels of TCE tables. DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature as we still want to support v2 on platforms which cannot do DDW for the sake of TCE acceleration in KVM (coming soon). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr: Register memory and define IOMMU v2Alexey Kardashevskiy
The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would requite additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages. Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration. This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires: 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB. The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work. The accounting is done per the user process. This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs. In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support DDW API), this creates a default DMA window for IODA2 for consistency. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/mmu: Add userspace-to-physical addresses translation cacheAlexey Kardashevskiy
We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership ↵Alexey Kardashevskiy
control Before the IOMMU user (VFIO) would take control over the IOMMU table belonging to a specific IOMMU group. This approach did not allow sharing tables between IOMMU groups attached to the same container. This introduces a new IOMMU ownership flavour when the user can not just control the existing IOMMU table but remove/create tables on demand. If an IOMMU implements take/release_ownership() callbacks, this lets the user have full control over the IOMMU group. When the ownership is taken, the platform code removes all the windows so the caller must create them. Before returning the ownership back to the platform code, VFIO unprograms and removes all the tables it created. This changes IODA2's onwership handler to remove the existing table rather than manipulating with the existing one. From now on, iommu_take_ownership() and iommu_release_ownership() are only called from the vfio_iommu_spapr_tce driver. Old-style ownership is still supported allowing VFIO to run on older P5IOC2 and IODA IO controllers. No change in userspace-visible behaviour is expected. Since it recreates TCE tables on each ownership change, related kernel traces will appear more often. This adds a pnv_pci_ioda2_setup_default_config() which is called when PE is being configured at boot time and when the ownership is passed from VFIO to the platform code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future tableAlexey Kardashevskiy
This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_ioda2_get_table_size() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE releaseAlexey Kardashevskiy
The existing code programmed TVT#0 with some address and then immediately released that memory. This makes use of pnv_pci_ioda2_unset_window() and pnv_pci_ioda2_set_bypass() which do correct resource release and TVT update. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows APIAlexey Kardashevskiy
This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA windows management. create_table() creates a TCE table with specific parameters. it receives iommu_table_group to know nodeid in order to allocate TCE table memory closer to the PHB. The exact format of allocated multi-level table might be also specific to the PHB model (not the case now though). This callback calculated the DMA window offset on a PCI bus from @num and stores it in a just created table. set_window() sets the window at specified TVT index + @num on PHB. unset_window() unsets the window from specified TVT. This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table. create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container. This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others. This makes use of new values in vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not advertise pagemasks to the userspace. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv: Implement multilevel TCE tablesAlexey Kardashevskiy
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels which splits the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages() helpers. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_windowAlexey Kardashevskiy
This is a part of moving DMA window programming to an iommu_ops callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as a first parameter (not pnv_ioda_pe) as it is going to be used as a callback for VFIO DDW code. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Introduce helpers to allocate TCE pagesAlexey Kardashevskiy
This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This moves the code which allocates the actual TCE tables to helpers: pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages(). These do not allocate/free the iommu_table struct. This enforces window size to be a power of two. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Rework iommu_table creationAlexey Kardashevskiy
This moves iommu_table creation to the beginning to make following changes easier to review. This starts using table parameters from the iommu_table struct. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/iommu/powernv: Release replaced TCEAlexey Kardashevskiy
At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address unlike set() which receives linear mapping address; and returns a physical address as the clear() does. This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO. This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg(). This makes sure that TCE permission bits are not set in TCE passed to IOMMU API as those are to be calculated by platform code from DMA direction. This moves SetPageDirty() to the IOMMU code to make it work for both VFIO ioctl interface in in-kernel TCE acceleration (when it becomes available later). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv: Implement accessor to TCE entryAlexey Kardashevskiy
This replaces direct accesses to TCE table with a helper which returns an TCE entry address. This does not make difference now but will when multi-level TCE tables get introduces. No change in behavior is expected. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Add TCE invalidation for all attached groupsAlexey Kardashevskiy
The iommu_table struct keeps a list of IOMMU groups it is used for. At the moment there is just a single group attached but further patches will add TCE table sharing. When sharing is enabled, TCE cache in each PE needs to be invalidated so does the patch. This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan to enable TCE table sharing on PHBs older than IODA2. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/powernv/ioda2: Move TCE kill register address to PEAlexey Kardashevskiy
At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property which contains the TCE kill register address. Writing to this register invalidates TCE cache on IODA/IODA2 hub. This moves the register address from iommu_table to pnv_pnb as this register belongs to PHB and invalidates TCE cache for all tables of all attached PEs. This moves the property reading/remapping code to a helper which is called when DMA is being configured for PE and which does DMA setup for both IODA1 and IODA2. This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which invalidates cache for the entire table. It should be called after every call to opal_pci_map_pe_dma_window(). It was not required before because there was just a single TCE table and 64bit DMA was handled via bypass window (which has no table so no cache was used) but this is going to change with Dynamic DMA windows (DDW). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/iommu: Fix IOMMU ownership control functionsAlexey Kardashevskiy
This adds missing locks in iommu_take_ownership()/ iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears TCE content if there is no page marked busy in it_map. Clearing must be done outside of the table locks as iommu_clear_tce() called from iommu_clear_tces_and_put_pages() does this. In order to use bitmap_empty(), the existing code clears bit#0 which is set even in an empty table if it is bus-mapped at 0 as iommu_init_table() reserves page#0 to prevent buggy drivers from crashing when allocated page is bus-mapped at zero (which is correct). This restores the bit in the case of failure to bring the it_map to the state it was in when we called iommu_take_ownership(). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership controlAlexey Kardashevskiy
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership which call in a loop iommu_take_ownership()/iommu_release_ownership() for every table on the group. As there is just one now, no change in behaviour is expected. At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. The set_bypass() callback is not really an iommu_table function but IOMMU/PE function. This introduces a iommu_table_group_ops struct and adds take_ownership()/release_ownership() callbacks to it which are called when an external user takes/releases control over the IOMMU. This replaces set_bypass() with ownership callbacks as it is not necessarily just bypass enabling, it can be something else/more so let's give it more generic name. The callbacks is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. The following patches will replace iommu_take_ownership/ iommu_release_ownership calls in IODA2 with full IOMMU table release/ create. As we here and touching bypass control, this removes pnv_pci_ioda2_setup_bypass_pe() as it does not do much more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base initialization to pnv_pci_ioda2_setup_dma_pe. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_groupAlexey Kardashevskiy
So far one TCE table could only be used by one IOMMU group. However IODA2 hardware allows programming the same TCE table address to multiple PE allowing sharing tables. This replaces a single pointer to a group in a iommu_table struct with a linked list of groups which provides the way of invalidating TCE cache for every PE when an actual TCE table is updated. This adds pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group() helpers to manage the list. However without VFIO, it is still going to be a single IOMMU group per iommu_table. This changes iommu_add_device() to add a device to a first group from the group list of a table as it is only called from the platform init code or PCI bus notifier and at these moments there is only one group per table. This does not change TCE invalidation code to loop through all attached groups in order to simplify this patch and because it is not really needed in most cases. IODA2 is fixed in a later patch. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>