linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2017-02-21	net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT	Thomas Petazzoni
	Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: remove unused register definitions	Thomas Petazzoni
	Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: simplify mvpp2_bm_bufs_add()	Thomas Petazzoni
	The mvpp2_bm_bufs_add() currently creates a fake cookie by calling mvpp2_bm_cookie_pool_set(), just to be able to call mvpp2_pool_refill(). But all what mvpp2_pool_refill() does is extract the pool ID from the cookie, and call mvpp2_bm_pool_put() with this ID. Instead of doing this convoluted thing, just call mvpp2_bm_pool_put() directly, since we have the BM pool ID. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: drop useless fields in mvpp2_bm_pool and related code	Thomas Petazzoni
	This commit drops dead code from the mvpp2 driver. The 'in_use' and 'in_use_thresh' fields of 'struct mvpp2_bm_pool' are incremented/decremented/initialized in various places. But they are only used in one place: if (is_recycle && (atomic_read(&bm_pool->in_use) < bm_pool->in_use_thresh)) return 0; However 'is_recycle', passed as argument to mvpp2_rx_refill() is always false. So in fact, this code is never reached, and the 'is_recycle' argument is useless. So let's drop this code. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'	Thomas Petazzoni
	This commit remove a field of 'struct mvpp2_tx_queue' that is not used anywhere. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: release reference to txq_cpu[] entry after unmapping	Thomas Petazzoni
	The mvpp2_txq_bufs_free() function is called upon TX completion to DMA unmap TX buffers, and free the corresponding SKBs. It gets the references to the SKB to free and the DMA buffer to unmap from a per-CPU txq_pcpu data structure. However, the code currently increments the pointer to the next entry before doing the DMA unmap and freeing the SKB. It does not cause any visible problem because for a given SKB the TX completion is guaranteed to take place on the CPU where the TX was started. However, it is much more logical to increment the pointer to the next entry once the current entry has been completely unmapped/released. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()	Thomas Petazzoni
	When configuring the MVPP2_ISR_RX_THRESHOLD_REG with the RX coalescing time threshold, we do not check for the maximum allowed value supported by the driver, which means we might overflow and use a bogus value. This commit adds a check for this situation, and if a value higher than what is supported by the hardware is provided, then we use the maximum value supported by the hardware. In order to achieve this in a way that avoids overflow and rounding errors, we introduce two utility functions mvpp2_usec_to_cycles() and cycles_to_usec(). Many thanks to Russell King for suggesting this implementation. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()	Thomas Petazzoni
	Currently, mvpp2_rx_pkts_coal_set() does the following to avoid setting a too large value for the RX coalescing by packet number: val = (pkts & MVPP2_OCCUPIED_THRESH_MASK); This means that if you set a value that is slightly higher the the maximum number of packets, you in fact get a very low value. It makes a lot more sense to simply check if the value is too high, and if it's too high, limit it to the maximum possible value. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set	Thomas Petazzoni
	As noticed by Russell King, the last argument of mvpp2_rx_{pkts,time}_coal_set() is useless, since the packet/time coalescing value is already stored in the 'struct mvpp2_rx_queue *' passed as argument to these functions. So passing the packet/time value as an additional argument, and setting them again in the mvpp2_rx_queue structure is useles. This commit therefore gets rid of this additional argument, assuming the caller has assigned the appropriate value to rxq->pkts_coal or rxq->time_coal before calling the respective functions. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	net: mvpp2: fix DMA address calculation in mvpp2_txq_inc_put()	Thomas Petazzoni
	When TX descriptors are filled in, the buffer DMA address is split between the tx_desc->buf_phys_addr field (high-order bits) and tx_desc->packet_offset field (5 low-order bits). However, when we re-calculate the DMA address from the TX descriptor in mvpp2_txq_inc_put(), we do not take tx_desc->packet_offset into account. This means that when the DMA address is not aligned on a 32 bytes boundary, we end up calling dma_unmap_single() with a DMA address that was not the one returned by dma_map_single(). This inconsistency is detected by the kernel when DMA_API_DEBUG is enabled. We fix this problem by properly calculating the DMA address in mvpp2_txq_inc_put(). Cc: <stable@vger.kernel.org> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	macsec: fix validation failed in asynchronous operation.	Lee Ryder
	MACSec test failed when asynchronous crypto operations is used. It encounters packet validation failed since macsec_skb_cb(skb)->valid is always 'false'. This patch adds missing "macsec_skb_cb(skb)->valid = true" in macsec_decrypt_done() when "err == 0". Signed-off-by: Ryder Lee <ryder.lee@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	batman-adv: Fix transmission of final, 16th fragment	Linus Lüssing
	Trying to split and transmit a unicast packet in 16 parts will fail for the final fragment: After having sent the 15th one with a frag_packet.no index of 14, we will increase the the index to 15 - and return with an error code immediately, even though one more fragment is due for transmission and allowed. Fixing this issue by moving the check before incrementing the index. While at it, adding an unlikely(), because the check is actually more of an assertion. Fixes: ee75ed88879a ("batman-adv: Fragment and send skbs larger than mtu") Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-02-21	batman-adv: Fix double free during fragment merge error	Sven Eckelmann
	The function batadv_frag_skb_buffer was supposed not to consume the skbuff on errors. This was followed in the helper function batadv_frag_insert_packet when the skb would potentially be inserted in the fragment queue. But it could happen that the next helper function batadv_frag_merge_packets would try to merge the fragments and fail. This results in a kfree_skb of all the enqueued fragments (including the just inserted one). batadv_recv_frag_packet would detect the error in batadv_frag_skb_buffer and try to free the skb again. The behavior of batadv_frag_skb_buffer (and its helper batadv_frag_insert_packet) must therefore be changed to always consume the skbuff to have a common behavior and avoid the double kfree_skb. Fixes: 610bfc6bc99b ("batman-adv: Receive fragmented packets and merge") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2017-02-21	net: sock: Use USEC_PER_SEC macro instead of literal 1000000	Gao Feng
	The USEC_PER_SEC is used once in sock_set_timeout as the max value of tv_usec. But there are other similar codes which use the literal 1000000 in this file. It is minor cleanup to keep consitent. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	virtio-net: switch to use build_skb() for small buffer	Jason Wang
	This patch switch to use build_skb() for small buffer which can have better performance for both TCP and XDP (since we can work at page before skb creation). It also remove lots of XDP codes since both mergeable and small buffer use page frag during refill now. Before \| After XDP_DROP(xdp1) 64B : 11.1Mpps \| 14.4Mpps Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	ip: fix IP_CHECKSUM handling	Paolo Abeni
	The skbs processed by ip_cmsg_recv() are not guaranteed to be linear e.g. when sending UDP packets over loopback with MSGMORE. Using csum_partial() on [potentially] the whole skb len is dangerous; instead be on the safe side and use skb_checksum(). Thanks to syzkaller team to detect the issue and provide the reproducer. v1 -> v2: - move the variable declaration in a tighter scope Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	vxlan: remove unused variable saddr in neigh_reduce	Roopa Prabhu
	silences the below warning: drivers/net/vxlan.c: In function ‘neigh_reduce’: drivers/net/vxlan.c:1599:25: warning: variable ‘saddr’ set but not used [-Wunused-but-set-variable] Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	vxlan: add changelink support	Roopa Prabhu
	This patch adds changelink rtnl op support for vxlan netdevs. code changes involve: - refactor vxlan_newlink into vxlan_nl2conf to be used by vxlan_newlink and vxlan_changelink - vxlan_nl2conf and vxlan_dev_configure take a changelink argument to isolate changelink checks and updates. - Allow changing only a few attributes: - return -EOPNOTSUPP for attributes that cannot be changed for now. Incremental patches can make the non-supported one available in the future if needed. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	ARM: ep93xx: ts72xx: allow rtc-m48t86 to manage its own resources	H Hartley Sweeten
	The rtc-m48t86 driver can now handle its own resources and do the read/write operations internally. Pass the necessary resources to the driver and remove the m48t86_ops platform data. Remove the, then unnecessary, static remapping for the registers. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2017-02-21	rtnl: simplify error return path in rtnl_create_link()	Tobias Klauser
	There is only one possible error path which reaches the err label, so return ERR_PTR(-ENOMEM) directly if alloc_netdev_mqs() fails. This also allows to omit the err variable. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-21	rtc: m48t86: verify that the RTC is actually present	H Hartley Sweeten
	The RTC is an optional feature at purchase time on some Technologic Systems boards. Verify that it actually exists by checking if the last two bytes of the NVRAM can be changed. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2017-02-21	arm64: dts: juno: update definition for programmable replicator	Mike Leach
	Juno platforms have a programmable replicator splitting the trace output to TPIU and ETR. Currently this is not being programmed as it is being treated as a none-programmable replicator - which is the default operational mode for these devices. The TPIU in the system is enabled by default, and this combination is causing back-pressure in the trace system resulting in overflows at the source. Replaces the existing definition with one that defines the programmable replicator, using the "qcom,coresight-replicator1x" driver that provides the correct functionality for CoreSight programmable replicators. Reviewed-and-Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
2017-02-21	sunrpc: silence uninitialized variable warning	Dan Carpenter
	kstrtouint() can return a couple different error codes so the check for "ret == -EINVAL" is wrong and static analysis tools correctly complain that we can use "num" without initializing it. It's not super harmful because we check the bounds. But it's also easy enough to fix. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21	Merge tag 'gfs2-4.11.fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Robert Peterson: "We've got eight GFS2 patches for this merge window: - Andy Price submitted a patch to make gfs2_write_full_page a static function. - Dan Carpenter submitted a patch to fix a ERR_PTR thinko. Three patches fix bugs related to deleting very large files, which cause GFS2 to run out of journal space: - The first one prevents GFS2 delete operation from requesting too much journal space. - The second one fixes a problem whereby GFS2 can hang because it wasn't taking journal space demand into its calculations. - The third one wakes up IO waiters when a flush is done to restart processes stuck waiting for journal space to become available. The final three patches are a performance improvement related to spin_lock contention between multiple writers: - The "tr_touched" variable was switched to a flag to be more atomic and eliminate the possibility of some races. - Function meta_lo_add was moved inline with its only caller to make the code more readable and efficient. - Contention on the gfs2_log_lock spinlock was greatly reduced by avoiding the lock altogether in cases where we don't really need it: buffers that already appear in the appropriate metadata list for the journal. Many thanks to Steve Whitehouse for the ideas and principles behind these patches" * tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Make gfs2_write_full_page static GFS2: Reduce contention on gfs2_log_lock GFS2: Inline function meta_lo_add GFS2: Switch tr_touched to flag in transaction GFS2: Wake up io waiters whenever a flush is done GFS2: Made logd daemon take into account log demand GFS2: Limit number of transaction blocks requested for truncates GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
2017-02-21	Merge branch 'topic/zx' into for-linus	Vinod Koul

2017-02-21	Merge branch 'topic/stm32-dma' into for-linus	Vinod Koul

2017-02-21	Merge branch 'topic/ste' into for-linus	Vinod Koul

2017-02-21	Merge branch 'for_linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fixes and cleanups from Jan Kara: "Several small UDF fixes and cleanups and a small cleanup of fanotify code" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: simplify the code of fanotify_merge udf: simplify udf_ioctl() udf: fix ioctl errors udf: allow implicit blocksize specification during mount udf: check partition reference in udf_read_inode() udf: atomically read inode size udf: merge module informations in super.c udf: remove next_epos from udf_update_extent_cache() udf: Factor out trimming of crtime udf: remove empty condition udf: remove unneeded line break udf: merge bh free udf: use pointer for kernel_long_ad argument udf: use __packed instead of __attribute__ ((packed)) udf: Make stat on symlink report symlink length as st_size fs/udf: make #ifdef UDF_PREALLOCATE unconditional fs: udf: Replace CURRENT_TIME with current_time()
2017-02-21	Merge branch 'topic/intel' into for-linus	Vinod Koul

2017-02-21	nfsd: special case truncates some more	Christoph Hellwig
	Both the NFS protocols and the Linux VFS use a setattr operation with a bitmap of attributes to set to set various file attributes including the file size and the uid/gid. The Linux syscalls never mix size updates with unrelated updates like the uid/gid, and some file systems like XFS and GFS2 rely on the fact that truncates don't update random other attributes, and many other file systems handle the case but do not update the other attributes in the same transaction. NFSD on the other hand passes the attributes it gets on the wire more or less directly through to the VFS, leading to updates the file systems don't expect. XFS at least has an assert on the allowed attributes, which caught an unusual NFS client setting the size and group at the same time. To handle this issue properly this splits the notify_change call in nfsd_setattr into two separate ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-21	PCI: exynos: Support the PHY generic framework	Jaehoon Chung
	Switch the pci-exynos driver to generic PHY framework. At the same time backward compatibility is preserved: Warning will be printed for old DTB. Refer to the binding file: - Documentation/devicetree/bindings/pci/samsung,exynos5440-pcie.txt Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Pankaj Dubey <pankaj.dubey@samsung.com> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Jingoo Han <jingoohan1@gmail.com>
2017-02-21	Documentation: binding: Modify the exynos5440 PCIe binding	Jaehoon Chung
	According to using PHY framework, updates the exynos5440-pcie binding. For maintaining backward compatibility, leaves the current dt-binding. (It should be deprecated.) Recommends to use the PHY Framework and "config" property to follow the designware-pcie binding. If you use the old way, can see "missing config reg space" message. Because the getting configuration space address from range is old way. NOTE: When use the "config" property, first name of 'reg-names' must be set to "elbi". Otherwise driver can't maintain the backward capability. Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Pankaj Dubey <pankaj.dubey@samsung.com> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Acked-by: Rob Herring <robh@kernel.org>
2017-02-21	phy: phy-exynos-pcie: Add support for Exynos PCIe PHY	Jaehoon Chung
	Add support for Generic PHY framework about Exynos SoCs. Current Exynos PCIe driver doesn't use the PHY framework, which makes it difficult to upstream the other Exynos variants because of different PHY registers. Move the codes relevant to PHY from Exnyos PCIe driver to PHY Exynos PCIe driver. [bhelgaas: depend on "OF && (ARCH_EXYNOS \|\| COMPILE_TEST)", update copyright year, both per Vivek] Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Reviewed-by: Jingoo Han <jingoohan1@gmail.com> Reviewed-by: Pankaj Dubey <pankaj.dubey@samsung.com> Reviewed-by: Vivek Gautam <vivek.gautam@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-02-21	Documentation: samsung-phy: Add exynos-pcie-phy binding	Jaehoon Chung
	Add the exynos-pcie-phy binding for Exynos PCIe PHY. This is for using generic PHY framework. Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rob Herring <robh@kernel.org>
2017-02-21	gpio: reintroduce devm_get_gpiod_from_child()	Linus Walleij
	We need to keep this API around for the merge window to avoid nasty build problems in the merges. Cc: Lee Jones <lee.jones@linaro.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2017-02-21	Merge branch 'master' of git://blackhole.kfki.hu/nf	Pablo Neira Ayuso
	Jozsef Kadlecsik says: ==================== ipset patches for nf Please apply the next patches for ipset in your nf branch. Both patches should go into the stable kernel branches as well, because these are important bugfixes: * Sometimes valid entries in hash:* types of sets were evicted due to a typo in an index. The wrong evictions happen when entries are deleted from the set and the bucket is shrinked. Bug was reported by Eric Ewanco and the patch fixes netfilter bugzilla id #1119. * Fixing of a null pointer exception when someone wants to add an entry to an empty list type of set and specifies an add before/after option. The fix is from Vishwanath Pai. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-21	netfilter: nfnetlink: remove static declaration from err_list	Liping Zhang
	Otherwise, different subsys will race to access the err_list, with holding the different nfnl_lock(subsys_id). But this will not happen now, since ->call_batch is only implemented by nftables, so the err_list is protected by nfnl_lock(NFNL_SUBSYS_NFTABLES). Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-21	netfilter: nfnetlink_queue: fix NFQA_VLAN_MAX definition	Ken-ichirou MATSUZAWA
	Should be - 1 as in other _MAX definitions. Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-21	x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64	Waiman Long
	It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socket 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patch with patch -------- --------- ---------- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/paravirt: Change vcp_is_preempted() arg type to long	Waiman Long
	The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	KVM: VMX: use correct vmcs_read/write for guest segment selector/base	Chao Peng
	Guest segment selector is 16 bit field and guest segment base is natural width field. Fix two incorrect invocations accordingly. Without this patch, build fails when aggressive inlining is used with ICC. Cc: stable@vger.kernel.org Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/kvm/vmx: Defer TR reload after VM exit	Andy Lutomirski
	Intel's VMX is daft and resets the hidden TSS limit register to 0x67 on VMX reload, and the 0x67 is not configurable. KVM currently reloads TR using the LTR instruction on every exit, but this is quite slow because LTR is serializing. The 0x67 limit is entirely harmless unless ioperm() is in use, so defer the reload until a task using ioperm() is actually running. Here's some poorly done benchmarking using kvm-unit-tests: Before: cpuid 1313 vmcall 1195 mov_from_cr8 11 mov_to_cr8 17 inl_from_pmtimer 6770 inl_from_qemu 6856 inl_from_kernel 2435 outl_to_kernel 1402 After: cpuid 1291 vmcall 1181 mov_from_cr8 11 mov_to_cr8 16 inl_from_pmtimer 6457 inl_from_qemu 6209 inl_from_kernel 2339 outl_to_kernel 1391 Signed-off-by: Andy Lutomirski <luto@kernel.org> [Force-reload TR in invalidate_tss_limit. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss	Andy Lutomirski
	Historically, the entire TSS + io bitmap structure was cacheline aligned, but commit ca241c75037b ("x86: unify tss_struct") changed it (presumably inadvertently) so that the fixed-layout hardware part is cacheline-aligned and the io bitmap is after the padding. This wastes 24 bytes (the hardware part should be 104 bytes, but this pads it to 128 bytes) and, serves no purpose, and causes sizeof(struct x86_hw_tss) to have a confusing value. Drop the pointless alignment. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/kvm/vmx: Simplify segment_base()	Andy Lutomirski
	Use actual pointer types for pointers (instead of unsigned long) and replace hardcoded constants with the appropriate self-documenting macros. The function is still a bit messy, but this seems a lot better than before to me. This is mostly borrowed from a patch by Thomas Garnier. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels	Andy Lutomirski
	It was a bit buggy (it didn't list all segment types that needed 64-bit fixups), but the bug was irrelevant because it wasn't called in any interesting context on 64-bit kernels and was only used for data segents on 32-bit kernels. To avoid confusion, make it explicitly 32-bit only. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/kvm/vmx: Don't fetch the TSS base from the GDT	Andy Lutomirski
	The current CPU's TSS base is a foregone conclusion, so there's no need to parse it out of the segment tables. This should save a couple cycles (as STR is surely microcoded and poorly optimized) but, more importantly, it's a cleanup and it means that segment_base() will never be called on 64-bit kernels. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	x86/asm: Define the kernel TSS limit in a macro	Andy Lutomirski
	Rather than open-coding the kernel TSS limit in set_tss_desc(), make it a real macro near the TSS layout definition. This is purely a cleanup. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21	powerpc/pseries: Advertise Hot Plug Event support to firmware	Michael Roth
	With the inclusion of commit 333f7b76865b ("powerpc/pseries: Implement indexed-count hotplug memory add") and commit 753843471cbb ("powerpc/pseries: Implement indexed-count hotplug memory remove"), we now have complete handling of the RTAS hotplug event format as described by PAPR via ACR "PAPR Changes for Hotplug RTAS Events". This capability is indicated by byte 6, bit 2 (5 in IBM numbering) of architecture option vector 5, and allows for greater control over cpu/memory/pci hot plug/unplug operations. Existing pseries kernels will utilize this capability based on the existence of the /event-sources/hot-plug-events DT property, so we only need to advertise it via CAS and do not need a corresponding FW_FEATURE_* value to test for. Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-21	cxl: fix nested locking hang during EEH hotplug	Andrew Donnellan
	Commit 14a3ae34bfd0 ("cxl: Prevent read/write to AFU config space while AFU not configured") introduced a rwsem to fix an invalid memory access that occurred when someone attempts to access the config space of an AFU on a vPHB whilst the AFU is deconfigured, such as during EEH recovery. It turns out that it's possible to run into a nested locking issue when EEH recovery fails and a full device hotplug is required. cxl_pci_error_detected() deconfigures the AFU, taking a writer lock on configured_rwsem. When EEH recovery fails, the EEH code calls pci_hp_remove_devices() to remove the device, which in turn calls cxl_remove() -> cxl_pci_remove_afu() -> pci_deconfigure_afu(), which tries to grab the writer lock that's already held. Standard rwsem semantics don't express what we really want to do here and don't allow for nested locking. Fix this by replacing the rwsem with an atomic_t which we can control more finely. Allow the AFU to be locked multiple times so long as there are no readers. Fixes: 14a3ae34bfd0 ("cxl: Prevent read/write to AFU config space while AFU not configured") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-02-21	Merge tag 'perf-core-for-mingo-4.11-20170220' of ↵	Ingo Molnar
	git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: New features: - Make -a/--all-cpus be the default target in 'perf record' and 'perf stat', just like it is with 'perf trace' (Jiri Olsa) - Introduce -q/--quiet to the 'annotate', 'diff' and 'report', fix up its behaviour in 'record'. This makes the output more compact by elliminating headers, leaving just the histogram lines (Namhyung Kim) Fixes: - Handle offline/absent CPUs (Jan Stancek) Infrastructure changes: - Filter out -specs=/a/b/c from CC options when building the python support, allowing that feature to be built with clang (Arnaldo Carvalho de Melo) - Fix DEBUG=1 build with clang (Arnaldo Carvalho de Melo) Trivial changes: - Fix spelling of 'preempt' in a libtraceevent function name (Steven Rostedt) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>