summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-08-04drm/i915/gvt: Initialize MMIO Block with HW stateTina Zhang
MMIO block with tracked mmio, is introduced for the sake of performance of searching tracked mmio. All the tracked mmio needs to get the initial value from the HW state during vGPU being created. This patch is to initialize the tracked registers in MMIO block with the HW state. v2: Add "Fixes:" line for this patch (Zhenyu) Fixes: 65f9f6febf12 ("drm/i915/gvt: Optimize MMIO register handling for some large MMIO blocks") Signed-off-by: Tina Zhang <tina.zhang@intel.com> Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2017-08-04drm/rockchip: vop: report error when check resource errorMark yao
The user would be confused while facing a error commit without any error report. Signed-off-by: Mark Yao <mark.yao@rock-chips.com> Reviewed-by: Sandy huang <sandy.huang@rock-chips.com> Link: https://patchwork.freedesktop.org/patch/msgid/1501494596-7090-1-git-send-email-mark.yao@rock-chips.com
2017-08-04drm/rockchip: vop: round_up pitches to word alignMark yao
VOP pitch register is word align, need align to word. VOP_WIN0_VIR: bit[31:16] win0_vir_stride_uv Number of words of Win0 uv Virtual width bit[15:0] win0_vir_width Number of words of Win0 yrgb Virtual width ARGB888 : win0_vir_width RGB888 : (win0_vir_width*3/4) + (win0_vir_width%3) RGB565 : ceil(win0_vir_width/2) YUV : ceil(win0_vir_width/4) Signed-off-by: Mark Yao <mark.yao@rock-chips.com> Reviewed-by: Sandy huang <sandy.huang@rock-chips.com> Link: https://patchwork.freedesktop.org/patch/msgid/1501494591-7034-1-git-send-email-mark.yao@rock-chips.com
2017-08-04drm/rockchip: vop: fix NV12 video display errorMark yao
fixup the scale calculation formula on the case src_height == (dst_height/2). Signed-off-by: Mark Yao <mark.yao@rock-chips.com> Reviewed-by: Sandy huang <sandy.huang@rock-chips.com> Link: https://patchwork.freedesktop.org/patch/msgid/1501494586-6984-1-git-send-email-mark.yao@rock-chips.com
2017-08-04drm/rockchip: vop: fix iommu page fault when resumeMark yao
Iommu would get page fault with following path: vop_disable: 1, disable all windows and set vop config done 2, vop enter to standy, all windows not works, but their registers are not clean, when you read window's enable bit, may found the window is enable. vop_enable: 1, memcpy(vop->regsbak, vop->regs, len) save current vop registers to vop->regsbak, then you can found window is enable on regsbak. 2, VOP_WIN_SET(vop, win, gate, 1); force enable window gate, but gate and enable are on same hardware register, then window enable bit rewrite to vop hardware. 3, vop power on, and vop might try to scan destroyed buffer, then iommu get page fault. Move windows disable after vop regsbak restore, then vop regsbak mechanism would keep tracing the modify, everything would be safe. Signed-off-by: Mark Yao <mark.yao@rock-chips.com> Reviewed-by: Sandy huang <sandy.huang@rock-chips.com> Link: https://patchwork.freedesktop.org/patch/msgid/1501494582-6934-1-git-send-email-mark.yao@rock-chips.com
2017-08-04powerpc/64: Fix __check_irq_replay missing decrementer interruptNicholas Piggin
If the decrementer wraps again and de-asserts the decrementer exception while hard-disabled, __check_irq_replay() has a test to notice the wrap when interrupts are re-enabled. The decrementer check must be done when clearing the PACA_IRQ_HARD_DIS flag, not when the PACA_IRQ_DEC flag is tested. Previously this worked because the decrementer interrupt was always the first one checked after clearing the hard disable flag, but HMI check was moved ahead of that, which introduced this bug. This can cause a missed decrementer interrupt if we soft-disable interrupts then take an HMI which is recorded in irq_happened, then hard-disable interrupts for > 4s to wrap the decrementer. Fixes: e0e0d6b7390b ("powerpc/64: Replay hypervisor maintenance interrupt first") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-04powerpc/perf: POWER9 PMU stops after idle workaroundNicholas Piggin
POWER9 DD2 PMU can stop after a state-loss idle in some conditions. A solution is to set then clear MMCRA[60] after wake from state-loss idle. MMCRA[60] is a non-architected bit, see the user manual for details. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-04Merge branch 'drm-fixes-4.13' of git://people.freedesktop.org/~agd5f/linux ↵Dave Airlie
into drm-fixes Just a few small fixes for 4.13. * 'drm-fixes-4.13' of git://people.freedesktop.org/~agd5f/linux: drm/amdgpu: Use list_del_init in amdgpu_mn_unregister drm/amdgpu: Fix undue fallthroughs in golden registers initialization drm/amdgpu: fix header on gfx9 clear state
2017-08-03Merge branch 'tcp-xmit-timer-rearming'David S. Miller
Neal Cardwell says: ==================== tcp: fix xmit timer rearming to avoid stalls This patch series is a bug fix for a TCP loss recovery performance bug reported independently in recent netdev threads: (i) July 26, 2017: netdev thread "TCP fast retransmit issues" (ii) July 26, 2017: netdev thread: "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission" Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports, traces, and packetdrill test cases, which enabled us to root-cause this issue and verify the fix. - v1 -> v2: - In patch 2/3, changed an unclear comment in the pre-existing code in tcp_schedule_loss_probe() to be more clear (thanks to Eric Dumazet for suggesting we improve this). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03tcp: fix xmit timer to only be reset if data ACKed/SACKedNeal Cardwell
Fix a TCP loss recovery performance bug raised recently on the netdev list, in two threads: (i) July 26, 2017: netdev thread "TCP fast retransmit issues" (ii) July 26, 2017: netdev thread: "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission" The basic problem is that incoming TCP packets that did not indicate forward progress could cause the xmit timer (TLP or RTO) to be rearmed and pushed back in time. In certain corner cases this could result in the following problems noted in these threads: - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes could cause TCP to repeatedly schedule TLPs forever. We kept sending TLPs after every ~200ms, which elicited bogus SACKs, which caused more TLPs, ad infinitum; we never fired an RTO to fill in the holes. - Incoming data segments could, in some cases, cause us to reschedule our RTO or TLP timer further out in time, for no good reason. This could cause repeated inbound data to result in stalls in outbound data, in the presence of packet loss. This commit fixes these bugs by changing the TLP and RTO ACK processing to: (a) Only reschedule the xmit timer once per ACK. (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the ACK indicates sufficient forward progress (a packet was cumulatively ACKed, or we got a SACK for a packet that was sent before the most recent retransmit of the write queue head). This brings us back into closer compliance with the RFCs, since, as the comment for tcp_rearm_rto() notes, we should only restart the RTO timer after forward progress on the connection. Previously we were restarting the xmit timer even in these cases where there was no forward progress. As a side benefit, this commit simplifies and speeds up the TCP timer arming logic. We had been calling inet_csk_reset_xmit_timer() three times on normal ACKs that cumulatively acknowledged some data: 1) Once near the top of tcp_ack() to switch from TLP timer to RTO: if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) tcp_rearm_rto(sk); 2) Once in tcp_clean_rtx_queue(), to update the RTO: if (flag & FLAG_ACKED) { tcp_rearm_rto(sk); 3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO to TLP: if (icsk->icsk_pending == ICSK_TIME_RETRANS) tcp_schedule_loss_probe(sk); This commit, by only rescheduling the xmit timer once per ACK, simplifies the code and reduces CPU overhead. This commit was tested in an A/B test with Google web server traffic. SNMP stats and request latency metrics were within noise levels, substantiating that for normal web traffic patterns this is a rare issue. This commit was also tested with packetdrill tests to verify that it fixes the timer behavior in the corner cases discussed in the netdev threads mentioned above. This patch is a bug fix patch intended to be queued for -stable relases. Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") Reported-by: Klavs Klavsen <kl@vsen.dk> Reported-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03tcp: enable xmit timer fix by having TLP use time when RTO should fireNeal Cardwell
Have tcp_schedule_loss_probe() base the TLP scheduling decision based on when the RTO *should* fire. This is to enable the upcoming xmit timer fix in this series, where tcp_schedule_loss_probe() cannot assume that the last timer installed was an RTO timer (because we are no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every ACK). So tcp_schedule_loss_probe() must independently figure out when an RTO would want to fire. In the new TLP implementation following in this series, we cannot assume that icsk_timeout was set based on an RTO; after processing a cumulative ACK the icsk_timeout we see can be from a previous TLP or RTO. So we need to independently recalculate the RTO time (instead of reading it out of icsk_timeout). Removing this dependency on the nature of icsk_timeout makes things a little easier to reason about anyway. Note that the old and new code should be equivalent, since they are both saying: "if the RTO is in the future, but at an earlier time than the normal TLP time, then set the TLP timer to fire when the RTO would have fired". Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03tcp: introduce tcp_rto_delta_us() helper for xmit timer fixNeal Cardwell
Pure refactor. This helper will be required in the xmit timer fix later in the patch series. (Because the TLP logic will want to make this calculation.) Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03Merge tag 'vfio-v4.13-rc4' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO fixes from Alex Williamson: - SPAPR/EEH config build fix (Murilo Opsfelder Araujo) - Fix possible device lock deadlock (Alex Williamson) - Correctly size integrated endpoint PCIe capabilities (Alex Williamson) * tag 'vfio-v4.13-rc4' of git://github.com/awilliam/linux-vfio: vfio/pci: Fix handling of RC integrated endpoint PCIe capability size vfio/pci: Use pci_try_reset_function() on initial open include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH
2017-08-03ipv6: set rt6i_protocol properly in the route when it is installedXin Long
After commit c2ed1880fd61 ("net: ipv6: check route protocol when deleting routes"), ipv6 route checks rt protocol when trying to remove a rt entry. It introduced a side effect causing 'ip -6 route flush cache' not to work well. When flushing caches with iproute, all route caches get dumped from kernel then removed one by one by sending DELROUTE requests to kernel for each cache. The thing is iproute sends the request with the cache whose proto is set with RTPROT_REDIRECT by rt6_fill_node() when kernel dumps it. But in kernel the rt_cache protocol is still 0, which causes the cache not to be matched and removed. So the real reason is rt6i_protocol in the route is not set when it is allocated. As David Ahern's suggestion, this patch is to set rt6i_protocol properly in the route when it is installed and remove the codes setting rtm_protocol according to rt6i_flags in rt6_fill_node. This is also an improvement to keep rt6i_protocol consistent with rtm_protocol. Fixes: c2ed1880fd61 ("net: ipv6: check route protocol when deleting routes") Reported-by: Jianlin Shi <jishi@redhat.com> Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "15 fixes" [ This does not merge the "fortify: use WARN instead of BUG for now" patch, which needs a bit of extra work to build cleanly with all configurations. Arnd is on it. - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: ocfs2: don't clear SGID when inheriting ACLs mm: allow page_cache_get_speculative in interrupt context userfaultfd: non-cooperative: flush event_wqh at release time ipc: add missing container_of()s for randstruct cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() userfaultfd_zeropage: return -ENOSPC in case mm has gone mm: take memory hotplug lock within numa_zonelist_order_handler() mm/page_io.c: fix oops during block io poll in swapin path zram: do not free pool->size_class kthread: fix documentation build warning kasan: avoid -Wmaybe-uninitialized warning userfaultfd: non-cooperative: notify about unmap of destination during mremap mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries pid: kill pidhash_size in pidhash_init() mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
2017-08-03Merge tag 'acpi-4.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These fix two issues in the ACPI SoC drivers (Intel LPSS and AMD APD), a crash in the PCC mailbox initialization code and a WDAT watchdog initialization failure. Specifics: - Fix a device ID of Hisilicon Hip07/08 in the ACPI APD (AMD SoC) driver (Hanjun Guo). - Fix list corruption (introduced during the 4.11 cycle) in the ACPI LPSS (Intel SoC) driver (Hans de Goede). - Fix PCC mailbox handling code crash during initialization when PCCT is not present and PCC channel 0 is requested (Hoan Tran). - Fix a WDAT watchdog initialization issue causing platform device creation to fail due to partially overlapping address ranges in resources (Ryan Kennedy)" * tag 'acpi-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: APD: Fix HID for Hisilicon Hip07/08 mailbox: pcc: Fix crash when request PCC channel 0 ACPI / watchdog: Fix init failure with overlapping register regions ACPI / LPSS: Only call pwm_add_table() for the first PWM controller
2017-08-03Merge tag 'pm-4.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two cpufreq issues, one introduced recently and one related to recent changes, fix cpufreq documentation, fix up recently added code in the Thunderbolt driver and update runtime PM framework documentation. Specifics: - Fix the handling of the scaling_cur_freq cpufreq policy attribute on x86 systems with the MPERF/APERF registers present to make it behave more as expected after recent changes (Rafael Wysocki). - Drop a leftover callback from the intel_pstate driver which also prevents the cpuinfo_cur_freq cpufreq policy attribute from being incorrectly exposed when intel_pstate works in the active mode (Rafael Wysocki). - Add a missing piece describing the cpuinfo_cur_freq policy attribute to cpufreq documentation (Rafael Wysocki). - Fix up a recently added part of the Thunderbolt driver to avoid aborting system suspends if its mailbox commands time out (Rafael Wysocki). - Update device runtime PM framework documentation to reflect the current behavior of the code (Johan Hovold)" * tag 'pm-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thunderbolt: icm: Ignore mailbox errors in icm_suspend() cpufreq: x86: Make scaling_cur_freq behave more as expected PM / runtime: Document new pm_runtime_set_suspended() constraint cpufreq: docs: Add missing cpuinfo_cur_freq description cpufreq: intel_pstate: Drop ->get from intel_pstate structure
2017-08-03Merge branches 'acpi-soc', 'acpi-wdat' and 'acpi-cppc'Rafael J. Wysocki
* acpi-soc: ACPI: APD: Fix HID for Hisilicon Hip07/08 ACPI / LPSS: Only call pwm_add_table() for the first PWM controller * acpi-wdat: ACPI / watchdog: Fix init failure with overlapping register regions * acpi-cppc: mailbox: pcc: Fix crash when request PCC channel 0
2017-08-03Merge branches 'pm-core' and 'pm-misc'Rafael J. Wysocki
* pm-core: PM / runtime: Document new pm_runtime_set_suspended() constraint * pm-misc: thunderbolt: icm: Ignore mailbox errors in icm_suspend()
2017-08-03Merge branches 'pm-cpufreq-x86', 'pm-cpufreq-docs' and 'intel_pstate'Rafael J. Wysocki
* pm-cpufreq-x86: cpufreq: x86: Make scaling_cur_freq behave more as expected * pm-cpufreq-docs: cpufreq: docs: Add missing cpuinfo_cur_freq description * intel_pstate: cpufreq: intel_pstate: Drop ->get from intel_pstate structure
2017-08-03Merge tag 'usb-serial-4.13-rc4' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus Johan writes: USB-serial fixes for v4.13-rc4 Here are some new device ids for v4.13-rc4. All have been in linux-next with no reported issues. Signed-off-by: Johan Hovold <johan@kernel.org>
2017-08-03net: fix keepalive code vs TCP_FASTOPEN_CONNECTEric Dumazet
syzkaller was able to trigger a divide by 0 in TCP stack [1] Issue here is that keepalive timer needs to be updated to not attempt to send a probe if the connection setup was deferred using TCP_FASTOPEN_CONNECT socket option added in linux-4.11 [1] divide error: 0000 [#1] SMP CPU: 18 PID: 0 Comm: swapper/18 Not tainted task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000 RIP: 0010:[<ffffffff8409cc0d>] [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160 Call Trace: <IRQ> [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20 [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0 [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160 [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230 [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0 [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280 [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257 [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0 [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80 [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90 <EOI> [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0 [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0 Tested: Following packetdrill no longer crashes the kernel `echo 0 >/proc/sys/net/ipv4/tcp_timestamps` // Cache warmup: send a Fast Open cookie request 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress) +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> +0 > . 1:1(0) ack 1 +0 close(3) = 0 +0 > F. 1:1(0) ack 1 +0 < F. 1:1(0) ack 2 win 92 +0 > . 2:2(0) ack 2 +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 +.01 connect(4, ..., ...) = 0 +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0 +10 close(4) = 0 `echo 1 >/proc/sys/net/ipv4/tcp_timestamps` Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03Merge tag 'batadv-net-for-davem-20170802' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - fix TT sync flag inconsistency problems, which can lead to excess packets, by Linus Luessing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03Merge tag 'fixes-for-v4.13-rc4' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus Felipe writes: usb: fixes for v4.13-rc4 Another fix for isochronous transfers on dwc3. This time around, we're making sure that we use correct PIDs in all transfer sizes. MSM PHY driver got a fix for the use of devm_regulator_bulk_get() API which will avoid kernel crashes. Renesas DRD driver got 3 fixes: a fix on giveback, a fix for proper controller programming and the removal of set-but-never-used variable.
2017-08-03Merge tag 'kvm-arm-for-v4.13-rc4' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm KVM/ARM Fixes for v4.13-rc4 - Yet another race with VM destruction plugged - A set of small vgic fixes
2017-08-03fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dioAshish Samant
Commit 8fba54aebbdf ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes the ITER_BVEC page deadlock for direct io in fuse by checking in fuse_direct_io(), whether the page is a bvec page or not, before locking it. However, this check is missed when the "async_dio" mount option is enabled. In this case, set_page_dirty_lock() is called from the req->end callback in request_end(), when the fuse thread is returning from userspace to respond to the read request. This will cause the same deadlock because the bvec condition is not checked in this path. Here is the stack of the deadlocked thread, while returning from userspace: [13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds. [13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [13706.658788] glusterfs D ffffffff816c80f0 0 3006 1 0x00000080 [13706.658797] ffff8800d6713a58 0000000000000086 ffff8800d9ad7000 ffff8800d9ad5400 [13706.658799] ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0 7fffffffffffffff [13706.658801] 0000000000000002 ffffffff816c80f0 ffff8800d6713a78 ffffffff816c790e [13706.658803] Call Trace: [13706.658809] [<ffffffff816c80f0>] ? bit_wait_io_timeout+0x80/0x80 [13706.658811] [<ffffffff816c790e>] schedule+0x3e/0x90 [13706.658813] [<ffffffff816ca7e5>] schedule_timeout+0x1b5/0x210 [13706.658816] [<ffffffff81073ffb>] ? gup_pud_range+0x1db/0x1f0 [13706.658817] [<ffffffff810668fe>] ? kvm_clock_read+0x1e/0x20 [13706.658819] [<ffffffff81066909>] ? kvm_clock_get_cycles+0x9/0x10 [13706.658822] [<ffffffff810f5792>] ? ktime_get+0x52/0xc0 [13706.658824] [<ffffffff816c6f04>] io_schedule_timeout+0xa4/0x110 [13706.658826] [<ffffffff816c8126>] bit_wait_io+0x36/0x50 [13706.658828] [<ffffffff816c7d06>] __wait_on_bit_lock+0x76/0xb0 [13706.658831] [<ffffffffa0545636>] ? lock_request+0x46/0x70 [fuse] [13706.658834] [<ffffffff8118800a>] __lock_page+0xaa/0xb0 [13706.658836] [<ffffffff810c8500>] ? wake_atomic_t_function+0x40/0x40 [13706.658838] [<ffffffff81194d08>] set_page_dirty_lock+0x58/0x60 [13706.658841] [<ffffffffa054d968>] fuse_release_user_pages+0x58/0x70 [fuse] [13706.658844] [<ffffffffa0551430>] ? fuse_aio_complete+0x190/0x190 [fuse] [13706.658847] [<ffffffffa0551459>] fuse_aio_complete_req+0x29/0x90 [fuse] [13706.658849] [<ffffffffa05471e9>] request_end+0xd9/0x190 [fuse] [13706.658852] [<ffffffffa0549126>] fuse_dev_do_write+0x336/0x490 [fuse] [13706.658854] [<ffffffffa054963e>] fuse_dev_write+0x6e/0xa0 [fuse] [13706.658857] [<ffffffff812a9ef3>] ? security_file_permission+0x23/0x90 [13706.658859] [<ffffffff81205300>] do_iter_readv_writev+0x60/0x90 [13706.658862] [<ffffffffa05495d0>] ? fuse_dev_splice_write+0x350/0x350 [fuse] [13706.658863] [<ffffffff812062a1>] do_readv_writev+0x171/0x1f0 [13706.658866] [<ffffffff810b3d00>] ? try_to_wake_up+0x210/0x210 [13706.658868] [<ffffffff81206361>] vfs_writev+0x41/0x50 [13706.658870] [<ffffffff81206496>] SyS_writev+0x56/0xf0 [13706.658872] [<ffffffff810257a1>] ? syscall_trace_leave+0xf1/0x160 [13706.658874] [<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71 Fix this by making should_dirty a fuse_io_priv parameter that can be checked in fuse_aio_complete_req(). Reported-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-08-03KVM: arm/arm64: vgic: Use READ_ONCE fo cmpxchgChristoffer Dall
There is a small chance that the compiler could generate separate loads for the dist->propbaser which could be modified from another CPU. As we want to make sure we atomically update the entire value, and don't race with other updates, guarantee that the cmpxchg operation compares against the original value. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2017-08-03KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"Wanpeng Li
------------[ cut here ]------------ WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel] CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7 RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel] Call Trace: vmx_check_nested_events+0x131/0x1f0 [kvm_intel] ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm] ? vmx_vcpu_load+0x1be/0x220 [kvm_intel] ? kvm_arch_vcpu_load+0x62/0x230 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x8f/0x750 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which means that tells the kernel to not make use of any IOAPICs that may be present in the system. Actually external_intr variable in nested_vmx_vmexit() is the req_int_win variable passed from vcpu_enter_guest() which means that the L0's userspace requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) && L0's userspace reqeusts an irq window) is true, so there is no interrupt which L1 requires to inject to L2, we should not attempt to emualte "Acknowledge interrupt on exit" for the irq window requirement in this scenario. This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit" if there is no L1 requirement to inject an interrupt to L2. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Added code comment to make it obvious that the behavior is not correct. We should do a userspace exit with open interrupt window instead of the nested VM exit. This patch still improves the behavior, so it was accepted as a (temporary) workaround.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-03usb: renesas_usbhs: gadget: fix unused-but-set-variable warningYoshihiro Shimoda
The commit b8b9c974afee ("usb: renesas_usbhs: gadget: disable all eps when the driver stops") causes the unused-but-set-variable warning. But, if the usbhsg_ep_disable() will return non-zero value, udc/core.c doesn't clear the ep->enabled flag. So, this driver should not return non-zero value, if the pipe is zero because this means the pipe is already disabled. Otherwise, the ep->enabled flag is never cleared when the usbhsg_ep_disable() is called by the renesas_usbhs driver first. Fixes: b8b9c974afee ("usb: renesas_usbhs: gadget: disable all eps when the driver stops") Fixes: 11432050f070 ("usb: renesas_usbhs: gadget: fix NULL pointer dereference in ep_disable()") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2017-08-03usb: renesas_usbhs: Fix UGCTRL2 value for R-Car Gen3Yoshihiro Shimoda
The latest HW manual (Rev.0.55) shows us this UGCTRL2.VBUSSEL bit. If the bit sets to 1, the VBUS drive is controlled by phy related registers (called "UCOM Registers" on the manual). Since R-Car Gen3 environment will control VBUS by phy-rcar-gen3-usb2 driver, the UGCTRL2.VBUSSEL bit should be set to 1. So, this patch fixes the register's value. Otherwise, even if the ID pin indicates to peripheral, the R-Car will output USBn_PWEN to 1 when a host driver is running. Fixes: de18757e272d ("usb: renesas_usbhs: add R-Car Gen3 power control" Cc: <stable@vger.kernel.org> # v4.6+ Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2017-08-03usb: phy: phy-msm-usb: Fix usage of devm_regulator_bulk_get()Rajendra Nayak
The regulator_bulk_data pointer passed to devm_regulator_bulk_get() is used to store the client handles for the regulators, which is later used by devm_regulator_bulk_release() to free the regulators. Passing a local array as is done here means the memory used to store the handles is freed causing the handles to be corrupted, resulting in a crash when devm_regulator_bulk_release() tries to free them. Fix this my moving the array inside of the msm_otg structure. Signed-off-by: Rajendra Nayak <rnayak@codeaurora.org> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2017-08-03usb: gadget: udc: renesas_usb3: Fix usb_gadget_giveback_request() callingYoshihiro Shimoda
According to the gadget.h, a "complete" function will always be called with interrupts disabled. However, sometimes usb3_request_done() function is called with interrupts enabled. So, this function should be held by spin_lock_irqsave() to disable interruption. Also, this driver has to call spin_unlock() to avoid spinlock recursion by this driver before calling usb_gadget_giveback_request(). Reported-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com> Tested-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com> Fixes: 746bfe63bba3 ("usb: gadget: renesas_usb3: add support for Renesas USB3.0 peripheral controller") Cc: <stable@vger.kernel.org> # v4.5+ Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2017-08-03usb: dwc3: gadget: Correct ISOC DATA PIDs for short packetsManu Gautam
The PIDs for Isochronous data transfers are incorrect for high bandwidth IN endpoints when the request length is less than EP wMaxPacketSize. As per spec correct PIDs for ISOC data transfers are: 1) For request length <= maxpacket - DATA0, 2) For maxpacket < length <= (2 * maxpacket) - DATA1, DATA0 3) For (2 * maxpacket) < length <= (3 * maxpacket) - DATA2, DATA1, DATA0. But driver always sets PCM fields based on wMaxPacketSize due to which DATA2 happens even for small requests. Fix this by setting the PCM field of trb->size depending on request length rather than fixing it to the value depending on wMaxPacketSize. Ideally it shouldn't give any issues as dwc3 will send 0-length packet for next IN token if host sends (even after receiving a short packet). Windows seems to ignore this but with MacOS frame loss observed when using f_uvc. Signed-off-by: Manu Gautam <mgautam@codeaurora.org> Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2017-08-03mmc: block: bypass the queue even if usage is present for hotplugShawn Lin
The commit 304419d8a7e9 ("mmc: core: Allocate per-request data using the block layer core") refactored mechanism of queue handling caused mmc_init_request() can be called just after mmc_cleanup_queue() caused null pointer dereference. Another commit bbdc74dc19e0 ("mmc: block: Prevent new req entering queue after its cleanup") tried to fix the problem. However it actually miss one corner case. We could still reproduce the issue mentioned with these steps: (1) insert a SD card and mount it (2) hotplug it, so it will leave md->usage still be counted (3) reboot the system which will sync data and umount the card [Unable to handle kernel NULL pointer dereference at virtual address 00000000 [user pgtable: 4k pages, 48-bit VAs, pgd = ffff80007bab3000 [[0000000000000000] *pgd=000000007a828003, *pud=0000000078dce003, *pmd=000000007aab6003, *pte=0000000000000000 [Internal error: Oops: 96000007 [#1] PREEMPT SMP [Modules linked in: [CPU: 3 PID: 3507 Comm: umount Tainted: G W 4.13.0-rc1-next-20170720-00012-g9d9bf45 #33 [Hardware name: Firefly-RK3399 Board (DT) [task: ffff80007a1de200 task.stack: ffff80007a01c000 [PC is at mmc_init_request+0x14/0xc4 [LR is at alloc_request_size+0x4c/0x74 [pc : [<ffff0000087d7150>] lr : [<ffff000008378fe0>] pstate: 600001c5 [sp : ffff80007a01f8f0 .... [[<ffff0000087d7150>] mmc_init_request+0x14/0xc4 [[<ffff000008378fe0>] alloc_request_size+0x4c/0x74 [[<ffff00000817ac28>] mempool_create_node+0xb8/0x17c [[<ffff00000837aadc>] blk_init_rl+0x9c/0x120 [[<ffff000008396580>] blkg_alloc+0x110/0x234 [[<ffff000008396ac8>] blkg_create+0x424/0x468 [[<ffff00000839877c>] blkg_lookup_create+0xd8/0x14c [[<ffff0000083796bc>] generic_make_request_checks+0x368/0x3b0 [[<ffff00000837b050>] generic_make_request+0x1c/0x240 So mmc_blk_put wouldn't calling blk_cleanup_queue which actually the QUEUE_FLAG_DYING and QUEUE_FLAG_BYPASS should stay. Block core expect blk_queue_bypass_{start, end} internally to bypass/drain the queue before actually dying the queue, so it didn't expose API to set the queue bypass. I think we should set QUEUE_FLAG_BYPASS whenever queue is removed, although the md->usage is still counted, as no dispatch queue could be found then. Fixes: 304419d8a7e9 ("mmc: core: Allocate per-request data using the block layer core") Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-08-03mmc: sdhci-of-at91: force card detect value for non removable devicesLudovic Desroches
When the device is non removable, the card detect signal is often used for another purpose i.e. muxed to another SoC peripheral or used as a GPIO. It could lead to wrong behaviors depending the default value of this signal if not muxed to the SDHCI controller. Fixes: bb5f8ea4d514 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC") Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-08-03pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirkAndy Shevchenko
Add one more model to the Chromebook DMI quirk to make it working again. Link: https://bugzilla.kernel.org/show_bug.cgi?id=194945 Fixes: 2a8209fa6823 ("pinctrl: cherryview: Extend the Chromebook DMI quirk to Intel_Strago systems") Reported-by: mail@abhishek.geek.nz Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2017-08-03crypto: inside-secure - fix the sha state length in hmac_sha1_setkeyAntoine Ténart
A check is performed on the ipad/opad in the safexcel_hmac_sha1_setkey function, but the index used by the loop doing it is wrong. It is currently the size of the state array while it should be the size of a sha1 state. This patch fixes it. Fixes: 1b44c5a60c13 ("crypto: inside-secure - add SafeXcel EIP197 crypto engine driver") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-08-03crypto: inside-secure - fix invalidation check in hmac_sha1_setkeyAntoine Ténart
The safexcel_hmac_sha1_setkey function checks if an invalidation command should be issued, i.e. when the context ipad/opad change. This checks is done after filling the ipad/opad which and it can't be true. The patch fixes this by moving the check before the ipad/opad memcpy operations. Fixes: 1b44c5a60c13 ("crypto: inside-secure - add SafeXcel EIP197 crypto engine driver") Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-08-02Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Two fixes from Trond this time, now that he's back from his vacation. The first is a stable fix for the EXCHANGE_ID issue on the mailing list, and the other fixes a double-free situation that he found at the same time. Stable fix: - Fix EXCHANGE_ID corrupt verifier issue Other fix: - Fix double frees in nfs4_test_session_trunk()" * tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4: Fix double frees in nfs4_test_session_trunk() NFSv4: Fix EXCHANGE_ID corrupt verifier issue
2017-08-02isdn/i4l: fix buffer overflowAnnie Cherkaev
This fixes a potential buffer overflow in isdn_net.c caused by an unbounded strcpy. [ ISDN seems to be effectively unmaintained, and the I4L driver in particular is long deprecated, but in case somebody uses this.. - Linus ] Signed-off-by: Jiten Thakkar <jitenmt@gmail.com> Signed-off-by: Annie Cherkaev <annie.cherk@gmail.com> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Kees Cook <keescook@chromium.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02clk: keystone: sci-clk: Fix sci_clk_getTero Kristo
Currently a bug in the sci_clk_get implementation causes it to always return a clock belonging to the last device in the static list of clock data. This is due to a bug in the init code that causes the array used by sci_clk_get to only be populated with the clocks for the last device, as each device overwrites the entire array with its own clocks. Fix this by calculating the actual number of clocks for the SoC, and allocating the whole array in one go. Also, we don't need the handle to the init data array anymore after doing this, instead we can just compare the dev_id / clk_id against the registered clocks and use binary search for speed. Signed-off-by: Tero Kristo <t-kristo@ti.com> Reported-by: Dave Gerlach <d-gerlach@ti.com> Fixes: b745c0794e2f ("clk: keystone: Add sci-clk driver support") Cc: Nishanth Menon <nm@ti.com> Tested-by: Franklin Cooper <fcooper@ti.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2017-08-02ocfs2: don't clear SGID when inheriting ACLsJan Kara
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl() into ocfs2_iop_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Also posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating mode itself. Fixes: 073931017b4 ("posix_acl: Clear SGID bit when setting file permissions") Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02mm: allow page_cache_get_speculative in interrupt contextKan Liang
Kernel panic when calling the IRQ-safe __get_user_pages_fast in NMI handler. The bug was introduced by commit 2947ba054a4d ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"). The original x86 __get_user_page_fast used plain get_page() or page_ref_add(). However, the generic __get_user_page_fast uses page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()). There is no reason to prevent page_cache_get_speculative from using in interrupt context. According to the author, putting a BUG_ON there is just because the code is not verifying correctness of interrupt races. I did some tests in interrupt context. There is no issue found. Removing VM_BUG_ON(in_interrupt()) for page_cache_get_speculative(). Link: http://lkml.kernel.org/r/1501609146-59730-1-git-send-email-kan.liang@intel.com Fixes: 2947ba054a4d ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation") Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02userfaultfd: non-cooperative: flush event_wqh at release timeMike Rapoport
There may still be threads waiting on event_wqh at the time the userfault file descriptor is closed. Flush the events wait-queue to prevent waiting threads from hanging. Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02ipc: add missing container_of()s for randstructKees Cook
When building with the randstruct gcc plugin, the layout of the IPC structs will be randomized, which requires any sub-structure accesses to use container_of(). The proc display handlers were missing the needed container_of()s since the iterator is passing in the top-level struct kern_ipc_perm. This would lead to crashes when running the "lsipc" program after the system had IPC registered (e.g. after starting up Gnome): general protection fault: 0000 [#1] PREEMPT SMP ... RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0 ... Call Trace: sysvipc_shm_proc_show+0x5e/0x150 sysvipc_proc_show+0x1a/0x30 seq_read+0x2e9/0x3f0 ... Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()Dima Zavin
In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a deadlock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to use two different static keys: one for begin (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() always return a valid seqcount if are enabling cpusets. Similarly, when disabling cpusets via cpuset_dec(), we first ensure that callers of cpuset_mems_allowed_retry() will start ignoring the seqcount value before we let cpuset_mems_allowed_begin() return 0. The relevant stack traces of the two stuck threads: CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: smp_call_function+0x3b/0x70 on_each_cpu+0x2f/0x90 text_poke_bp+0x87/0xd0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 __slab_alloc.isra.80+0x54/0x90 copy_process.part.40+0xf7/0x1de0 copy_process.part.40+0xf7/0x1de0 kmem_cache_alloc_node+0x8a/0x280 copy_process.part.40+0xf7/0x1de0 _do_fork+0xe7/0x6c0 _raw_spin_unlock_irq+0x2d/0x60 trace_hardirqs_on_caller+0x136/0x1d0 entry_SYSCALL_64_fastpath+0x5/0xad do_syscall_64+0x27/0x350 SyS_clone+0x19/0x20 do_syscall_64+0x60/0x350 entry_SYSCALL64_slow_path+0x25/0x25 Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com Fixes: 46e700abc44c ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled") Signed-off-by: Dima Zavin <dmitriyz@waymo.com> Reported-by: Cliff Spradlin <cspradlin@waymo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christopher Lameter <cl@linux.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02userfaultfd_zeropage: return -ENOSPC in case mm has goneMike Rapoport
In the non-cooperative userfaultfd case, the process exit may race with outstanding mcopy_atomic called by the uffd monitor. Returning -ENOSPC instead of -EINVAL when mm is already gone will allow uffd monitor to distinguish this case from other error conditions. Unfortunately I overlooked userfaultfd_zeropage when updating userfaultd_copy(). Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com Fixes: 96333187ab162 ("userfaultfd_copy: return -ENOSPC in case mm has gone") Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02mm: take memory hotplug lock within numa_zonelist_order_handler()Heiko Carstens
Andre Wild reported the following warning: WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60 Modules linked in: CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) task: 00000000701d8100 task.stack: 0000000073594000 Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60) ... Call Trace: lockdep_assert_cpus_held+0x42/0x60) stop_machine_cpuslocked+0x62/0xf0 build_all_zonelists+0x92/0x150 numa_zonelist_order_handler+0x102/0x150 proc_sys_call_handler.isra.12+0xda/0x118 proc_sys_write+0x34/0x48 __vfs_write+0x3c/0x178 vfs_write+0xbc/0x1a0 SyS_write+0x66/0xc0 system_call+0xc4/0x2b0 locks held by bash/1205: #0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0 #1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150 #2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150 Last Breaking-Event-Address: lockdep_assert_cpus_held+0x48/0x60 This can be easily triggered with e.g. echo n > /proc/sys/vm/numa_zonelist_order In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") memory hotplug locking was changed to fix a potential deadlock. This also switched the stop_machine() invocation within build_all_zonelists() to stop_machine_cpuslocked() which now expects that online cpus are locked when being called. This assumption is not true if build_all_zonelists() is being called from numa_zonelist_order_handler(). In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler(). Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reported-by: Andre Wild <wild@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02mm/page_io.c: fix oops during block io poll in swapin pathTetsuo Handa
When a thread is OOM-killed during swap_readpage() operation, an oops occurs because end_swap_bio_read() is calling wake_up_process() based on an assumption that the thread which called swap_readpage() is still alive. Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 task: ffffffffb7c16500 task.stack: ffffffffb7c00000 RIP: 0010:__lock_acquire+0x151/0x12f0 Call Trace: <IRQ> lock_acquire+0x59/0x80 _raw_spin_lock_irqsave+0x3b/0x4f try_to_wake_up+0x3b/0x410 wake_up_process+0x10/0x20 end_swap_bio_read+0x6f/0xf0 bio_endio+0x92/0xb0 blk_update_request+0x88/0x270 scsi_end_request+0x32/0x1c0 scsi_io_completion+0x209/0x680 scsi_finish_command+0xd4/0x120 scsi_softirq_done+0x120/0x140 __blk_mq_complete_request_remote+0xe/0x10 flush_smp_call_function_queue+0x51/0x120 generic_smp_call_function_single_interrupt+0xe/0x20 smp_trace_call_function_single_interrupt+0x22/0x30 smp_call_function_single_interrupt+0x9/0x10 call_function_single_interrupt+0xa7/0xb0 </IRQ> RIP: 0010:native_safe_halt+0x6/0x10 default_idle+0xe/0x20 arch_cpu_idle+0xa/0x10 default_idle_call+0x1e/0x30 do_idle+0x187/0x200 cpu_startup_entry+0x6e/0x70 rest_init+0xd0/0xe0 start_kernel+0x456/0x477 x86_64_start_reservations+0x24/0x26 x86_64_start_kernel+0xf7/0x11a secondary_startup_64+0xa5/0xa5 Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff <f0> ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85 RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50 ---[ end trace 6c441db499169b1e ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by holding a reference to the thread. [akpm@linux-foundation.org: add comment] Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Shaohua Li <shli@fb.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02zram: do not free pool->size_classMinchan Kim
Mike reported kernel goes oops with ltp:zram03 testcase. zram: Added device: zram0 zram0: detected capacity change from 0 to 107374182400 BUG: unable to handle kernel paging request at 0000306d61727a77 IP: zs_map_object+0xb9/0x260 PGD 0 P4D 0 Oops: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram] CPU: 6 PID: 12356 Comm: swapon Tainted: G E 4.13.0.g87b2c3f-default #194 Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698 , BIOS -[D6E150AUS-1.10]- 12/15/2010 task: ffff880158d2c4c0 task.stack: ffffc90001680000 RIP: 0010:zs_map_object+0xb9/0x260 Call Trace: zram_bvec_rw.isra.26+0xe8/0x780 [zram] zram_rw_page+0x6e/0xa0 [zram] bdev_read_page+0x81/0xb0 do_mpage_readpage+0x51a/0x710 mpage_readpages+0x122/0x1a0 blkdev_readpages+0x1d/0x20 __do_page_cache_readahead+0x1b2/0x270 ondemand_readahead+0x180/0x2c0 page_cache_sync_readahead+0x31/0x50 generic_file_read_iter+0x7e7/0xaf0 blkdev_read_iter+0x37/0x40 __vfs_read+0xce/0x140 vfs_read+0x9e/0x150 SyS_read+0x46/0xa0 entry_SYSCALL_64_fastpath+0x1a/0xa5 Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70 RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988 CR2: 0000306d61727a77 He bisected the problem is [1]. After commit cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling"), zram doesn't use double pointer for pool->size_class any more in zs_create_pool so counter function zs_destroy_pool don't need to free it, either. Otherwise, it does kfree wrong address and then, kernel goes Oops. Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox Fixes: cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>