summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-08-11MD: not clear ->safemode for external metadata arrayShaohua Li
->safemode should be triggered by mdadm for external metadaa array, otherwise array's state confuses mdadm. Fixes: 33182d15c6bf(md: always clear ->safemode when md_check_recovery gets the mddev lock.) Cc: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-08-11iomap: fix integer truncation issues in the zeroing and dirtying helpersChristoph Hellwig
Fix the min_t calls in the zeroing and dirtying helpers to perform the comparisms on 64-bit types, which prevents them from incorrectly being truncated, and larger zeroing operations being stuck in a never ending loop. Special thanks to Markus Stockhausen for spotting the bug. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-11xfs: fix inobt inode allocation search optimizationOmar Sandoval
When we try to allocate a free inode by searching the inobt, we try to find the inode nearest the parent inode by searching chunks both left and right of the chunk containing the parent. As an optimization, we cache the leftmost and rightmost records that we previously searched; if we do another allocation with the same parent inode, we'll pick up the search where it last left off. There's a bug in the case where we found a free inode to the left of the parent's chunk: we need to update the cached left and right records, but because we already reassigned the right record to point to the left, we end up assigning the left record to both the cached left and right records. This isn't a correctness problem strictly, but it can result in the next allocation rechecking chunks unnecessarily or allocating inodes further away from the parent than it needs to. Fix it by swapping the record pointer after we update the cached left and right records. Fixes: bd169565993b ("xfs: speed up free inode search") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-11udp: harden copy_linear_skb()Eric Dumazet
syzkaller got crashes with CONFIG_HARDENED_USERCOPY=y configs. Issue here is that recvfrom() can be used with user buffer of Z bytes, and SO_PEEK_OFF of X bytes, from a skb with Y bytes, and following condition : Z < X < Y kernel BUG at mm/usercopy.c:72! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 2917 Comm: syzkaller842281 Not tainted 4.13.0-rc3+ #16 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801d2fa40c0 task.stack: ffff8801d1fe8000 RIP: 0010:report_usercopy mm/usercopy.c:64 [inline] RIP: 0010:__check_object_size+0x3ad/0x500 mm/usercopy.c:264 RSP: 0018:ffff8801d1fef8a8 EFLAGS: 00010286 RAX: 0000000000000078 RBX: ffffffff847102c0 RCX: 0000000000000000 RDX: 0000000000000078 RSI: 1ffff1003a3fded5 RDI: ffffed003a3fdf09 RBP: ffff8801d1fef998 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d1ea480e R13: fffffffffffffffa R14: ffffffff84710280 R15: dffffc0000000000 FS: 0000000001360880(0000) GS:ffff8801dc000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000202ecfe4 CR3: 00000001d1ff8000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: check_object_size include/linux/thread_info.h:108 [inline] check_copy_size include/linux/thread_info.h:139 [inline] copy_to_iter include/linux/uio.h:105 [inline] copy_linear_skb include/net/udp.h:371 [inline] udpv6_recvmsg+0x1040/0x1af0 net/ipv6/udp.c:395 inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793 sock_recvmsg_nosec net/socket.c:792 [inline] sock_recvmsg+0xc9/0x110 net/socket.c:799 SYSC_recvfrom+0x2d6/0x570 net/socket.c:1788 SyS_recvfrom+0x40/0x50 net/socket.c:1760 entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11Merge branch 'bpf-Minor-fix-in-bpf_convert_ctx_access'David S. Miller
Daniel Borkmann says: ==================== bpf: Minor fix in bpf_convert_ctx_access First one was found while trying to compile the kernel with !CONFIG_NET_RX_BUSY_POLL. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11bpf: fix two missing target_size settings in bpf_convert_ctx_accessDaniel Borkmann
When CONFIG_NET_SCHED or CONFIG_NET_RX_BUSY_POLL is /not/ set and we try a narrow __sk_buff load of tc_index or napi_id, respectively, then verifier rightfully complains that it's misconfigured, because we need to set target_size in each of the two cases. The rewrite for the ctx access is just a dummy op, but needs to pass, so fix this up. Fixes: f96da09473b5 ("bpf: simplify narrower ctx access") Reported-by: Shubham Bansal <illusionist.neo@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11net: fix compilation when busy poll is not enabledDaniel Borkmann
MIN_NAPI_ID is used in various places outside of CONFIG_NET_RX_BUSY_POLL wrapping, so when it's not set we run into build errors such as: net/core/dev.c: In function 'dev_get_by_napi_id': net/core/dev.c:886:16: error: ‘MIN_NAPI_ID’ undeclared (first use in this function) if (napi_id < MIN_NAPI_ID) ^~~~~~~~~~~ Thus, have MIN_NAPI_ID always defined to fix these errors. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11mISDN: Fix null pointer dereference at mISDN_FsmNewAnton Vasilyev
If mISDN_FsmNew() fails to allocate memory for jumpmatrix then null pointer dereference will occur on any write to jumpmatrix. The patch adds check on successful allocation and corresponding error handling. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Anton Vasilyev <vasilyev@ispras.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11nfp: do not update MTU from BH in flower appSimon Horman
The Flower app may receive a request to update the MTU of a representor netdev upon receipt of a control message from the firmware. This requires the RTNL lock which needs to be taken outside of the packet processing path. As a handling of this correctly seems a little to invasive for a fix simply skip setting the MTU for now. Relevant backtrace: [ 1496.288489] BUG: scheduling while atomic: kworker/0:3/373/0x00000100 [ 1496.294911] dca syscopyarea sysfillrect sysimgblt fb_sys_fops ptp drm mxm_wmi ahci pps_core libahci i2c_algo_bit wmi [last unloaded: nfp] [ 1496.294918] CPU: 0 PID: 373 Comm: kworker/0:3 Tainted: G OE 4.13.0-rc3+ #3 [ 1496.294919] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0 12/28/2015 [ 1496.294923] Workqueue: events work_for_cpu_fn [ 1496.294924] Call Trace: [ 1496.294927] <IRQ> [ 1496.294931] dump_stack+0x63/0x82 [ 1496.294935] __schedule_bug+0x54/0x70 [ 1496.294937] __schedule+0x62f/0x890 [ 1496.294941] ? intel_unmap_sg+0x90/0x90 [ 1496.294942] schedule+0x36/0x80 [ 1496.294943] schedule_preempt_disabled+0xe/0x10 [ 1496.294945] __mutex_lock.isra.2+0x445/0x4a0 [ 1496.294947] ? device_is_rmrr_locked+0x12/0x50 [ 1496.294950] ? kfree+0x162/0x170 [ 1496.294952] ? device_is_rmrr_locked+0x12/0x50 [ 1496.294953] ? iommu_should_identity_map+0x50/0xe0 [ 1496.294954] __mutex_lock_slowpath+0x13/0x20 [ 1496.294955] ? iommu_no_mapping+0x48/0xd0 [ 1496.294956] ? __mutex_lock_slowpath+0x13/0x20 [ 1496.294957] mutex_lock+0x2f/0x40 [ 1496.294960] rtnl_lock+0x15/0x20 [ 1496.294979] nfp_flower_cmsg_rx+0xc8/0x150 [nfp] [ 1496.294986] nfp_ctrl_poll+0x286/0x350 [nfp] [ 1496.294989] tasklet_action+0xf6/0x110 [ 1496.294992] __do_softirq+0xed/0x278 [ 1496.294993] irq_exit+0xb6/0xc0 [ 1496.294994] do_IRQ+0x4f/0xd0 [ 1496.294996] common_interrupt+0x89/0x89 Fixes: 948faa46c05b ("nfp: add support for control messages for flower app") Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11net: stmmac: Use the right logging function in stmmac_mdio_registerRomain Perier
Currently, the function stmmac_mdio_register() is only used by stmmac_dvr_probe() from stmmac_main.c, in order to register the MDIO bus and probe information about the PHY. As this function is called before calling register_netdev(), all messages logged from stmmac_mdio_register are prefixed by "(unnamed net_device)". The goal of netdev_info or netdev_err is to dump useful infos about a net_device, when this data structure is partially initialized, there is no point for using these functions. This commit fixes the issue by replacing all netdev_*() by the corresponding dev_*() function for logging. The last netdev_info is replaced by phy_attached_info(), as a valid phydev can be used at this point. Signed-off-by: Romain Perier <romain.perier@collabora.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11net/sched/hfsc: allocate tcf block for hfsc root classKonstantin Khlebnikov
Without this filters cannot be attached. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure") Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11bonding: require speed/duplex only for 802.3ad, alb and tlbAndreas Born
The patch c4adfc822bf5 ("bonding: make speed, duplex setting consistent with link state") puts the link state to down if bond_update_speed_duplex() cannot retrieve speed and duplex settings. Assumably the patch was written with 802.3ad mode in mind which relies on link speed/duplex settings. For other modes like active-backup these settings are not required. Thus, only for these other modes, this patch reintroduces support for slaves that do not support reporting speed or duplex such as wireless devices. This fixes the regression reported in bug 196547 (https://bugzilla.kernel.org/show_bug.cgi?id=196547). Fixes: c4adfc822bf5 ("bonding: make speed, duplex setting consistent with link state") Signed-off-by: Andreas Born <futur.andy@googlemail.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11net: dsa: ksz: fix skb freeingVivien Didelot
The DSA layer frees the original skb when an xmit function returns NULL, meaning an error occurred. But if the tagging code copied the original skb, it is responsible of freeing the copy if an error occurs. The ksz tagging code currently has two issues: if skb_put_padto fails, the skb copy is not freed, and the original skb will be freed twice. To fix that, move skb_put_padto inside both branches of the skb_tailroom condition, before freeing the original skb, and free the copy on error. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11Merge tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "A few more NFS client bugfixes from me for rc5. Dros has a stable fix for flexfiles to prevent leaking the nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a potential recovery loop situation with the TEST_STATEID operation, and Christoph fixed up the pNFS blocklayout Kconfig options to prevent unsafe use with kernels that don't have large block device support. Summary: Stable fix: - fix leaking nfs4_ff_ds_version array Other fixes: - improve TEST_STATEID OLD_STATEID handling to prevent recovery loop - require 64-bit sector_t for pNFS blocklayout to prevent 32-bit compile errors" * tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs: pnfs/blocklayout: require 64-bit sector_t NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid() nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
2017-08-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A set of fixes that should go into this series. This contains: - Fix from Bart for blk-mq requeue queue running, preventing a continued loop of run/restart. - Fix for a bio/blk-integrity issue, in two parts. One from Christoph, fixing where verification happens, and one from Milan, for a NULL profile. - NVMe pull request, most of the changes being for nvme-fc, but also a few trivial core/pci fixes" * 'for-linus' of git://git.kernel.dk/linux-block: nvme: fix directive command numd calculation nvme: fix nvme reset command timeout handling nvme-pci: fix CMB sysfs file removal in reset path lpfc: support nvmet_fc defer_rcv callback nvmet_fc: add defer_req callback for deferment of cmd buffer return nvme: strip trailing 0-bytes in wwid_show block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time bio-integrity: only verify integrity on the lowest stacked driver bio-integrity: Fix regression if profile verify_fn is NULL
2017-08-11Merge tag 'mmc-v4.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc Pull MMC fixes from Ulf Hansson: "MMC core: - fix lockdep splat when removing mmc_block module - fix the logic for setting eMMC HS400ES signal voltage MMC host: - omap_hsmmc: add CMD23 capability to fix -EIO errors" * tag 'mmc-v4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: block: fix lockdep splat when removing mmc_block module mmc: mmc: correct the logic for setting HS400ES signal voltage mmc: host: omap_hsmmc: Add CMD23 capability to omap_hsmmc driver
2017-08-11Merge tag 'fbdev-v4.13-rc5' of git://github.com/bzolnier/linuxLinus Torvalds
Pull fbdev fixes from Bartlomiej Zolnierkiewicz: - allow user to disable write combined mapping in efifb driver (Dave Airlie) - fix use after free bugs on driver removal in imxfb driver (Dan Carpenter) - fix unused variable warning in omapfb driver (Arnd Bergmann) * tag 'fbdev-v4.13-rc5' of git://github.com/bzolnier/linux: efifb: allow user to disable write combined mapping. fbdev: omapfb: remove unused variable video: fbdev: imxfb: use after free in imxfb_remove()
2017-08-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Fix a few bugs in fuse" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: set mapping error in writepage_locked when it fails fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio fuse: initialize the flock flag in fuse_file on allocation
2017-08-11Merge tag 'iommu-fixes-v4.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU fix from Joerg Roedel: "Fix a NULL-pointer dereference in arm_smmu_add_device" * tag 'iommu-fixes-v4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/arm-smmu: fix null-pointer dereference in arm_smmu_add_device
2017-08-11pnfs/blocklayout: require 64-bit sector_tChristoph Hellwig
The blocklayout code does not compile cleanly for a 32-bit sector_t, and also has no reliable checks for devices sizes, which makes it unsafe to use with a kernel that doesn't support large block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11Merge tag 'powerpc-4.13-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "All fixes for code that went in this cycle. - a revert of an optimisation to the syscall exit path, which could lead to an oops on either older machines or machines with > 1TB of memory - disable some deep idle states if the firmware configuration for them fails - re-enable HARD/SOFT lockup detectors in defconfigs after a Kconfig change - six fairly small patches fixing bugs in our new watchdog code Thanks to: Gautham R Shenoy, Nicholas Piggin" * tag 'powerpc-4.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/watchdog: add locking around init/exit functions powerpc/watchdog: Fix marking of stuck CPUs powerpc/watchdog: Fix final-check recovered case powerpc/watchdog: Moderate touch_nmi_watchdog overhead powerpc/watchdog: Improve watchdog lock primitive powerpc: NMI IPI improve lock primitive powerpc/configs: Re-enable HARD/SOFT lockup detectors powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails Revert "powerpc/64: Avoid restore_math call if possible in syscall exit"
2017-08-11selftests: timers: freq-step: fix compile errorShuah Khan
Fix compile error due to ksft_exit_skip() update to take var_args. freq-step.c: In function ‘init_test’: freq-step.c:234:3: error: too few arguments to function ‘ksft_exit_skip’ ksft_exit_skip(); ^~~~~~~~~~~~~~ In file included from freq-step.c:26:0: ../kselftest.h:167:19: note: declared here static inline int ksft_exit_skip(const char *msg, ...) ^~~~~~~~~~~~~~ <builtin>: recipe for target 'freq-step' failed Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-08-11iommu/arm-smmu: fix null-pointer dereference in arm_smmu_add_deviceArtem Savkov
Commit c54451a "iommu/arm-smmu: Fix the error path in arm_smmu_add_device" removed fwspec assignment in legacy_binding path as redundant which is wrong. It needs to be updated after fwspec initialisation in arm_smmu_register_legacy_master() as it is dereferenced later. Without this there is a NULL-pointer dereference panic during boot on some hosts. Signed-off-by: Artem Savkov <asavkov@redhat.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-11xen/events: Fix interrupt lost during irq_disable and irq_enableLiu Shuo
Here is a device has xen-pirq-MSI interrupt. Dom0 might lost interrupt during driver irq_disable/irq_enable. Here is the scenario, 1. irq_disable -> disable_dynirq -> mask_evtchn(irq channel) 2. dev interrupt raised by HW and Xen mark its evtchn as pending 3. irq_enable -> startup_pirq -> eoi_pirq -> clear_evtchn(channel of irq) -> clear pending status 4. consume_one_event process the irq event without pending bit assert which result in interrupt lost once 5. No HW interrupt raising anymore. Now use enable_dynirq for enable_pirq of xen_pirq_chip to remove eoi_pirq when irq_enable. Signed-off-by: Liu Shuo <shuo.a.liu@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2017-08-11xen: avoid deadlock in xenbusJuergen Gross
When starting the xenwatch thread a theoretical deadlock situation is possible: xs_init() contains: task = kthread_run(xenwatch_thread, NULL, "xenwatch"); if (IS_ERR(task)) return PTR_ERR(task); xenwatch_pid = task->pid; And xenwatch_thread() does: mutex_lock(&xenwatch_mutex); ... event->handle->callback(); ... mutex_unlock(&xenwatch_mutex); The callback could call unregister_xenbus_watch() which does: ... if (current->pid != xenwatch_pid) mutex_lock(&xenwatch_mutex); ... In case a watch is firing before xenwatch_pid could be set and the callback of that watch unregisters a watch, then a self-deadlock would occur. Avoid this by setting xenwatch_pid in xenwatch_thread(). Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2017-08-11Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linusJens Axboe
Pull NVMe fixes from Christoph: "A few more small fixes - the fc/lpfc update is the biggest by far."
2017-08-11clocksource/drivers/arm_arch_timer: Avoid infinite recursion when ftrace is ↵Ding Tianhong
enabled On platforms with an arch timer erratum workaround, it's possible for arch_timer_reg_read_stable() to recurse into itself when certain tracing options are enabled, leading to stack overflows and related problems. For example, when PREEMPT_TRACER and FUNCTION_GRAPH_TRACER are selected, it's possible to trigger this with: $ mount -t debugfs nodev /sys/kernel/debug/ $ echo function_graph > /sys/kernel/debug/tracing/current_tracer The problem is that in such cases, preempt_disable() instrumentation attempts to acquire a timestamp via trace_clock(), resulting in a call back to arch_timer_reg_read_stable(), and hence recursion. This patch changes arch_timer_reg_read_stable() to use preempt_{disable,enable}_notrace(), which avoids this. This problem is similar to the fixed by upstream commit 96b3d28bf4 ("sched/clock: Prevent tracing recursion in sched_clock_cpu()"). Fixes: 6acc71ccac71 ("arm64: arch_timer: Allows a CPU-specific erratum to only affect a subset of CPUs") Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2017-08-11xen: fix hvm guest with kaslr enabledJuergen Gross
A Xen HVM guest running with KASLR enabled will die rather soon today because the shared info page mapping is using va() too early. This was introduced by commit a5d5f328b0e2baa5ee7c119fd66324eb79eeeb66 ("xen: allocate page for shared info page from low memory"). In order to fix this use early_memremap() to get a temporary virtual address for shared info until va() can be used safely. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2017-08-11xen: split up xen_hvm_init_shared_info()Juergen Gross
Instead of calling xen_hvm_init_shared_info() on boot and resume split it up into a boot time function searching for the pfn to use and a mapping function doing the hypervisor mapping call. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2017-08-11x86: provide an init_mem_mapping hypervisor hookJuergen Gross
Provide a hook in hypervisor_x86 called after setting up initial memory mapping. This is needed e.g. by Xen HVM guests to map the hypervisor shared info page. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2017-08-11x86: Mark various structures and functions as 'static'Colin Ian King
Mark a couple of structures and functions as 'static', pointed out by Sparse: warning: symbol 'bts_pmu' was not declared. Should it be static? warning: symbol 'p4_event_aliases' was not declared. Should it be static? warning: symbol 'rapl_attr_groups' was not declared. Should it be static? symbol 'process_uv2_message' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Andrew Banman <abanman@hpe.com> # for the UV change Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20170810155709.7094-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized ↵Borislav Petkov
VMSAVE/VMLOAD" CPUID flag "virtual_vmload_vmsave" is what is going to land in /proc/cpuinfo now as per v4.13-rc4, for a single feature bit which is clearly too long. So rename it to what it is called in the processor manual. "v_vmsave_vmload" is a bit shorter, after all. We could go more aggressively here but having it the same as in the processor manual is advantageous. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm-ML <kvm@vger.kernel.org> Link: http://lkml.kernel.org/r/20170801185552.GA3743@nazgul.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11clocksource/drivers/Kconfig: Fix CLKSRC_PISTACHIO dependenciesMatt Redfearn
In v4.13, CLKSRC_PISTACHIO can select TIMER_OF on architectures without GENERIC_CLOCKEVENTS, resulting in a struct clock_event_device missing some required features and build breakage compiling timer_of.c. One of the symbols selecting TIMER_OF is CLKSRC_PISTACHIO, so add the dependency on GENERIC_CLOCKEVENTS. Thanks to kbuild test robot for finding this error (https://lkml.org/lkml/2017/7/16/249) Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Suggested-by: Ian Abbott <abbotti@mev.co.uk> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2017-08-11clocksource/drivers/timer-of: Checking for IS_ERR() instead of NULLDan Carpenter
The current code checks the return value of the of_io_request_and_map() function as it was returning a NULL pointer in case of error. However, it returns an error code encoded in the pointer return value, not a NULL value. Fix this by checking the returned pointer against IS_ERR() and return the error with PTR_ERR(). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2017-08-11fuse: set mapping error in writepage_locked when it failsJeff Layton
This ensures that we see errors on fsync when writeback fails. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-08-11ALSA: seq: Fix CONFIG_SND_SEQ_MIDI dependencyTakashi Iwai
The commit 0181307abc1d ("ALSA: seq: Reorganize kconfig and build") rewrote the dependency of each sequencer module in a standard way, but there was one change applied mistakenly: CONFIG_SND_SEQ_MIDI isn't enabled properly by CONFIG_SND_RAWMIDI. I seem to have changed the wrong one instead, CONFIG_SND_SEQ_MIDI_EMUL, which is eventually reverse-selected by CONFIG_SND_SEQ_MIDI itself. This ended up the lack of snd-seq-midi module as reported below. The fix is to put def_tristate properly to CONFIG_SND_SEQ_MIDI instead of *_MIDI_EMUL entry. Fixes: 0181307abc1d ("ALSA: seq: Reorganize kconfig and build") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196633 Signed-off-by: Takashi Iwai <tiwai@suse.de>
2017-08-10Merge tag 'drm-fixes-for-v4.13-rc5' of ↵Linus Torvalds
git://people.freedesktop.org/~airlied/linux Pull drm fixes from Dave Airlie: "Nothing too earth shattering here, it just seems like lots of little things all over the place. msm has probably the larger amount of changes, but they all seem fine, otherwise, some rockchip, i915, etnaviv and exynos fixes, along with one nouveau regression fix for some older GPUs" * tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux: (35 commits) drm/nouveau/disp/nv04: avoid creation of output paths drm: make DRM_STM default n drm/exynos: forbid creating framebuffers from too small GEM buffers drm/etnaviv: Fix off-by-one error in reloc checking drm/i915: fix backlight invert for non-zero minimum brightness drm/i915/shrinker: Wrap need_resched() inside preempt-disable drm/i915/perf: fix flex eu registers programming drm/i915: Fix out-of-bounds array access in bdw_load_gamma_lut drm/i915/gvt: Change the max length of mmio_reg_rw from 4 to 8 drm/i915/gvt: Initialize MMIO Block with HW state drm/rockchip: vop: report error when check resource error drm/rockchip: vop: round_up pitches to word align drm/rockchip: vop: fix NV12 video display error drm/rockchip: vop: fix iommu page fault when resume drm/i915/gvt: clean workload queue if error happened drm/i915/gvt: change resetting to resetting_eng drm/msm: gpu: don't abuse dma_alloc for non-DMA allocations drm/msm: gpu: call qcom_mdt interfaces only for ARCH_QCOM drm/msm/adreno: Prevent unclocked access when retrieving timestamps drm/msm: Remove __user from __u64 data types ...
2017-08-11arc: Mask individual IRQ lines during core INTC initAlexey Brodkin
ARC cores on reset have all interrupt lines of built-in INTC enabled. Which means once we globally enable interrupts (very early on boot) faulty hardware blocks may trigger an interrupt that Linux kernel cannot handle yet as corresponding handler is not yet installed. In that case system falls in "interrupt storm" and basically never does anything useful except entering and exiting generic IRQ handling code. One real example of that kind of problematic hardware is DW GMAC which also has interrupts enabled on reset and if Ethernet PHY informs GMAC about link state, GMAC immediately reports that upstream to ARC core and here we are. Now with that change we mask all individual IRQ lines making entire system more fool-proof. [This patch was motivated by Adaptrum platform support] Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com> Cc: Eugeniy Paltsev <paltsev@synopsys.com> Tested-by: Alexandru Gagniuc <alex.g@adaptrum.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2017-08-11cpufreq: x86: Disable interrupts during MSRs readingDoug Smythies
According to Intel 64 and IA-32 Architectures SDM, Volume 3, Chapter 14.2, "Software needs to exercise care to avoid delays between the two RDMSRs (for example interrupts)". So, disable interrupts during reading MSRs IA32_APERF and IA32_MPERF. See also: commit 4ab60c3f32c7 (cpufreq: intel_pstate: Disable interrupts during MSRs reading). Signed-off-by: Doug Smythies <dsmythies@telus.net> Reviewed-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-11cpufreq: intel_pstate: report correct CPU frequencies during traceDoug Smythies
The intel_pstate CPU frequency scaling driver has always calculated CPU frequency incorrectly. Recent changes have eliminted most of the issues, however the frequency reported in the trace buffer, if used, is incorrect. It remains desireable that cpu->pstate.scaling still be a nice round number for things such as when setting max and min frequencies. So the proposal is to just fix the reported frequency in the trace data. Fixes what remains of [1]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=96521 # [1] Signed-off-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage zram: rework copy of compressor name in comp_algorithm_store() rmap: do not call mmu_notifier_invalidate_page() under ptl mm: fix list corruptions on shmem shrinklist mm/balloon_compaction.c: don't zero ballooned pages MAINTAINERS: copy virtio on balloon_compaction.c mm: fix KSM data corruption mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem mm: make tlb_flush_pending global mm: refactor TLB gathering API Revert "mm: numa: defer TLB flush for THP migration as long as possible" mm: migrate: fix barriers around tlb_flush_pending mm: migrate: prevent racy access to tlb_flush_pending fault-inject: fix wrong should_fail() decision in task context test_kmod: fix small memory leak on filesystem tests test_kmod: fix the lock in register_test_dev_kmod() test_kmod: fix bug which allows negative values on two config options test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case mm: ratelimit PFNs busy info message ...
2017-08-10userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropageMike Rapoport
When the process exit races with outstanding mcopy_atomic, it would be better to return ESRCH error. When such race occurs the process and it's mm are going away and returning "no such process" to the uffd monitor seems better fit than ENOSPC. Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10zram: rework copy of compressor name in comp_algorithm_store()Matthias Kaehlcke
comp_algorithm_store() passes the size of the source buffer to strlcpy() instead of the destination buffer size. Make it explicit that the two buffers have the same size and use strcpy() instead of strlcpy(). The latter can be done safely since the function ensures that the string in the source buffer is terminated. Link: http://lkml.kernel.org/r/20170803163350.45245-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10rmap: do not call mmu_notifier_invalidate_page() under ptlKirill A. Shutemov
MMU notifiers can sleep, but in page_mkclean_one() we call mmu_notifier_invalidate_page() under page table lock. Let's instead use mmu_notifier_invalidate_range() outside page_vma_mapped_walk() loop. [jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl] Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reported-by: axie <axie@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Writer, Tim" <Tim.Writer@amd.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: fix list corruptions on shmem shrinklistCong Wang
We saw many list corruption warnings on shmem shrinklist: WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0 list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960 Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013 Call Trace: dump_stack+0x4d/0x66 __warn+0xcb/0xf0 warn_slowpath_fmt+0x4f/0x60 __list_del_entry+0x9e/0xc0 shmem_unused_huge_shrink+0xfa/0x2e0 shmem_unused_huge_scan+0x20/0x30 super_cache_scan+0x193/0x1a0 shrink_slab.part.41+0x1e3/0x3f0 shrink_slab+0x29/0x30 shrink_node+0xf9/0x2f0 kswapd+0x2d8/0x6c0 kthread+0xd7/0xf0 ret_from_fork+0x22/0x30 WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8). Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G W 4.9.34-t3.el7.twitter.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013 Call Trace: dump_stack+0x4d/0x66 __warn+0xcb/0xf0 warn_slowpath_fmt+0x4f/0x60 __list_add+0x89/0xb0 shmem_setattr+0x204/0x230 notify_change+0x2ef/0x440 do_truncate+0x5d/0x90 path_openat+0x331/0x1190 do_filp_open+0x7e/0xe0 do_sys_open+0x123/0x200 SyS_open+0x1e/0x20 do_syscall_64+0x61/0x170 entry_SYSCALL64_slow_path+0x25/0x25 The problem is that shmem_unused_huge_shrink() moves entries from the global sbinfo->shrinklist to its local lists and then releases the spinlock. However, a parallel shmem_setattr() could access one of these entries directly and add it back to the global shrinklist if it is removed, with the spinlock held. The logic itself looks solid since an entry could be either in a local list or the global list, otherwise it is removed from one of them by list_del_init(). So probably the race condition is that, one CPU is in the middle of INIT_LIST_HEAD() but the other CPU calls list_empty() which returns true too early then the following list_add_tail() sees a corrupted entry. list_empty_careful() is designed to fix this situation. [akpm@linux-foundation.org: add comments] Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm/balloon_compaction.c: don't zero ballooned pagesWei Wang
Revert commit bb01b64cfab7 ("mm/balloon_compaction.c: enqueue zero page to balloon device")' Zeroing ballon pages is rather time consuming, especially when a lot of pages are in flight. E.g. 7GB worth of ballooned memory takes 2.8s with __GFP_ZERO while it takes ~491ms without it. The original commit argued that zeroing will help ksmd to merge these pages on the host but this argument is assuming that the host actually marks balloon pages for ksm which is not universally true. So we pay performance penalty for something that even might not be used in the end which is wrong. The host can zero out pages on its own when there is a need. [mhocko@kernel.org: new changelog text] Link: http://lkml.kernel.org/r/1501761557-9758-1-git-send-email-wei.w.wang@intel.com Fixes: bb01b64cfab7 ("mm/balloon_compaction.c: enqueue zero page to balloon device") Signed-off-by: Wei Wang <wei.w.wang@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: zhenwei.pi <zhenwei.pi@youruncloud.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10MAINTAINERS: copy virtio on balloon_compaction.cMichael S. Tsirkin
Changes to mm/balloon_compaction.c can easily break virtio, and virtio is the only user of that interface. Add a line to MAINTAINERS so whoever changes that file remembers to copy us. Link: http://lkml.kernel.org/r/1501764010-24456-1-git-send-email-mst@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: fix KSM data corruptionMinchan Kim
Nadav reported KSM can corrupt the user data by the TLB batching race[1]. That means data user written can be lost. Quote from Nadav Amit: "For this race we need 4 CPUs: CPU0: Caches a writable and dirty PTE entry, and uses the stale value for write later. CPU1: Runs madvise_free on the range that includes the PTE. It would clear the dirty-bit. It batches TLB flushes. CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We care about the fact that it clears the PTE write-bit, and of course, batches TLB flushes. CPU3: Runs KSM. Our purpose is to pass the following test in write_protect_page(): if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) Since it will avoid TLB flush. And we want to do it while the PTE is stale. Later, and before replacing the page, we would be able to change the page. Note that all the operations the CPU1-3 perform canhappen in parallel since they only acquire mmap_sem for read. We start with two identical pages. Everything below regards the same page/PTE. CPU0 CPU1 CPU2 CPU3 ---- ---- ---- ---- Write the same value on page [cache PTE as dirty in TLB] MADV_FREE pte_mkclean() 4 > clear_refs pte_wrprotect() write_protect_page() [ success, no flush ] pages_indentical() [ ok ] Write to page different value [Ok, using stale PTE] replace_page() Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0 already wrote on the page, but KSM ignored this write, and it got lost" In above scenario, MADV_FREE is fixed by changing TLB batching API including [set|clear]_tlb_flush_pending. Remained thing is soft-dirty part. This patch changes soft-dirty uses TLB batching API instead of flush_tlb_mm and KSM checks pending TLB flush by using mm_tlb_flush_pending so that it will flush TLB to avoid data lost if there are other parallel threads pending TLB flush. [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: Nadav Amit <namit@vmware.com> Tested-by: Nadav Amit <namit@vmware.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: fix MADV_[FREE|DONTNEED] TLB flush miss problemMinchan Kim
Nadav reported parallel MADV_DONTNEED on same range has a stale TLB problem and Mel fixed it[1] and found same problem on MADV_FREE[2]. Quote from Mel Gorman: "The race in question is CPU 0 running madv_free and updating some PTEs while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1 may have writable TLB entries for a page but fail the pte_dirty check (because CPU 0 has updated it already) and potentially fail to flush. Hence, when madv_free on CPU 1 returns, there are still potentially writable TLB entries and the underlying PTE is still present so that a subsequent write does not necessarily propagate the dirty bit to the underlying PTE any more. Reclaim at some unknown time at the future may then see that the PTE is still clean and discard the page even though a write has happened in the meantime. I think this is possible but I could have missed some protection in madv_free that prevents it happening." This patch aims for solving both problems all at once and is ready for other problem with KSM, MADV_FREE and soft-dirty story[3]. TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch there are parallel threads going on. In that case, forcefully, flush TLB to prevent for user to access memory via stale TLB entry although it fail to gather page table entry. I confirmed this patch works with [4] test program Nadav gave so this patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2" in current mmotm. NOTE: This patch modifies arch-specific TLB gathering interface(x86, ia64, s390, sh, um). It seems most of architecture are straightforward but s390 need to be careful because tlb_flush_mmu works only if mm->context.flush_mm is set to non-zero which happens only a pte entry really is cleared by ptep_get_and_clear and friends. However, this problem never changes the pte entries but need to flush to prevent memory access from stale tlb. [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com [4] https://patchwork.kernel.org/patch/9861621/ [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu] Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: Nadav Amit <namit@vmware.com> Reported-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Jeff Dike <jdike@addtoit.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: make tlb_flush_pending globalMinchan Kim
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING| COMPACTION] but upcoming patches to solve subtle TLB flush batching problem will use it regardless of compaction/NUMA so this patch doesn't remove the dependency. [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement] Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>