summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-10-10Merge tag 'gfs2-4.19.fixes2' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Andreas writes: "gfs2 4.19 fix: This fixes a regression introduced in commit 64bc06bb32ee "gfs2: iomap buffered write support"" * tag 'gfs2-4.19.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix iomap buffered write support for journaled files
2018-10-10Merge tag 's390-4.19-4' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Martin writes: "s390 fixes for 4.19-rc8 Four more patches for 4.19: - Fix resume after suspend-to-disk if resume-CPU != suspend-CPU - Fix vfio-ccw check for pinned pages - Two patches to avoid a usercopy-whitelist warning in vfio-ccw" * tag 's390-4.19-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/cio: Fix how vfio-ccw checks pinned pages s390/cio: Refactor alloc of ccw_io_region s390/cio: Convert ccw_io_region to pointer s390/hibernate: fix error handling when suspend cpu != resume cpu
2018-10-10Merge tag 'mips_fixes_4.19_2' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Paul writes: "A few MIPS fixes for 4.19: - Avoid suboptimal placement of our VDSO when using the legacy mmap layout, which can prevent statically linked programs that were able to allocate large amounts of memory using the brk syscall prior to the introduction of our VDSO from functioning correctly. - Fix up CONFIG_CMDLINE handling for platforms which ought to ignore DT arguments but have incorrectly used them & lost other arguments since v3.16. - Fix a path in MAINTAINERS to use valid wildcards. - Fixup a regression from v4.17 in memset() for systems using CPU_DADDI_WORKAROUNDS." * tag 'mips_fixes_4.19_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: memset: Fix CPU_DADDI_WORKAROUNDS `small_fixup' regression MAINTAINERS: MIPS/LOONGSON2 ARCHITECTURE - Use the normal wildcard style MIPS: Fix CONFIG_CMDLINE handling MIPS: VDSO: Always map near top of user memory
2018-10-10x86/defconfig: Enable CONFIG_USB_XHCI_HCD=yAdam Borowski
A spanking new machine I just got has all but one USB ports wired as 3.0. Booting defconfig resulted in no keyboard or mouse, which was pretty uncool. Let's enable that -- USB3 is ubiquitous rather than an oddity. As 'y' not 'm' -- recovering from initrd problems needs a keyboard. Also add it to the 32-bit defconfig. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-usb@vger.kernel.org Link: http://lkml.kernel.org/r/20181009062803.4332-1-kilobyte@angband.pl Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-10MAINTAINERS: update the SELinux mailing list locationPaul Moore
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-10-10s390/mem_detect: add missing includeHeiko Carstens
Fix this allnoconfig build breakage: arch/s390/boot/mem_detect.c: In function 'tprot': arch/s390/boot/mem_detect.c:122:12: error: 'EFAULT' undeclared (first use in this function) Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-10s390/dumpstack: print psw mask and address againHeiko Carstens
With pointer obfuscation the output of show_registers() became quite useless: Krnl PSW : (____ptrval____) (____ptrval____) (__list_add_valid+0x98/0xa8) In order to print the psw mask and address use %px instead of %p. And the output looks again like this: Krnl PSW : 0404d00180000000 00000000007c0dd0 (__list_add_valid+0x98/0xa8) Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-10s390/crypto: Enhance paes cipher to accept variable length key materialIngo Franzki
Enhance the paes_s390 kernel module to allow the paes cipher to accept variable length key material. The key material accepted by the paes cipher is a key blob of various types. As of today, two key blob types are supported: CCA secure key blobs and protected key blobs. Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-10s390/pkey: Introduce new API for transforming key blobsIngo Franzki
Introduce a new ioctl API and in-kernel API to transform a variable length key blob of any supported type into a protected key. Transforming a secure key blob uses the already existing function pkey_sec2protk(). Transforming a protected key blob also verifies if the protected key is still valid. If not, -ENODEV is returned. Both APIs are described in detail in the header files arch/s390/include/asm/pkey.h and arch/s390/include/uapi/asm/pkey.h. Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-10s390/pkey: Introduce new API for random protected key verificationIngo Franzki
Introduce a new ioctl API and in-kernel API to verify if a random protected key is still valid. A protected key is invalid when its wrapping key verification pattern does not match the verification pattern of the LPAR. Each time an LPAR is activated, a new LPAR wrapping key is generated and the wrapping key verification pattern is updated. Both APIs are described in detail in the header files arch/s390/include/asm/pkey.h and arch/s390/include/uapi/asm/pkey.h. Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-10s390/pkey: Add sysfs attributes to emit secure key blobsIngo Franzki
Add binary read-only sysfs attributes for the pkey module that can be used to read random ccadata secure keys from. Keys are read from these attributes using a cat-like interface. A typical use case for those keys is to encrypt a swap device using the paes cipher. During processing of /etc/crypttab, the random random ccadata secure key to encrypt the swap device is read from one of the attributes. The following attributes are added: ccadata/aes_128 ccadata/aes_192 ccadata/aes_256 ccadata/aes_128_xts ccadata/aes_256_xts Each attribute emits a secure key blob for the corresponding key size and cipher mode. Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-10s390/pkey: Add sysfs attributes to emit protected key blobsIngo Franzki
Add binary read-only sysfs attributes for the pkey module that can be used to read random protected keys from. Keys are read from these attributes using a cat-like interface. A typical use case for those keys is to encrypt a swap device using the paes cipher. During processing of /etc/crypttab, the random protected key to encrypt the swap device is read from one of the attributes. The following attributes are added: protkey/aes_128 protkey/aes_192 protkey/aes_256 protkey/aes_128_xts protkey/aes_256_xts Each attribute emits a protected key blob for the corresponding key size and cipher mode. Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-10-09mfd: cros-ec: copy the whole event in get_next_event_xferEmil Karlson
Commit 57e94c8b974db2d83c60e1139c89a70806abbea0 caused cros-ec keyboard events be truncated on many chromebooks so that Left and Right keys on Column 12 were always 0. Use ret as memcpy len to fix this. The old code was using ec_dev->event_size, which is the event payload/data size excluding event_type header, for the length of the memcpy operation. Use ret as memcpy length to avoid the off by one and copy the whole msg->data. Fixes: 57e94c8b974d ("mfd: cros-ec: Increase maximum mkbp event size") Acked-by: Enric Balletbo i Serra <enric.balletbo@collabora.com> Tested-by: Emil Renner Berthing <kernel@esmil.dk> Signed-off-by: Emil Karlson <jekarlson@gmail.com> Signed-off-by: Benson Leung <bleung@chromium.org>
2018-10-09sparc: Wire up io_pgetevents system call.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09mm: Preserve _PAGE_DEVMAP across mprotect() callsJan Kara
Currently _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a result we will see warnings such as: BUG: Bad page map in process JobWrk0013 pte:800001803875ea25 pmd:7624381067 addr:00007f0930720000 vm_flags:280000f9 anon_vma: (null) mapping:ffff97f2384056f0 index:0 file:457-000000fe00000030-00000009-000000ca-00000001_2001.fileblock fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage: (null) CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: G W 4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased) Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018 Call Trace: dump_stack+0x5a/0x75 print_bad_pte+0x217/0x2c0 ? enqueue_task_fair+0x76/0x9f0 _vm_normal_page+0xe5/0x100 zap_pte_range+0x148/0x740 unmap_page_range+0x39a/0x4b0 unmap_vmas+0x42/0x90 unmap_region+0x99/0xf0 ? vma_gap_callbacks_rotate+0x1a/0x20 do_munmap+0x255/0x3a0 vm_munmap+0x54/0x80 SyS_munmap+0x1d/0x30 do_syscall_64+0x74/0x150 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 ... when mprotect(2) gets used on DAX mappings. Also there is a wide variety of other failures that can result from the missing _PAGE_DEVMAP flag when the area gets used by get_user_pages() later. Fix the problem by including _PAGE_DEVMAP in a set of flags that get preserved by mprotect(2). Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP") Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64") Cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-10-09dm: fix report zone remapping to account for partition offsetDamien Le Moal
If dm-linear or dm-flakey are layered on top of a partition of a zoned block device, remapping of the start sector and write pointer position of the zones reported by a report zones BIO must be modified to account for the target table entry mapping (start offset within the device and entry mapping with the dm device). If the target's backing device is a partition of a whole disk, the start sector on the physical device of the partition must also be accounted for when modifying the zone information. However, dm_remap_zone_report() was not considering this last case, resulting in incorrect zone information remapping with targets using disk partitions. Fix this by calculating the target backing device start sector using the position of the completed report zones BIO and the unchanged position and size of the original report zone BIO. With this value calculated, the start sector and write pointer position of the target zones can be correctly remapped. Fixes: 10999307c14e ("dm: introduce dm_remap_zone_report()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09dm cache: destroy migration_cache if cache target registration failedShenghui Wang
Commit 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created") inadvertently introduced this bug when it moved dm_register_target() after the call to KMEM_CACHE(). Fixes: 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created") Cc: stable@vger.kernel.org Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-10-09Merge branch 'ena-fixes'David S. Miller
Arthur Kiyanovski says: ==================== minor bug fixes for ENA Ethernet driver Arthur Kiyanovski (4): net: ena: fix warning in rmmod caused by double iounmap net: ena: fix rare bug when failed restart/resume is followed by driver removal net: ena: fix NULL dereference due to untimely napi initialization net: ena: fix auto casting to boolean ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09net: ena: fix auto casting to booleanArthur Kiyanovski
Eliminate potential auto casting compilation error. Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09net: ena: fix NULL dereference due to untimely napi initializationArthur Kiyanovski
napi poll functions should be initialized before running request_irq(), to handle a rare condition where there is a pending interrupt, causing the ISR to fire immediately while the poll function wasn't set yet, causing a NULL dereference. Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09net: ena: fix rare bug when failed restart/resume is followed by driver removalArthur Kiyanovski
In a rare scenario when ena_device_restore() fails, followed by device remove, an FLR will not be issued. In this case, the device will keep sending asynchronous AENQ keep-alive events, even after driver removal, leading to memory corruption. Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09net: ena: fix warning in rmmod caused by double iounmapArthur Kiyanovski
Memory mapped with devm_ioremap is automatically freed when the driver is disconnected from the device. Therefore there is no need to explicitly call devm_iounmap. Fixes: 0857d92f71b6 ("net: ena: add missing unmap bars on device removal") Fixes: 411838e7b41c ("net: ena: fix rare kernel crash when bar memory remap fails") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-09Merge tag 'kvmarm-fixes-for-4.19-2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/arm fixes for 4.19, take #2 - Correctly order GICv3 SGI registers in the cp15 array
2018-10-09KVM: x86: support CONFIG_KVM_AMD=y with CONFIG_CRYPTO_DEV_CCP_DD=mPaolo Bonzini
SEV requires access to the AMD cryptographic device APIs, and this does not work when KVM is builtin and the crypto driver is a module. Actually the Kconfig conditions for CONFIG_KVM_AMD_SEV try to disable SEV in that case, but it does not work because the actual crypto calls are not culled, only sev_hardware_setup() is. This patch adds two CONFIG_KVM_AMD_SEV checks that gate all the remaining SEV code; it fixes this particular configuration, and drops 5 KiB of code when CONFIG_KVM_AMD_SEV=n. Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-09gfs2: Fix iomap buffered write support for journaled filesAndreas Gruenbacher
Commit 64bc06bb32ee broke buffered writes to journaled files (chattr +j): we'll try to journal the buffer heads of the page being written to in gfs2_iomap_journaled_page_done. However, the iomap code no longer creates buffer heads, so we'll BUG() in gfs2_page_add_databufs. Fix that by creating buffer heads ourself when needed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-10-09arm64: mm: Drop the unused cpu parameterShaokun Zhang
Cpu parameter is never used in flush_context, remove it. Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2018-10-09resource: Clean it up a bitBorislav Petkov
- Drop BUG_ON()s and do normal error handling instead, in find_next_iomem_res(). - Align function arguments on opening braces. - Get rid of local var sibling_only in find_next_iomem_res(). - Shorten unnecessarily long first_level_children_only arg name. Signed-off-by: Borislav Petkov <bp@suse.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Bjorn Helgaas <bhelgaas@google.com> CC: Brijesh Singh <brijesh.singh@amd.com> CC: Dan Williams <dan.j.williams@intel.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Lianbo Jiang <lijiang@redhat.com> CC: Takashi Iwai <tiwai@suse.de> CC: Thomas Gleixner <tglx@linutronix.de> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Vivek Goyal <vgoyal@redhat.com> CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com> CC: bhe@redhat.com CC: dan.j.williams@intel.com CC: dyoung@redhat.com CC: kexec@lists.infradead.org CC: mingo@redhat.com Link: <new submission>
2018-10-09resource: Fix find_next_iomem_res() iteration issueBjorn Helgaas
Previously find_next_iomem_res() used "*res" as both an input parameter for the range to search and the type of resource to search for, and an output parameter for the resource we found, which makes the interface confusing. The current callers use find_next_iomem_res() incorrectly because they allocate a single struct resource and use it for repeated calls to find_next_iomem_res(). When find_next_iomem_res() returns a resource, it overwrites the start, end, flags, and desc members of the struct. If we call find_next_iomem_res() again, we must update or restore these fields. The previous code restored res.start and res.end, but not res.flags or res.desc. Since the callers did not restore res.flags, if they searched for flags IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would incorrectly skip resources unless they were also marked as IORESOURCE_SYSRAM. Fix this by restructuring the interface so it takes explicit "start, end, flags" parameters and uses "*res" only as an output parameter. Based on a patch by Lianbo Jiang <lijiang@redhat.com>. [ bp: While at it: - make comments kernel-doc style. - Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Brijesh Singh <brijesh.singh@amd.com> CC: Dan Williams <dan.j.williams@intel.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Lianbo Jiang <lijiang@redhat.com> CC: Takashi Iwai <tiwai@suse.de> CC: Thomas Gleixner <tglx@linutronix.de> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Vivek Goyal <vgoyal@redhat.com> CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com> CC: bhe@redhat.com CC: dan.j.williams@intel.com CC: dyoung@redhat.com CC: kexec@lists.infradead.org CC: mingo@redhat.com CC: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com
2018-10-09resource: Include resource end in walk_*() interfacesBjorn Helgaas
find_next_iomem_res() finds an iomem resource that covers part of a range described by "start, end". All callers expect that range to be inclusive, i.e., both start and end are included, but find_next_iomem_res() doesn't handle the end address correctly. If it finds an iomem resource that contains exactly the end address, it skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip it: find_next_iomem_res(...) { start = 0x0; end = 0x10000; for (p = next_resource(...)) { # p->start = 0x10000; # p->end = 0x10000; # we *should* return this resource, but this condition is false: if ((p->end >= start) && (p->start < end)) break; Adjust find_next_iomem_res() so it allows a resource that includes the single byte at the end of the range. This is a corner case that we probably don't see in practice. Fixes: 58c1b5b07907 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Brijesh Singh <brijesh.singh@amd.com> CC: Dan Williams <dan.j.williams@intel.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Lianbo Jiang <lijiang@redhat.com> CC: Takashi Iwai <tiwai@suse.de> CC: Thomas Gleixner <tglx@linutronix.de> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Vivek Goyal <vgoyal@redhat.com> CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com> CC: bhe@redhat.com CC: dan.j.williams@intel.com CC: dyoung@redhat.com CC: kexec@lists.infradead.org CC: mingo@redhat.com CC: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com
2018-10-09x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one errorBjorn Helgaas
The only use of KEXEC_BACKUP_SRC_END is as an argument to walk_system_ram_res(): int crash_load_segments(struct kimage *image) { ... walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END, image, determine_backup_region); walk_system_ram_res() expects "start, end" arguments that are inclusive, i.e., the range to be walked includes both the start and end addresses. KEXEC_BACKUP_SRC_END was previously defined as (640 * 1024UL), which is the first address *past* the desired 0-640KB range. Define KEXEC_BACKUP_SRC_END as (640 * 1024UL - 1) so the KEXEC_BACKUP_SRC region is [0-0x9ffff], not [0-0xa0000]. Fixes: dd5f726076cc ("kexec: support for kexec on panic using new system call") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: "H. Peter Anvin" <hpa@zytor.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Brijesh Singh <brijesh.singh@amd.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: Ingo Molnar <mingo@redhat.com> CC: Lianbo Jiang <lijiang@redhat.com> CC: Takashi Iwai <tiwai@suse.de> CC: Thomas Gleixner <tglx@linutronix.de> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Vivek Goyal <vgoyal@redhat.com> CC: baiyaowei@cmss.chinamobile.com CC: bhe@redhat.com CC: dan.j.williams@intel.com CC: dyoung@redhat.com CC: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/153805811578.1157.6948388946904655969.stgit@bhelgaas-glaptop.roam.corp.google.com
2018-10-09x86/mm: Remove spurious fault pkey checkDave Hansen
Spurious faults only ever occur in the kernel's address space. They are also constrained specifically to faults with one of these error codes: X86_PF_WRITE | X86_PF_PROT X86_PF_INSTR | X86_PF_PROT So, it's never even possible to reach spurious_kernel_fault_check() with X86_PF_PK set. In addition, the kernel's address space never has pages with user-mode protections. Protection Keys are only enforced on pages with user-mode protection. This gives us lots of reasons to not check for protection keys in our sprurious kernel fault handling. But, let's also add some warnings to ensure that these assumptions about protection keys hold true. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160231.243A0D6A@viggo.jf.intel.com
2018-10-09x86/mm/vsyscall: Consider vsyscall page part of user address spaceDave Hansen
The vsyscall page is weird. It is in what is traditionally part of the kernel address space. But, it has user permissions and we handle faults on it like we would on a user page: interrupts on. Right now, we handle vsyscall emulation in the "bad_area" code, which is used for both user-address-space and kernel-address-space faults. Move the handling to the user-address-space code *only* and ensure we get there by "excluding" the vsyscall page from the kernel address space via a check in fault_in_kernel_space(). Since the fault_in_kernel_space() check is used on 32-bit, also add a 64-bit check to make it clear we only use this path on 64-bit. Also move the unlikely() to be in is_vsyscall_vaddr() itself. This helps clean up the kernel fault handling path by removing a case that can happen in normal[1] operation. (Yeah, yeah, we can argue about the vsyscall page being "normal" or not.) This also makes sanity checks easier, like the "we never take pkey faults in the kernel address space" check in the next patch. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160230.6E9336EE@viggo.jf.intel.com
2018-10-09x86/mm: Add vsyscall address helperDave Hansen
We will shortly be using this check in two locations. Put it in a helper before we do so. Let's also insert PAGE_MASK instead of the open-coded ~0xfff. It is easier to read and also more obviously correct considering the implicit type conversion that has to happen when it is not an implicit 'unsigned long'. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160228.C593509B@viggo.jf.intel.com
2018-10-09x86/mm: Fix exception table commentsDave Hansen
The comments here are wrong. They are too absolute about where faults can occur when running in the kernel. The comments are also a bit hard to match up with the code. Trim down the comments, and make them more precise. Also add a comment explaining why we are doing the bad_area_nosemaphore() path here. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160227.077DDD7A@viggo.jf.intel.com
2018-10-09x86/mm: Add clarifying comments for user addr spaceDave Hansen
The SMAP and Reserved checking do not have nice comments. Add some to clarify and make it match everything else. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160225.FFD44B8D@viggo.jf.intel.com
2018-10-09x86/mm: Break out user address space handlingDave Hansen
The last patch broke out kernel address space handing into its own helper. Now, do the same for user address space handling. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160223.9C4F6440@viggo.jf.intel.com
2018-10-09x86/mm: Break out kernel address space handlingDave Hansen
The page fault handler (__do_page_fault()) basically has two sections: one for handling faults in the kernel portion of the address space and another for faults in the user portion of the address space. But, these two parts don't stick out that well. Let's make that more clear from code separation and naming. Pull kernel fault handling into its own helper, and reflect that naming by renaming spurious_fault() -> spurious_kernel_fault(). Also, rewrite the vmalloc() handling comment a bit. It was a bit stale and also glossed over the reserved bit handling. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160222.401F4E10@viggo.jf.intel.com
2018-10-09x86/mm: Clarify hardware vs. software "error_code"Dave Hansen
We pass around a variable called "error_code" all around the page fault code. Sounds simple enough, especially since "error_code" looks like it exactly matches the values that the hardware gives us on the stack to report the page fault error code (PFEC in SDM parlance). But, that's not how it works. For part of the page fault handler, "error_code" does exactly match PFEC. But, during later parts, it diverges and starts to mean something a bit different. Give it two names for its two jobs. The place it diverges is also really screwy. It's only in a spot where the hardware tells us we have kernel-mode access that occurred while we were in usermode accessing user-controlled address space. Add a warning in there. Cc: x86@kernel.org Cc: Jann Horn <jannh@google.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180928160220.4A2272C9@viggo.jf.intel.com
2018-10-09x86/mm/tlb: Make lazy TLB mode lazierRik van Riel
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. CPUs in lazy TLB mode remain part of the mm_cpumask(mm), both because that allows TLB flush IPIs to be sent at page table freeing time, and because the cache line bouncing on the mm_cpumask(mm) was responsible for about half the CPU use in switch_mm_irqs_off(). We can change native_flush_tlb_others() without touching other (paravirt) implementations of flush_tlb_others() because we'll be flushing less. The existing implementations flush more and are therefore still correct. Cc: npiggin@gmail.com Cc: mingo@kernel.org Cc: will.deacon@arm.com Cc: kernel-team@fb.com Cc: luto@kernel.org Cc: hpa@zytor.com Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180926035844.1420-8-riel@surriel.com
2018-10-09x86/mm/tlb: Add freed_tables element to flush_tlb_infoRik van Riel
Pass the information on to native_flush_tlb_others. No functional changes. Cc: npiggin@gmail.com Cc: mingo@kernel.org Cc: will.deacon@arm.com Cc: songliubraving@fb.com Cc: kernel-team@fb.com Cc: hpa@zytor.com Cc: luto@kernel.org Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180926035844.1420-7-riel@surriel.com
2018-10-09x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_rangeRik van Riel
Add an argument to flush_tlb_mm_range to indicate whether page tables are about to be freed after this TLB flush. This allows for an optimization of flush_tlb_mm_range to skip CPUs in lazy TLB mode. No functional changes. Cc: npiggin@gmail.com Cc: mingo@kernel.org Cc: will.deacon@arm.com Cc: songliubraving@fb.com Cc: kernel-team@fb.com Cc: luto@kernel.org Cc: hpa@zytor.com Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180926035844.1420-6-riel@surriel.com
2018-10-09smp,cpumask: introduce on_each_cpu_cond_maskRik van Riel
Introduce a variant of on_each_cpu_cond that iterates only over the CPUs in a cpumask, in order to avoid making callbacks for every single CPU in the system when we only need to test a subset. Cc: npiggin@gmail.com Cc: mingo@kernel.org Cc: will.deacon@arm.com Cc: songliubraving@fb.com Cc: kernel-team@fb.com Cc: hpa@zytor.com Cc: luto@kernel.org Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180926035844.1420-5-riel@surriel.com
2018-10-09smp: use __cpumask_set_cpu in on_each_cpu_condRik van Riel
The code in on_each_cpu_cond sets CPUs in a locally allocated bitmask, which should never be used by other CPUs simultaneously. There is no need to use locked memory accesses to set the bits in this bitmap. Switch to __cpumask_set_cpu. Cc: npiggin@gmail.com Cc: mingo@kernel.org Cc: will.deacon@arm.com Cc: songliubraving@fb.com Cc: kernel-team@fb.com Cc: hpa@zytor.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180926035844.1420-4-riel@surriel.com
2018-10-09x86/mm/tlb: Restructure switch_mm_irqs_off()Rik van Riel
Move some code that will be needed for the lazy -> !lazy state transition when a lazy TLB CPU has gotten out of date. No functional changes, since the if (real_prev == next) branch always returns. (cherry picked from commit 61d0beb5796ab11f7f3bf38cb2eccc6579aaa70b) Cc: npiggin@gmail.com Cc: efault@gmx.de Cc: will.deacon@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: songliubraving@fb.com Cc: kernel-team@fb.com Cc: hpa@zytor.com Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com
2018-10-09x86/mm/tlb: Always use lazy TLB modeRik van Riel
On most workloads, the number of context switches far exceeds the number of TLB flushes sent. Optimizing the context switches, by always using lazy TLB mode, speeds up those workloads. This patch results in about a 1% reduction in CPU use on a two socket Broadwell system running a memcache like workload. Cc: npiggin@gmail.com Cc: efault@gmx.de Cc: will.deacon@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: hpa@zytor.com Cc: luto@kernel.org Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> (cherry picked from commit 95b0e6357d3e4e05349668940d7ff8f3b7e7e11e) Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
2018-10-09x86/mm: Page size aware flush_tlb_mm_range()Peter Zijlstra
Use the new tlb_get_unmap_shift() to determine the stride of the INVLPG loop. Cc: Nick Piggin <npiggin@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-10-09Merge branch 'tlb/asm-generic' of ↵Peter Zijlstra
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into x86/mm Pull in the generic mmu_gather changes from the ARM64 tree such that we can put x86 specific things on top as well.
2018-10-09lightnvm: pblk: guarantee that backpointer is respected on writer stallJavier González
pblk's write buffer must guarantee that it respects the device's constrains for reads (i.e., mw_cunits). This is done by maintaining a backpointer that updates the L2P table as entries wrap up, making them point to the media instead of pointing to the write buffer. This mechanism can race in case that the write thread stalls, as the write pointer will protect the last written entry, thus disregarding the read constrains. This patch adds an extra check on wrap up, making sure that the threshold is respected at all times, preventing new entries to overwrite committed data, also in case of write thread stall. Reported-by: Heiner Litz <hlitz@ucsc.edu> Signed-off-by: Javier González <javier@cnexlabs.com> Reviewed-by: Heiner Litz <hlitz@ucsc.edu> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-09lightnvm: pblk: consider max hw sectors supported for max_write_pgsZhoujie Wu
When do GC, the number of read/write sectors are determined by max_write_pgs(see gc_rq preparation in pblk_gc_line_prepare_ws). Due to max_write_pgs doesn't consider max hw sectors supported by nvme controller(128K), which leads to GC tries to read 64 * 4K in one command, and see below error caused by pblk_bio_map_addr in function pblk_submit_read_gc. [ 2923.005376] pblk: could not add page to bio [ 2923.005377] pblk: could not allocate GC bio (18446744073709551604) Signed-off-by: Zhoujie Wu <zjwu@marvell.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-09lightnvm: pblk: fix error handling of pblk_lines_init()Wei Yongjun
In the too many bad blocks error handling case, we should release all the allocated resources, otherwise it will cause memory leak. Fixes: 2deeefc02dff ("lightnvm: pblk: fail gracefully on line alloc. failure") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>