linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2018-04-04	rxrpc: Fix undefined packet handling	David Howells
	By analogy with other Rx implementations, RxRPC packet types 9, 10 and 11 should just be discarded rather than being aborted like other undefined packet types. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-04	MAINTAINERS: Add John Garry as maintainer for HiSilicon LPC driver	John Garry
	Add John Garry as maintainer for drivers/bus/hisi_lpc.c, the HiSilicon LPC driver. Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-04-04	HISI LPC: Add ACPI support	John Garry
	Based on the previous patches, this patch supports the LPC host on Hip06/Hip07 for ACPI FW. It is the responsibility of the LPC host driver to enumerate the child devices, as the ACPI scan code will not enumerate children of "indirect IO" hosts. The ACPI table for the LPC host controller and the child devices is in the following format: Device (LPC0) { Name (_HID, "HISI0191") // HiSi LPC Name (_CRS, ResourceTemplate () { Memory32Fixed (ReadWrite, 0xa01b0000, 0x1000) }) } Device (LPC0.IPMI) { Name (_HID, "IPI0001") Name (LORS, ResourceTemplate() { QWordIO ( ResourceConsumer, MinNotFixed, // _MIF MaxNotFixed, // _MAF PosDecode, EntireRange, 0x0, // _GRA 0xe4, // _MIN 0x3fff, // _MAX 0x0, // _TRA 0x04, // _LEN , , BTIO ) }) Since the IO resources of the child devices need to be translated from LPC bus addresses to logical PIO addresses, and we shouldn't modify the resources of the devices generated in the FW scan, a per-child MFD is created as a substitute. The MFD IO resources will be the translated bus addresses of the ACPI child. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2018-04-04	ACPI / scan: Do not enumerate Indirect IO host children	John Garry
	Through the logical PIO framework, systems which otherwise have no IO space access to legacy ISA/LPC devices may access these devices through so-called "indirect IO" method. In this, IO space accesses for non-PCI hosts are redirected to a host LLDD to manually generate the IO space (bus) accesses. Hosts are able to register a region in logical PIO space to map to its bus address range. Indirect IO child devices have an associated host-specific bus address. Special translation is required to map between a logical PIO address for a device and its host bus address. Since in the ACPI tables the child device IO resources would be the host-specific values, it is required the ACPI scan code should not enumerate these devices, and that this should be the responsibility of the host driver so that it can "fixup" the resources so that they map to the appropriate logical PIO addresses. To avoid enumerating these child devices, add a check from acpi_device_enumeration_by_parent() as to whether the parent for a device is a member of a known list of "indirect IO" hosts. For now, the HiSilicon LPC host controller ID is added. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-04	ACPI / scan: Rename acpi_is_serial_bus_slave() for more general use	John Garry
	Currently the ACPI scan has special handling for serial bus slaves, in that it makes it the responsibility of the slave device's parent to enumerate the device. To support other types of slave devices which require the same special handling but where the bus is not strictly a serial bus, such as devices on the HiSilicon LPC controller bus, rename acpi_is_serial_bus_slave() to acpi_device_enumeration_by_parent(), so that the name can fit the wider purpose. Also rename the associated device flag acpi_device_flags.serial_bus_slave to .enumeration_by_parent. Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-04-04	HISI LPC: Support the LPC host on Hip06/Hip07 with DT bindings	Zhichang Yuan
	The low-pin-count (LPC) interface of Hip06/Hip07 accesses I/O port space of peripherals. Implement the LPC host controller driver which performs the I/O operations on the underlying hardware. We don't want to touch existing drivers such as ipmi-bt, so this driver applies the indirect-IO introduced in the previous patch after registering an indirect-IO node to the indirect-IO devices list which will be searched by the I/O accessors to retrieve the host-local I/O port. The driver config is set as a bool instead of a tristate. The reason here is that, by the very nature of the driver providing a logical PIO range, it does not make sense to have this driver as a loadable module. Another more specific reason is that the Huawei D03 board which includes Hip06 SoC requires the LPC bus for UART console, so should be built in. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Zou Rongrong <zourongrong@huawei.com> Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Rob Herring <robh@kernel.org> # dts part
2018-04-04	of: Add missing I/O range exception for indirect-IO devices	Zhichang Yuan
	There are some special ISA/LPC devices that work on a specific I/O range where it is not correct to specify a 'ranges' property in the DTS parent node as CPU addresses translated from DTS node are only for memory space on some architectures, such as ARM64. Without the parent 'ranges' property, of_translate_address() returns an error. Here we add special handling for this case. During the OF address translation, some checking will be performed to identify whether the device node is registered as indirect-IO. If it is, the I/O translation will be done in a different way from that one of PCI MMIO. In this way, the I/O 'reg' property of the special ISA/LPC devices will be parsed correctly. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> # earlier draft Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Rob Herring <robh@kernel.org>
2018-04-04	PCI: Apply the new generic I/O management on PCI IO hosts	Zhichang Yuan
	After introducing the new generic I/O space management (Logical PIO), the original PCI MMIO relevant helpers need to be updated based on the new interfaces defined in logical PIO. Adapt the corresponding code to match the changes introduced by logical PIO. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> # earlier draft Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2018-04-04	PCI: Add fwnode handler as input param of pci_register_io_range()	Gabriele Paoloni
	In preparation for having the PCI MMIO helpers use the new generic I/O space management (logical PIO) we need to add the fwnode handler as an extra input parameter. Changes the signature of pci_register_io_range() and its callers as needed. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Rob Herring <robh@kernel.org>
2018-04-04	PCI: Remove __weak tag from pci_register_io_range()	Gabriele Paoloni
	pci_register_io_range() has only one definition, so there is no need for the __weak attribute. Remove it. Tested-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
2018-04-04	fscache: Attach the index key and aux data to the cookie	David Howells
	Attach copies of the index key and auxiliary data to the fscache cookie so that: (1) The callbacks to the netfs for this stuff can be eliminated. This can simplify things in the cache as the information is still available, even after the cache has relinquished the cookie. (2) Simplifies the locking requirements of accessing the information as we don't have to worry about the netfs object going away on us. (3) The cache can do lazy updating of the coherency information on disk. As long as the cache is flushed before reboot/poweroff, there's no need to update the coherency info on disk every time it changes. (4) Cookies can be hashed or put in a tree as the index key is easily available. This allows: (a) Checks for duplicate cookies can be made at the top fscache layer rather than down in the bowels of the cache backend. (b) Caching can be added to a netfs object that has a cookie if the cache is brought online after the netfs object is allocated. A certain amount of space is made in the cookie for inline copies of the data, but if it won't fit there, extra memory will be allocated for it. The downside of this is that live cache operation requires more memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Anna Schumaker <anna.schumaker@netapp.com> Tested-by: Steve Dickson <steved@redhat.com>
2018-04-04	fscache: Add more tracepoints	David Howells
	Add more tracepoints to fscache, including: () fscache_page - Tracks netfs pages known to fscache. () fscache_check_page - Tracks the netfs querying whether a page is pending storage. () fscache_wake_cookie - Tracks cookies being woken up after a page completes/aborts storage in the cache. () fscache_op - Tracks operations being initialised. () fscache_wrote_page - Tracks return of the backend write_page op. () fscache_gang_lookup - Tracks lookup of pages to be stored in the write operation. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	fscache: Add tracepoints	David Howells
	Add some tracepoints to fscache: () fscache_cookie - Tracks a cookie's usage count. () fscache_netfs - Logs registration of a network filesystem, including the pointer to the cookie allocated. () fscache_acquire - Logs cookie acquisition. () fscache_relinquish - Logs cookie relinquishment. () fscache_enable - Logs enablement of a cookie. () fscache_disable - Logs disablement of a cookie. () fscache_osm - Tracks execution of states in the object state machine. and cachefiles: () cachefiles_ref - Tracks a cachefiles object's usage count. () cachefiles_lookup - Logs result of lookup_one_len(). () cachefiles_mkdir - Logs result of vfs_mkdir(). () cachefiles_create - Logs result of vfs_create(). () cachefiles_unlink - Logs calls to vfs_unlink(). () cachefiles_rename - Logs calls to vfs_rename(). () cachefiles_mark_active - Logs an object becoming active. () cachefiles_wait_active - Logs a wait for an old object to be destroyed. () cachefiles_mark_inactive - Logs an object becoming inactive. (*) cachefiles_mark_buried - Logs the burial of an object. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	fscache: Fix hanging wait on page discarded by writeback	David Howells
	If the fscache asynchronous write operation elects to discard a page that's pending storage to the cache because the page would be over the store limit then it needs to wake the page as someone may be waiting on completion of the write. The problem is that the store limit may be updated by a different asynchronous operation - and so may miss the write - and that the store limit may not even get updated until later by the netfs. Fix the kernel hang by making fscache_write_op() mark as written any pages that are over the limit. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	fscache: Detect multiple relinquishment of a cookie	David Howells
	Report if an fscache cookie is relinquished multiple times by the netfs. Signed-off-by: David <dhowells@redhat.com>
2018-04-04	fscache: Pass the correct cancelled indications to fscache_op_complete()	David Howells
	The last parameter to fscache_op_complete() is a bool indicating whether or not the operation was cancelled. A lot of the time the inverse value is given or no differentiation is made. Fix this. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	fscache, cachefiles: Fix checker warnings	David Howells
	Fix a couple of checker warnings in fscache and cachefiles: (1) fscache_n_op_requeue is never used, so get rid of it. (2) cachefiles_uncache_page() is passed in a lock that it releases, so this needs annotating. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	afs: Be more aggressive in retiring cached vnodes	David Howells
	When relinquishing cookies, either due to iget failure or to inode eviction, retire a cookie if we think the corresponding vnode got deleted on the server rather than just letting it lie in the cache. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	afs: Use the vnode ID uniquifier in the cache key not the aux data	David Howells
	AFS vnodes (files) are referenced by a triplet of { volume ID, vnode ID, uniquifier }. Currently, kafs is only using the vnode ID as the file key in the volume fscache index and checking the uniquifier on cookie acquisition against the contents of the auxiliary data stored in the cache. Unfortunately, this is subject to a race in which an FS.RemoveFile or FS.RemoveDir op is issued against the server but the local afs inode isn't torn down and disposed off before another thread issues something like FS.CreateFile. The latter then gets given the vnode ID that just got removed, but with a new uniquifier and a cookie collision occurs in the cache because the cookie is only keyed on the vnode ID whereas the inode is keyed on the vnode ID plus the uniquifier. Fix this by keying the cookie on the uniquifier in addition to the vnode ID and dropping the uniquifier from the auxiliary data supplied. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	afs: Invalidate cache on server data change	David Howells
	Invalidate any data stored in fscache for a vnode that changes on the server so that we don't end up with the cache in a bad state locally. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-04	cxl: Fix possible deadlock when processing page faults from cxllib	Frederic Barrat
	cxllib_handle_fault() is called by an external driver when it needs to have the host resolve page faults for a buffer. The buffer can cover several pages and VMAs. The function iterates over all the pages used by the buffer, based on the page size of the VMA. To ensure some stability while processing the faults, the thread T1 grabs the mm->mmap_sem semaphore with read access (R1). However, when processing a page fault for a single page, one of the underlying functions, copro_handle_mm_fault(), also grabs the same semaphore with read access (R2). So the thread T1 takes the semaphore twice. If another thread T2 tries to access the semaphore in write mode W1 (say, because it wants to allocate memory and calls 'brk'), then that thread T2 will have to wait because there's a reader (R1). If the thread T1 is processing a new page at that time, it won't get an automatic grant at R2, because there's now a writer thread waiting (T2). And we have a deadlock. The timeline is: 1. thread T1 owns the semaphore with read access R1 2. thread T2 requests write access W1 and waits 3. thread T1 requests read access R2 and waits The fix is for the thread T1 to release the semaphore R1 once it got the information it needs from the current VMA. The address space/VMAs could evolve while T1 iterates over the full buffer, but in the unlikely case where T1 misses a page, the external driver will raise a new page fault when retrying the memory access. Fixes: 3ced8d730063 ("cxl: Export library to support IBM XSL") Cc: stable@vger.kernel.org # 4.13+ Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it	Naveen N. Rao
	We get the below warning if we try to use kexec on P9: kexec_core: Starting new kernel WARNING: CPU: 0 PID: 1223 at arch/powerpc/kernel/process.c:826 __set_breakpoint+0xb4/0x140 [snip] NIP __set_breakpoint+0xb4/0x140 LR kexec_prepare_cpus_wait+0x58/0x150 Call Trace: 0xc0000000ee70fb20 (unreliable) 0xc0000000ee70fb20 default_machine_kexec+0x234/0x2c0 machine_kexec+0x84/0x90 kernel_kexec+0xd8/0xe0 SyS_reboot+0x214/0x2c0 system_call+0x58/0x6c This happens since we are trying to clear hw breakpoint on POWER9, though we don't have CPU_FTR_DAWR enabled. Guard __set_breakpoint() within hw_breakpoint_disable() with ppc_breakpoint_available() to address this. Fixes: 9654153158d3 ("powerpc: Disable DAWR in the base POWER9 CPU features") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	openrisc: Set CONFIG_MULTI_IRQ_HANDLER	Palmer Dabbelt
	arm has an optional MULTI_IRQ_HANDLER, which openrisc copied but didn't make optional. The multi irq handler infrastructure has been copied to generic code selectable with a new config symbol. That symbol can be selected by randconfig builds and can cause build breakage. Introduce CONFIG_MULTI_IRQ_HANDLER as an intermediate step which prevents the core config symbol from being selected. The openrisc local config symbol will be removed once openrisc gets converted to the generic code. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20180404043130.31277-3-palmer@sifive.com
2018-04-04	arm64: Set CONFIG_MULTI_IRQ_HANDLER	Palmer Dabbelt
	arm has an optional MULTI_IRQ_HANDLER, which arm64 copied but didn't make optional. The multi irq handler infrastructure has been copied to generic code selectable with a new config symbol. That symbol can be selected by randconfig builds and can cause build breakage. Introduce CONFIG_MULTI_IRQ_HANDLER as an intermediate step which prevents the core config symbol from being selected. The arm64 local config symbol will be removed once arm64 gets converted to the generic code. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20180404043130.31277-2-palmer@sifive.com
2018-04-04	genirq: Make GENERIC_IRQ_MULTI_HANDLER depend on !MULTI_IRQ_HANDLER	Palmer Dabbelt
	These config switches enable the same code in the core and the not yet converted architecture code. They can be selected both by randconfig builds and cause linker error because the same symbols are defined twice. Make the new GENERIC_IRQ_MULTI_HANDLER depend on !MULTI_IRQ_HANDLER to prevent that. The dependency will be removed once all architectures are converted over. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: https://lkml.kernel.org/r/20180404043130.31277-4-palmer@sifive.com
2018-04-04	kernel/bpf/syscall: fix warning defined but not used	Anders Roxell
	There will be a build warning -Wunused-function if CONFIG_CGROUP_BPF isn't defined, since the only user is inside #ifdef CONFIG_CGROUP_BPF: kernel/bpf/syscall.c:1229:12: warning: ‘bpf_prog_attach_check_attach_type’ defined but not used [-Wunused-function] static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Current code moves function bpf_prog_attach_check_attach_type inside ifdef CONFIG_CGROUP_BPF. Fixes: 5e43f899b03a ("bpf: Check attach type at prog load time") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-04	bpf: sockmap, duplicates release calls may NULL sk_prot	John Fastabend
	It is possible to have multiple ULP tcp_release call paths in flight if a sock is closed and simultaneously being removed from the sockmap control path. The result would be setting the sk_prot to the saved values on the first iteration and then on the second iteration setting the value to NULL. This patch resolves this by ensuring we only reset the sk_prot pointer if we have a valid saved state to set. Fixes: 4f738adba30a7 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-04	bpf: sockmap, free memory on sock close with cork data	John Fastabend
	If a socket with pending cork data is closed we do not return the memory to the socket until the garbage collector free's the psock structure. The garbage collector though can run after the sock has completed its close operation. If this ordering happens the sock code will through a WARN_ON because there is still outstanding memory accounted to the sock. To resolve this ensure we return memory to the sock when a socket is closed. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-04	powerpc/mm/radix: Update command line parsing for disable_radix	Aneesh Kumar K.V
	kernel parameter disable_radix takes different options disable_radix=yes\|no\|1\|0 or just disable_radix. prom_init parsing is not supporting these options. Fixes: 1fd6c0220710 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	powerpc/mm/radix: Parse disable_radix commandline correctly.	Aneesh Kumar K.V
	kernel parameter disable_radix takes different options disable_radix=yes\|no\|1\|0 or just disable_radix. When using the later format we get below error. `Malformed early option 'disable_radix'` Fixes: 1fd6c0220710 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb	Aneesh Kumar K.V
	With 64k page size, we have hugetlb pte entries at the pmd and pud level for book3s64. We don't need to create a separate page table cache for that. With 4k we need to make sure hugepd page table cache for 16M is placed at PUD level and 16G at the PGD level. Simplify all these by not using HUGEPD_PD_SHIFT which is confusing for book3s64. Without this patch, with 64k page size we create pagetable caches with shift value 10 and 7 which are not used at all. Fixes: 419df06eea5b ("powerpc: Reduce the PTE_INDEX_SIZE") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix	Aneesh Kumar K.V
	With split PTL (page table lock) config, we allocate the level 4 (leaf) page table using pte fragment framework instead of slab cache like other levels. This was done to enable us to have split page table lock at the level 4 of the page table. We use page->plt backing the all the level 4 pte fragment for the lock. Currently with Radix, we use only 16 fragments out of the allocated page. In radix each fragment is 256 bytes which means we use only 4k out of the allocated 64K page wasting 60k of the allocated memory. This was done earlier to keep it closer to hash. This patch update the pte fragment count to 256, thereby using the full 64K page and reducing the memory usage. Performance tests shows really low impact even with THP disabled. With THP disabled we will be contenting further less on level 4 ptl and hence the impact should be further low. 256 threads: without patch (10 runs of ./ebizzy -m -n 1000 -s 131072 -S 100) median = 15678.5 stdev = 42.1209 with patch: median = 15354 stdev = 194.743 This is with THP disabled. With THP enabled the impact of the patch will be less. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	powerpc/mm/keys: Update documentation and remove unnecessary check	Aneesh Kumar K.V
	Adds more code comments. We also remove an unnecessary pkey check after we check for pkey error in this patch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-04	Merge branch 'old.dcache' into work.dcache	Al Viro

2018-04-03	RDMA: Use ib_gid_attr during GID modification	Parav Pandit
	Now that ib_gid_attr contains device, port and index, simplify the provider APIs add_gid() and del_gid() to use device, port and index fields from the ib_gid_attr attributes structure. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	IB/providers: Avoid null netdev check for RoCE	Parav Pandit
	Now that IB core GID cache ensures that all RoCE entries have an associated netdev remove null checks from the provider drivers for clarity. Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	IB/providers: Avoid zero GID check for RoCE	Parav Pandit
	Now that the IB core GID cache ensures that a zero GID doesn't exist in the GID table remove zero GID checks from the provider drivers for clarity. Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	IB/core: Refactor GID modify code for RoCE	Parav Pandit
	Code is refactored to prepare separate functions for RoCE which can do more complex operations related to reference counting, while still maintainining code readability. This includes (a) Simplification to not perform netdevice checks and modifications for IB link layer. (b) Do not add RoCE GID entry which has NULL netdevice; instead return an error. (c) If GID addition fails at provider level add_gid(), do not add the entry in the cache and keep the entry marked as INVALID. (d) Simplify and reuse the ib_cache_gid_add()/del() routines so that they can be used even for modifying default GIDs. This avoid some code duplication in modifying default GIDs. (e) find_gid() routine refers to the data entry flags to qualify a GID as valid or invalid GID rather than depending on attributes and zeroness of the GID content. (f) gid_table_reserve_default() sets the GID default attribute at beginning while setting up the GID table. There is no need to use default_gid flag in low level functions such as write_gid(), add_gid(), del_gid(), as they never need to update the DEFAULT property of the GID entry while during GID table update. As as result of this refactor, reserved GID 0:0:0:0:0:0:0:0 is no longer searchable as described below. A unicast GID entry of 0:0:0:0:0:0:0:0 is Reserved GID as per the IB spec version 1.3 section 4.1.1, point (6) whose snippet is below. "The unicast GID address 0:0:0:0:0:0:0:0 is reserved - referred to as the Reserved GID. It shall never be assigned to any endport. It shall not be used as a destination address or in a global routing header (GRH)." GID table cache now only stores valid GID entries. Before this patch, Reserved GID 0:0:0:0:0:0:0:0 was searchable in the GID table using ib_find_cached_gid_by_port() and other similar find routines. Zero GID is no longer searchable as it shall not to be present in GRH or path recored entry as described in IB spec version 1.3 section 4.1.1, point (6), section 12.7.10 and section 12.7.20. ib_cache_update() is simplified to check link layer once, use unified locking scheme for all link layers, removed temporary gid table allocation/free logic. Additionally, (a) Expand ib_gid_attr to store port and index so that GID query routines can get port and index information from the attribute structure. (b) Expand ib_gid_attr to store device as well so that in future code when GID reference counting is done, device is used to reach back to the GID table entry. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	IB/core: Simplify ib_query_gid to always refer to cache	Parav Pandit
	Currently following inconsistencies exist. 1. ib_query_gid() returns GID from the software cache for a RoCE port and returns GID from the HCA for an IB port. This is incorrect because software GID cache is maintained regardless of HCA port type. 2. GID is queries from the HCA via ib_query_gid and updated in the software cache for IB link layer. Both of them might not be in sync. ULPs such as SRP initiator, SRP target, IPoIB driver have historically used ib_query_gid() API to query the GID. However CM used cached version during CM processing, When software cache was introduced, this inconsitency remained. In order to simplify, improve readability and avoid link layer specific above inconsistencies, this patch brings following changes. 1. ib_query_gid() always refers to the cache layer regardless of link layer. 2. cache module who reads the GID entry from HCA and builds the cache, directly invokes the HCA provider verb's query_gid() callback function. 3. ib_query_port() is being called in early stage where GID cache is not yet build while reading port immutable property. Therefore it needs to read the default GID from the HCA for IB link layer to publish the subnet prefix. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	RDMA/providers: Simplify query_gid callback of RoCE providers	Parav Pandit
	ib_query_gid() fetches the GID from the software cache maintained in ib_core for RoCE ports. Therefore, simplify the provider drivers for RoCE to treat query_gid() callback as never called for RoCE, and only require non-RoCE devices to implement it. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	RDMA/core: Update query_gid documentation for HCA drivers	Parav Pandit
	query_gid() should return right GID value for iWarp and IB link layers. It is a no-op for RoCE link layer. Update the documentation to reflect this. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	RDMA/ucma: Don't allow setting RDMA_OPTION_IB_PATH without an RDMA device	Roland Dreier
	Check to make sure that ctx->cm_id->device is set before we use it. Otherwise userspace can trigger a NULL dereference by doing RDMA_USER_CM_CMD_SET_OPTION on an ID that is not bound to a device. Cc: <stable@vger.kernel.org> Reported-by: <syzbot+a67bc93e14682d92fc2f@syzkaller.appspotmail.com> Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03	Merge branch 'userns-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There was a lot of work this cycle fixing bugs that were discovered after the merge window and getting everything ready where we can reasonably support fully unprivileged fuse. The bug fixes you already have and much of the unprivileged fuse work is coming in via other trees. Still left for fully unprivileged fuse is figuring out how to cleanly handle .set_acl and .get_acl in the legacy case, and properly handling of evm xattrs on unprivileged mounts. Included in the tree is a cleanup from Alexely that replaced a linked list with a statically allocated fix sized array for the pid caches, which simplifies and speeds things up. Then there is are some cleanups and fixes for the ipc namespace. The motivation was that in reviewing other code it was discovered that access ipc objects from different pid namespaces recorded pids in such a way that when asked the wrong pids were returned. In the worst case there has been a measured 30% performance impact for sysvipc semaphores. Other test cases showed no measurable performance impact. Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc performance both gave the nod that this is good enough. Casey Schaufler and James Morris have given their approval to the LSM side of the changes. I simplified the types and the code dealing with sysvipc to pass just kern_ipc_perm for all three types of ipc. Which reduced the header dependencies throughout the kernel and simplified the lsm code. Which let me work on the pid fixes without having to worry about trivial changes causing complete kernel recompiles" * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc/shm: Fix pid freeing. ipc/shm: fix up for struct file no longer being available in shm.h ipc/smack: Tidy up from the change in type of the ipc security hooks ipc: Directly call the security hook in ipc_ops.associate ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces. ipc/util: Helpers for making the sysvipc operations pid namespace aware ipc: Move IPCMNI from include/ipc.h into ipc/util.h msg: Move struct msg_queue into ipc/msg.c shm: Move struct shmid_kernel into ipc/shm.c sem: Move struct sem and struct sem_array into ipc/sem.c msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks pidns: simpler allocation of pid_* caches
2018-04-03	orangefs: document package install and xfstests procedure	Martin Brandenburg
	Unless one is working on the userspace code, there's no need to compile OrangeFS. The package works just fine. (But leave the documentation for building from source since not everyone uses distributions which include the package.) Also document the process to run xfstests against OrangeFS. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-04-03	orangefs: remove unused code	Martin Brandenburg
	Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-04-03	orangefs: make several *_operations structs static	Martin Brandenburg
	Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-04-03	orangefs: implement vm_ops->fault	Martin Brandenburg
	Must retrieve size before running filemap_fault so the kernel has an up-to-date size. This should have been caught by xfstests generic/246, but it was masked by orangefs_new_inode, which set i_size to PAGE_SIZE. When nothing caused a getattr prior to a pagefault, i_size was still PAGE_SIZE. Since xfstests only read 10 bytes, it did not catch this bug. When orangefs_new_inode was modified to perform a getattr instead, i_size was set to zero, as it was a newly created file. Then orangefs_file_write_iter did NOT set i_size. Instead it invalidated the attribute cache, which should have caused the next caller to retrieve i_size. But the fault handler did not know it was supposed to retrieve i_size. So during xfstests, i_size was still zero, and filemap_fault returned VM_FAULT_SIGBUS. Fixes xfstests generic/452. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-04-03	f2fs: remain written times to update inode during fsync	Jaegeuk Kim
	This fixes xfstests/generic/392. The failure was caused by different times between 1) one marked in the last fsync(2) call and 2) the other given by roll-forward recovery after power-cut. The reason was that we skipped updating inode block at 1), since its i_size was recoverable along with 4KB-aligned data writes, which was fixed by: "f2fs: fix a wrong condition in f2fs_skip_inode_update" Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-04-04	powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead	Nicholas Piggin
	When stop is executed with EC=ESL=0, it appears to execute like a normal instruction (resuming from NIP when woken by interrupt). So all the save/restore handling can be avoided completely. In particular NV GPRs do not have to be saved, and MSR does not have to be switched back to kernel MSR. So move the test for EC=ESL=0 sleep states out to power9_idle_stop, and return directly to the caller after stop in that case. This improves performance for ping-pong benchmark with the stop0_lite idle state by 2.54% for 2 threads in the same core, and 2.57% for different cores. Performance increase with HV_POSSIBLE defined will be improved further by avoiding the hwsync. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-03	Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	Linus Torvalds
	Pull workqueue updates from Tejun Heo: "rcu_work addition and a couple trivial changes" * 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: remove the comment about the old manager_arb mutex workqueue: fix the comments of nr_idle fs/aio: Use rcu_work instead of explicit rcu and work item cgroup: Use rcu_work instead of explicit rcu and work item RCU, workqueue: Implement rcu_work