summaryrefslogtreecommitdiff
path: root/drivers/nvdimm/bus.c
AgeCommit message (Collapse)Author
2019-12-01Merge tag 'libnvdimm-for-5.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The highlight this cycle is continuing integration fixes for PowerPC and some resulting optimizations. Summary: - Updates to better support vmalloc space restrictions on PowerPC platforms. - Cleanups to move common sysfs attributes to core 'struct device_type' objects. - Export the 'target_node' attribute (the effective numa node if pmem is marked online) for regions and namespaces. - Miscellaneous fixups and optimizations" * tag 'libnvdimm-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) MAINTAINERS: Remove Keith from NVDIMM maintainers libnvdimm: Export the target_node attribute for regions and namespaces dax: Add numa_node to the default device-dax attributes libnvdimm: Simplify root read-only definition for the 'resource' attribute dax: Simplify root read-only definition for the 'resource' attribute dax: Create a dax device_type libnvdimm: Move nvdimm_bus_attribute_group to device_type libnvdimm: Move nvdimm_attribute_group to device_type libnvdimm: Move nd_mapping_attribute_group to device_type libnvdimm: Move nd_region_attribute_group to device_type libnvdimm: Move nd_numa_attribute_group to device_type libnvdimm: Move nd_device_attribute_group to device_type libnvdimm: Move region attribute group definition libnvdimm: Move attribute groups to device type libnvdimm: Remove prototypes for nonexistent functions libnvdimm/btt: fix variable 'rc' set but not used libnvdimm/pmem: Delete include of nd-core.h libnvdimm/namespace: Differentiate between probe mapping and runtime mapping libnvdimm/pfn_dev: Don't clear device memmap area during generic namespace probe libnvdimm: Trivial comment fix ...
2019-11-19libnvdimm: Export the target_node attribute for regions and namespacesDan Williams
Aneesh points out that some platforms may have "local" attached persistent memory and "remote" persistent memory that map to the same "online" node, or persistent memory devices with different performance properties. In this case 'numa_node' is identical for the two instances, but 'target_node' is differentiated so platform firmware can communicate distinct performance properties per range. Expose 'target_node' by default to allow for disambiguation of devices that share the same numa_map_to_online_node() result. Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157401274500.43284.2369509941678577768.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-19libnvdimm: Move nvdimm_bus_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nvdimm_bus_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309903815.1582359.6418211876315050283.stgit@dwillia2-desk3.amr.corp.intel.com
2019-11-19libnvdimm: Move nd_numa_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_numa_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157401269537.43284.14411189404186877352.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-11-17libnvdimm: Move nd_device_attribute_group to device_typeDan Williams
A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_device_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. For regions this creates a new nd_region_attribute_groups[] added to the per-region device-type instances. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309901138.1582359.12909354140826530394.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-10-23compat_ioctl: move more drivers to compat_ptr_ioctlArnd Bergmann
The .ioctl and .compat_ioctl file operations have the same prototype so they can both point to the same function, which works great almost all the time when all the commands are compatible. One exception is the s390 architecture, where a compat pointer is only 31 bit wide, and converting it into a 64-bit pointer requires calling compat_ptr(). Most drivers here will never run in s390, but since we now have a generic helper for it, it's easy enough to use it consistently. I double-checked all these drivers to ensure that all ioctl arguments are used as pointers or are ignored, but are not interpreted as integer values. Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Darren Hart (VMware) <dvhart@infradead.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-24libnvdimm/region: Initialize bad block for volatile namespacesAneesh Kumar K.V
We do check for a bad block during namespace init and that use region bad block list. We need to initialize the bad block for volatile regions for this to work. We also observe a lockdep warning as below because the lock is not initialized correctly since we skip bad block init for volatile regions. INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149 Call Trace: [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable) [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60 [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0 [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270 [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290 [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0 [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0 [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160 [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240 [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0 [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0 [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0 [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0 [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130 [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50 [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0 [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170 [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100 [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48 [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0 [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180 [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-05libnvdimm/pmem: Advance namespace seed for specific probe errorsAneesh Kumar K.V
In order to support marking namespaces with unsupported feature/versions disabled, nvdimm core should advance the namespace seed on these probe failures. Otherwise, these failed namespaces will be considered a seed namespace and will be wrongly used while creating new namespaces. Add -EOPNOTSUPP as return from pmem probe callback to indicate a namespace initialization failures due to pfn superblock feature/version mismatch. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190905154603.10349-3-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-09-05libnvdimm/region: Rewrite _probe_success() to _advance_seeds()Dan Williams
The nd_region_probe_success() helper collides seed management with nvdimm->busy tracking. Given the 'busy' increment is handled internal to the nd_region driver 'probe' path move the decrement to the 'remove' path. With that cleanup the routine can be renamed to the more descriptive nd_region_advance_seeds(). The change is prompted by an incoming need to optionally advance the seeds on other events besides 'probe' success. Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190905154603.10349-2-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-08-29libnvdimm/security: Introduce a 'frozen' attributeDan Williams
In the process of debugging a system with an NVDIMM that was failing to unlock it was found that the kernel is reporting 'locked' while the DIMM security interface is 'frozen'. Unfortunately the security state is tracked internally as an enum which prevents it from communicating the difference between 'locked' and 'locked + frozen'. It follows that the enum also prevents the kernel from communicating 'unlocked + frozen' which would be useful for debugging why security operations like 'change passphrase' are disabled. Ditch the security state enum for a set of flags and introduce a new sysfs attribute explicitly for the 'frozen' state. The regression risk is low because the 'frozen' state was already blocked behind the 'locked' state, but will need to revisit if there were cases where applications need 'frozen' to show up in the primary 'security' attribute. The expectation is that communicating 'frozen' is mostly a helper for debug and status monitoring. Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/156686729474.184120.5835135644278860826.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-18driver-core, libnvdimm: Let device subsystems add local lockdep coverageDan Williams
For good reason, the standard device_lock() is marked lockdep_set_novalidate_class() because there is simply no sane way to describe the myriad ways the device_lock() ordered with other locks. However, that leaves subsystems that know their own local device_lock() ordering rules to find lock ordering mistakes manually. Instead, introduce an optional / additional lockdep-enabled lock that a subsystem can acquire in all the same paths that the device_lock() is acquired. A conversion of the NFIT driver and NVDIMM subsystem to a lockdep-validate device_lock() scheme is included. The debug_nvdimm_lock() implementation implements the correct lock-class and stacking order for the libnvdimm device topology hierarchy. Yes, this is a hack, but hopefully it is a useful hack for other subsystems device_lock() debug sessions. Quoting Greg: "Yeah, it feels a bit hacky but it's really up to a subsystem to mess up using it as much as anything else, so user beware :) I don't object to it if it makes things easier for you to debug." Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com
2019-07-18libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlockDan Williams
A multithreaded namespace creation/destruction stress test currently deadlocks with the following lockup signature: INFO: task ndctl:2924 blocked for more than 122 seconds. Tainted: G OE 5.2.0-rc4+ #3382 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ndctl D 0 2924 1176 0x00000000 Call Trace: ? __schedule+0x27e/0x780 schedule+0x30/0xb0 wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm] ? finish_wait+0x80/0x80 uuid_store+0xe6/0x2e0 [libnvdimm] kernfs_fop_write+0xf0/0x1a0 vfs_write+0xb7/0x1b0 ksys_write+0x5c/0xd0 do_syscall_64+0x60/0x240 INFO: task ndctl:2923 blocked for more than 122 seconds. Tainted: G OE 5.2.0-rc4+ #3382 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ndctl D 0 2923 1175 0x00000000 Call Trace: ? __schedule+0x27e/0x780 ? __mutex_lock+0x489/0x910 schedule+0x30/0xb0 schedule_preempt_disabled+0x11/0x20 __mutex_lock+0x48e/0x910 ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] ? __lock_acquire+0x23f/0x1710 ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm] __dax_pmem_probe+0x5e/0x210 [dax_pmem_core] ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm] dax_pmem_probe+0xc/0x20 [dax_pmem] nvdimm_bus_probe+0x90/0x2c0 [libnvdimm] really_probe+0xef/0x390 driver_probe_device+0xb4/0x100 In this sequence an 'nd_dax' device is being probed and trying to take the lock on its backing namespace to validate that the 'nd_dax' device indeed has exclusive access to the backing namespace. Meanwhile, another thread is trying to update the uuid property of that same backing namespace. So one thread is in the probe path trying to acquire the lock, and the other thread has acquired the lock and tries to flush the probe path. Fix this deadlock by not holding the namespace device_lock over the wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and subsequently dropped internally to wait_nvdimm_bus_probe_idle(). Cc: <stable@vger.kernel.org> Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation") Cc: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-18libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()Dan Williams
In preparation for fixing a deadlock between wait_for_bus_probe_idle() and the nvdimm_bus_list_mutex arrange for __nd_ioctl() without nvdimm_bus_list_mutex held. This also unifies the 'dimm' and 'bus' level ioctls into a common nd_ioctl() preamble implementation. Marked for -stable as it is a pre-requisite for a follow-on fix. Cc: <stable@vger.kernel.org> Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation") Cc: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/156341209518.292348.7183897251740665198.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-18libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrantDan Williams
In preparation for not holding a lock over the execution of nd_ioctl(), update the implementation to allow multiple threads to be attempting ioctls at the same time. The bus lock still prevents multiple in-flight ->ndctl() invocations from corrupting each other's state, but static global staging buffers are moved to the heap. Reported-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/156341208947.292348.10560140326807607481.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-18libnvdimm/bus: Prevent duplicate device_unregister() callsDan Williams
A multithreaded namespace creation/destruction stress test currently fails with signatures like the following: sysfs group 'power' not found for kobject 'dax1.1' RIP: 0010:sysfs_remove_group+0x76/0x80 Call Trace: device_del+0x73/0x370 device_unregister+0x16/0x50 nd_async_device_unregister+0x1e/0x30 [libnvdimm] async_run_entry_fn+0x39/0x160 process_one_work+0x23c/0x5e0 worker_thread+0x3c/0x390 BUG: kernel NULL pointer dereference, address: 0000000000000020 RIP: 0010:klist_put+0x1b/0x6c Call Trace: klist_del+0xe/0x10 device_del+0x8a/0x2c9 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 device_unregister+0x44/0x4f nd_async_device_unregister+0x22/0x2d [libnvdimm] async_run_entry_fn+0x47/0x15a process_one_work+0x1a2/0x2eb worker_thread+0x1b8/0x26e Use the kill_device() helper to atomically resolve the race of multiple threads issuing kill, device_unregister(), requests. Reported-by: Jane Chu <jane.chu@oracle.com> Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com> Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...") Cc: <stable@vger.kernel.org> Link: https://github.com/pmem/ndctl/issues/96 Tested-by: Tested-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 64 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-20libnvdimm: Fix compilation warnings with W=1Qian Cai
Several places (dimm_devs.c, core.c etc) include label.h but only label.c uses NSINDEX_SIGNATURE, so move its definition to label.c instead. In file included from drivers/nvdimm/dimm_devs.c:23: drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but not used [-Wunused-const-variable=] Also, some places abuse "/**" which is only reserved for the kernel-doc. drivers/nvdimm/bus.c:648: warning: cannot understand function prototype: 'struct attribute_group nd_device_attribute_group = ' drivers/nvdimm/bus.c:677: warning: cannot understand function prototype: 'struct attribute_group nd_numa_attribute_group = ' Those are just some member assignments for the "struct attribute_group" instances and it can't be expressed in the kernel-doc. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-04-09treewide: Switch printk users from %pf and %pF to %ps and %pS, respectivelySakari Ailus
%pF and %pf are functionally equivalent to %pS and %ps conversion specifiers. The former are deprecated, therefore switch the current users to use the preferred variant. The changes have been produced by the following command: git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \ while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done And verifying the result. Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: xen-devel@lists.xenproject.org Cc: linux-acpi@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: drbd-dev@lists.linbit.com Cc: linux-block@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvdimm@lists.01.org Cc: linux-pci@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-btrfs@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: linux-mm@kvack.org Cc: ceph-devel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: David Sterba <dsterba@suse.com> (for btrfs) Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c) Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci) Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-01-31libnvdimm: Schedule device registration on node local to the deviceAlexander Duyck
Force the device registration for nvdimm devices to be closer to the actual device. This is achieved by using either the NUMA node ID of the region, or of the parent. By doing this we can have everything above the region based on the region, and everything below the region based on the nvdimm bus. By guaranteeing NUMA locality I see an improvement of as high as 25% for per-node init of a system with 12TB of persistent memory. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-27Merge miscellaneous libnvdimm updates for 4.21Dan Williams
* Use common helpers, bitmap_zalloc() and kstrndup(), to replace open coded versions. * Clarify the comments around hotplug vs initial init case for the nfit driver. * Cleanup the libnvdimm init path.
2018-12-21acpi/nfit, libnvdimm/security: Add security DSM overwrite supportDave Jiang
Add support for the NVDIMM_FAMILY_INTEL "ovewrite" capability as described by the Intel DSM spec v1.7. This will allow triggering of overwrite on Intel NVDIMMs. The overwrite operation can take tens of minutes. When the overwrite DSM is issued successfully, the NVDIMMs will be unaccessible. The kernel will do backoff polling to detect when the overwrite process is completed. According to the DSM spec v1.7, the 128G NVDIMMs can take up to 15mins to perform overwrite and larger DIMMs will take longer. Given that overwrite puts the DIMM in an indeterminate state until it completes introduce the NDD_SECURITY_OVERWRITE flag to prevent other operations from executing when overwrite is happening. The NDD_WORK_PENDING flag is added to denote that there is a device reference on the nvdimm device for an async workqueue thread context. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-12-13acpi/nfit, libnvdimm: Introduce nvdimm_security_opsDave Jiang
Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command set, expose a security capability to lock the DIMMs at poweroff and require a passphrase to unlock them. The security model is derived from ATA security. In anticipation of other DIMMs implementing a similar scheme, and to abstract the core security implementation away from the device-specific details, introduce nvdimm_security_ops. Initially only a status retrieval operation, ->state(), is defined, along with the base infrastructure and definitions for future operations. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-12-10libnvdimm, bus: Check id immediately following ida_simple_getOcean He
The id check was not executed immediately following ida_simple_get. Just change the codes position, without function change. Signed-off-by: Ocean He <hehy1@lenovo.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-12-04acpi/nfit: Add support for Intel DSM 1.8 commandsDave Jiang
Add command definition for security commands defined in Intel DSM specification v1.8 [1]. This includes "get security state", "set passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite", "overwrite query", "master passphrase enable/disable", and "master erase", . Since this adds several Intel definitions, move the relevant bits to their own header. These commands mutate physical data, but that manipulation is not cache coherent. The requirement to flush and invalidate caches makes these commands unsuitable to be called from userspace, so extra logic is added to detect and block these commands from being submitted via the ioctl command submission path. Lastly, the commands may contain sensitive key material that should not be dumped in a standard debug session. Update the nvdimm-command payload-dump facility to move security command payloads behind a default-off compile time switch. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-26libnvdimm: Set device node in nd_device_registerAlexander Duyck
This change makes it so that we don't repeatedly overwrite the device node for nvdimm regions. The earliest we can set the node is immediately after calling device init, so I have moved the code there so we can avoid rewriting the node with each uevent. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-26libnvdimm: Hold reference on parent while scheduling async initAlexander Duyck
Unlike asynchronous initialization in the core we have not yet associated the device with the parent, and as such the device doesn't hold a reference to the parent. In order to resolve that we should be holding a reference on the parent until the asynchronous initialization has completed. Cc: <stable@vger.kernel.org> Fixes: 4d88a97aa9e8 ("libnvdimm: ...base ... infrastructure") Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-08-20libnvdimm: fix ars_status output length calculationVishal Verma
Commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling") Introduced additional hardening for ambiguity in the ACPI spec for ars_status output sizing. However, it had a couple of cases mixed up. Where it should have been checking for (and returning) "out_field[1] - 4" it was using "out_field[1] - 8" and vice versa. This caused a four byte discrepancy in the buffer size passed on to the command handler, and in some cases, this caused memory corruption like: ./daxdev-errors.sh: line 76: 24104 Aborted (core dumped) ./daxdev-errors $busdev $region malloc(): memory corruption Program received signal SIGABRT, Aborted. [...] #5 0x00007ffff7865a2e in calloc () from /lib64/libc.so.6 #6 0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136 #7 0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144 #8 test_daxdev_clear_error (region_name=<optimized out>, bus_name=<optimized out>) at daxdev-errors.c:332 Cc: <stable@vger.kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Lukasz Dorau <lukasz.dorau@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Fixes: efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-of-by: Dave Jiang <dave.jiang@intel.com>
2018-06-02libnvdimm: Debug probe timesDan Williams
Instrument nvdimm_bus_probe() to emit timestamps for the start and end of libnvdimm device probing. This is useful for identifying sources of libnvdimm sub-system initialization latency. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-31linvdimm, pmem: Preserve read-only setting for pmem devicesRobert Elliott
The pmem driver does not honor a forced read-only setting for very long: $ blockdev --setro /dev/pmem0 $ blockdev --getro /dev/pmem0 1 followed by various commands like these: $ blockdev --rereadpt /dev/pmem0 or $ mkfs.ext4 /dev/pmem0 results in this in the kernel serial log: nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write with the read-only setting lost: $ blockdev --getro /dev/pmem0 0 That's from bus.c nvdimm_revalidate_disk(), which always applies the setting from nd_region (which is initially based on the ACPI NFIT NVDIMM state flags not_armed bit). In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") fixed this issue for SCSI devices to preserve the previous setting if it was set to read-only. This patch modifies bus.c to preserve any previous read-only setting. It also eliminates the kernel serial log print except for cases where read-write is changed to read-only, so it doesn't print read-only to read-only non-changes. Cc: <stable@vger.kernel.org> Fixes: 581388209405 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only") Signed-off-by: Robert Elliott <elliott@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-04-07libnvdimm: Add of_node to region and bus descriptorsOliver O'Halloran
We want to be able to cross reference the region and bus devices with the device tree node that they were spawned from. libNVDIMM handles creating the actual devices for these internally, so we need to pass in a pointer to the relevant node in the descriptor. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-06libnvdimm: remove redundant __func__ in dev_dbgDan Williams
Dynamic debug can be instructed to add the function name to the debug output using the +f switch, so there is no need for the libnvdimm modules to do it again. If a user decides to add the +f switch for libnvdimm's dynamic debug this results in double prints of the function name. Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-12-04nfit, libnvdimm: deprecate the generic SMART ioctlDan Williams
The kernel's ND_IOCTL_SMART_THRESHOLD command is based on a payload definition that has become broken / out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL definition. Deprecate the use of the ND_IOCTL_SMART_THRESHOLD command in favor of the ND_CMD_CALL approach taken by NVDIMM_FAMILY_{HPE,MSFT}, where we can manage the per-vendor variance in userspace. In a couple years, when the new scheme is widely deployed in userspace packages, the ND_IOCTL_SMART_THRESHOLD support can be removed. For now we prevent new binaries from compiling against the kernel header definitions, but kernel still compatible with old binaries. The libndctl.h [1] header is now the authoritative interface definition for NVDIMM SMART. [1]: https://github.com/pmem/ndctl Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-02libnvdimm: move poison list functions to a new 'badrange' fileDave Jiang
nfit_test needs to use the poison list manipulation code as well. Make it more generic and in the process rename poison to badrange, and move all the related helpers to a new file. Signed-off-by: Dave Jiang <dave.jiang@intel.com> [vishal: Add badrange.o to nfit_test's Kbuild] [vishal: add a missed include in bus.c for the new badrange functions] [vishal: rename all instances of 'be' to 'bre'] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-04libnvdimm, nfit: move the check on nd_reserved2 to the endpointMeng Xu
Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl) that uses it, as a prevention of a potential double-fetch bug. While examining the kernel source code, I found a dangerous operation that could turn into a double-fetch situation (a race condition bug) where the same userspace memory region are fetched twice into kernel with sanity checks after the first fetch while missing checks after the second fetch. In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL: 1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg) 2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes (line 984 to 986). 3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len) 4. Given that `p` can be fully controlled in userspace, an attacker can race condition to override the header part of `p`, say, `((struct nd_cmd_pkg *)p)->nd_reserved2` to arbitrary value (say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the second fetch. The changed value will be copied to `buf`. 5. There is no checks on the second fetches until the use of it in line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc) which means that the assumed relation, `p->nd_reserved2` are all zeroes might not hold after the second fetch. And once the control goes to these functions we lose the context to assert the assumed relation. 6. Based on my manual analysis, `p->nd_reserved2` is not used in function `nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl` so there is no working exploit against it right now. However, this could easily turns to an exploitable one if careless developers start to use `p->nd_reserved2` later and assume that they are all zeroes. Move the validation of the nd_reserved2 field to the ->ndctl() implementation where it has a stable buffer to evaluate. Signed-off-by: Meng Xu <mengxu.gatech@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm: fix integer overflow static analysis warningDan Williams
Dan reports: The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the following static checker warning: drivers/nvdimm/bus.c:1018 __nd_ioctl() warn: integer overflows 'buf_len' From a casual review, this seems like it might be a real bug. On the first iteration we load some data into in_env[]. On the second iteration we read a use controlled "in_size" from nd_cmd_in_size(). It can go up to UINT_MAX - 1. A high number means we will fill the whole in_env[] buffer. But we potentially keep looping and adding more to in_len so now it can be any value. It simple enough to change, but it feels weird that we keep looping even though in_env is totally full. Shouldn't we just return an error if we don't have space for desc->in_num. We keep looping because the size of the total input is allowed to be bigger than the 'envelope' which is a subset of the payload that tells us how much data to expect. For safety explicitly check that buf_len does not overflow which is what the checker flagged. Cc: <stable@vger.kernel.org> Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..." Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm: fix potential deadlock while clearing errorsVishal Verma
With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths. Specifically, the DSM to clear media errors is called during writes, so that we can provide a writes-fix-errors model. However it is easy to imagine a scenario like: -> write through the nvdimm driver -> acpi allocation -> writeback, causes more IO through the nvdimm driver -> deadlock Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO flag for the current scope when issuing commands/IOs that are expected to clear errors. Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Robert Moore <robert.moore@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-07-03Merge branch 'for-4.13/dax' into libnvdimm-for-nextDan Williams
2017-07-01libnvdimm: passthru functions clear to sendJerry Hoemann
Have dsm functions called via the pass thru mechanism also be checked against clear to send. Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-27libnvdimm, nfit: enable support for volatile rangesDan Williams
Allow volatile nfit ranges to participate in all the same infrastructure provided for persistent memory regions. A resulting resulting namespace device will still be called "pmem", but the parent region type will be "nd_volatile". This is in preparation for disabling the dax ->flush() operation in the pmem driver when it is hosted on a volatile range. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15libnvdimm, pmem: Add sysfs notifications to badblocksToshi Kani
Sysfs "badblocks" information may be updated during run-time that: - MCE, SCI, and sysfs "scrub" may add new bad blocks - Writes and ioctl() may clear bad blocks Add support to send sysfs notifications to sysfs "badblocks" file under region and pmem directories when their badblocks information is re-evaluated (but is not necessarily changed) during run-time. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Linda Knippers <linda.knippers@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-29libnvdimm: rework region badblocks clearingDan Williams
Toshi noticed that the new support for a region-level badblocks missed the case where errors are cleared due to BTT I/O. An initial attempt to fix this ran into a "sleeping while atomic" warning due to taking the nvdimm_bus_lock() in the BTT I/O path to satisfy the locking requirements of __nvdimm_bus_badblocks_clear(). However, that lock is not needed since we are not acting on any data that is subject to change under that lock. The badblocks instance has its own internal lock to handle mutations of the error list. So, in order to make it clear that we are just acting on region devices, rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions(). Eliminate the lock and consolidate all support routines for the new nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the opportunity to cleanup to some unnecessary casts, make the calling convention of nvdimm_clear_badblocks_regions() clearer by replacing struct resource with the minimal struct clear_badblocks_context, and use the DEVICE_ATTR macro. Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-28libnvdimm: fix clear length of nvdimm_forget_poison()Toshi Kani
ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length of error actually cleared, which may be smaller than its requested 'len'. Change nvdimm_clear_poison() to call nvdimm_forget_poison() with 'clear_err.cleared' when this value is valid. Cc: <stable@vger.kernel.org> Fixes: e046114af5fc ("libnvdimm: clear the internal poison_list when clearing badblocks") Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-13libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocationDave Jiang
The following warning results from holding a lane spinlock, preempt_disable(), or the btt map spinlock and then trying to take the reconfig_mutex to walk the poison list and potentially add new entries. BUG: sleeping function called from invalid context at kernel/locking/mutex. c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd [..] Call Trace: dump_stack+0x85/0xc8 ___might_sleep+0x184/0x250 __might_sleep+0x4a/0x90 __mutex_lock+0x58/0x9b0 ? nvdimm_bus_lock+0x21/0x30 [libnvdimm] ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm] ? acpi_nfit_forget_poison+0x79/0x80 [nfit] ? _raw_spin_unlock+0x27/0x40 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] nsio_rw_bytes+0x164/0x270 [libnvdimm] btt_write_pg+0x1de/0x3e0 [nd_btt] ? blk_queue_enter+0x30/0x290 btt_make_request+0x11a/0x310 [nd_btt] ? blk_queue_enter+0xb7/0x290 ? blk_queue_enter+0x30/0x290 generic_make_request+0x118/0x3b0 A spinlock is introduced to protect the poison list. This allows us to not having to acquire the reconfig_mutex for touching the poison list. The add_poison() function has been broken out into two helper functions. One to allocate the poison entry and the other to apppend the entry. This allows us to unlock the poison_lock in non-I/O path and continue to be able to allocate the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in order to satisfy being in atomic context. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-12libnvdimm: add support for clear poison list and badblocks for device daxDave Jiang
Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR call. We will update the poison list and also the badblocks at region level if the region is in dax mode or in pmem mode and not active. In other words we force badblocks to be cleared through write requests if the address is currently accessed through a block device, otherwise it can only be done via the ioctl+dsm path. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-10libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splatDan Williams
Holding the reconfig_mutex over a potential userspace fault sets up a lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl path. Move the user access outside of the lock. [ INFO: possible circular locking dependency detected ] 4.11.0-rc3+ #13 Tainted: G W O ------------------------------------------------------- fallocate/16656 is trying to acquire lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] but task is already holding lock: (jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (jbd2_handle){++++..}: lock_acquire+0xbd/0x200 start_this_handle+0x16a/0x460 jbd2__journal_start+0xe9/0x2d0 __ext4_journal_start_sb+0x89/0x1c0 ext4_dirty_inode+0x32/0x70 __mark_inode_dirty+0x235/0x670 generic_update_time+0x87/0xd0 touch_atime+0xa9/0xd0 ext4_file_mmap+0x90/0xb0 mmap_region+0x370/0x5b0 do_mmap+0x415/0x4f0 vm_mmap_pgoff+0xd7/0x120 SyS_mmap_pgoff+0x1c5/0x290 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #1 (&mm->mmap_sem){++++++}: lock_acquire+0xbd/0x200 __might_fault+0x70/0xa0 __nd_ioctl+0x683/0x720 [libnvdimm] nvdimm_ioctl+0x8b/0xe0 [libnvdimm] do_vfs_ioctl+0xa8/0x740 SyS_ioctl+0x79/0x90 do_syscall_64+0x6c/0x200 return_from_SYSCALL_64+0x0/0x7a -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}: __lock_acquire+0x16b6/0x1730 lock_acquire+0xbd/0x200 __mutex_lock+0x88/0x9b0 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] pmem_do_bvec+0x1c2/0x2b0 [nd_pmem] pmem_make_request+0xf9/0x270 [nd_pmem] generic_make_request+0x118/0x3b0 submit_bio+0x75/0x150 Cc: <stable@vger.kernel.org> Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices") Cc: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-06acpi, nfit, libnvdimm: fix / harden ars_status output length handlingDan Williams
Given ambiguities in the ACPI 6.1 definition of the "Output (Size)" field of the ARS (Address Range Scrub) Status command, a firmware implementation may in practice return 0, 4, or 8 to indicate that there is no output payload to process. The specification states "Size of Output Buffer in bytes, including this field.". However, 'Output Buffer' is also the name of the entire payload, and earlier in the specification it states "Max Query ARS Status Output Buffer Size: Maximum size of buffer (including the Status and Extended Status fields)". Without this fix if the BIOS happens to return 0 it causes memory corruption as evidenced by this result from the acpi_nfit_ctl() unit test. ars_status00000000: 00020000 00000000 ........ BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff) kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC task: ffff8803332d2ec0 task.stack: ffffc9000174c000 RIP: 0010:[<ffffffff814cfe72>] [<ffffffff814cfe72>] __memcpy+0x12/0x20 RSP: 0018:ffffc9000174f9a8 EFLAGS: 00010246 RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56 RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000 RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8 R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0 FS: 00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0 Stack: ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac 0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000 0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246 Call Trace: [<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit] [<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test] Cc: <stable@vger.kernel.org> Fixes: 747ffe11b440 ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing") Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-10-07Merge branch 'for-4.9/libnvdimm' into libnvdimm-for-nextDan Williams
2016-09-30libnvdimm: clear the internal poison_list when clearing badblocksVishal Verma
nvdimm_clear_poison cleared the user-visible badblocks, and sent commands to the NVDIMM to clear the areas marked as 'poison', but it neglected to clear the same areas from the internal poison_list which is used to marshal ARS results before sorting them by namespace. As a result, once on-demand ARS functionality was added: 37b137f nfit, libnvdimm: allow an ARS scrub to be triggered on demand A scrub triggered from either sysfs or an MCE was found to be adding stale entries that had been cleared from gendisk->badblocks, but were still present in nvdimm_bus->poison_list. Additionally, the stale entries could be triggered into producing stale disk->badblocks by simply disabling and re-enabling the namespace or region. This adds the missing step of clearing poison_list entries when clearing poison, so that it is always in sync with badblocks. Fixes: 37b137f ("nfit, libnvdimm: allow an ARS scrub to be triggered on demand") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-09libnvdimm: allow legacy (e820) pmem region to clear bad blocksDave Jiang
Bad blocks can be injected via /sys/block/pmemN/badblocks. In a situation where legacy pmem is being used or a pmem region created by using memmap kernel parameter, the injected bad blocks are not cleared due to nvdimm_clear_poison() failing from lack of ndctl function pointer. In this case we need to just return as handled and allow the bad blocks to be cleared rather than fail. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-28Merge tag 'libnvdimm-for-4.8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...