linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2020-07-29	revert: 1320a4052ea1 ("audit: trigger accompanying records when no rules ↵	Paul Moore
	present") Unfortunately the commit listed in the subject line above failed to ensure that the task's audit_context was properly initialized/set before enabling the "accompanying records". Depending on the situation, the resulting audit_context could have invalid values in some of it's fields which could cause a kernel panic/oops when the task/syscall exists and the audit records are generated. We will revisit the original patch, with the necessary fixes, in a future kernel but right now we just want to fix the kernel panic with the least amount of added risk. Cc: stable@vger.kernel.org Fixes: 1320a4052ea1 ("audit: trigger accompanying records when no rules present") Reported-by: j2468h@googlemail.com Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-07-29	Merge remote-tracking branch 'spi/for-5.9' into spi-next	Mark Brown

2020-07-29	Merge remote-tracking branch 'spi/for-5.8' into spi-linus	Mark Brown

2020-07-29	Merge series "Some bug fix for lpspi" from Clark Wang <xiaoning.wang@nxp.com>:	Mark Brown
	Hi, This patchset mainly fixes some recently discovered problems about CS for LPSPI module on i.MX8DXLEVK. Add the dt-bindings description for the new property. Clark Wang (4): spi: lpspi: Fix kernel warning dump when probe fail after calling spi_register spi: lpspi: remove unused fsl_lpspi->chipselect spi: lpspi: fix using CS discontinuously on i.MX8DXLEVK dt-bindings: lpspi: New property in document DT bindings for LPSPI .../bindings/spi/spi-fsl-lpspi.yaml \| 7 ++++++ drivers/spi/spi-fsl-lpspi.c \| 25 +++++++++++-------- 2 files changed, 21 insertions(+), 11 deletions(-) -- 2.17.1
2020-07-29	dt-bindings: lpspi: New property in document DT bindings for LPSPI	Clark Wang
	Add "fsl,spi-only-use-cs1-sel" to fit i.MX8DXL-EVK. Spi common code does not support use of CS signals discontinuously. It only uses CS1 without using CS0. So, add this property to re-config chipselect value. Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://lore.kernel.org/r/20200727031513.31774-1-xiaoning.wang@nxp.com Signed-off-by: Mark Brown <broonie@kernel.org>
2020-07-29	spi: lpspi: fix using CS discontinuously on i.MX8DXLEVK	Clark Wang
	SPI common code does not support using CS discontinuously for now. However, i.MX8DXL-EVK only uses CS1 without CS0. Therefore, add a flag is_only_cs1 to set the correct TCR[PCS]. Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://lore.kernel.org/r/20200727031448.31661-4-xiaoning.wang@nxp.com Signed-off-by: Mark Brown <broonie@kernel.org>
2020-07-29	spi: lpspi: remove unused fsl_lpspi->chipselect	Clark Wang
	The cs-gpio is initailized by spi_get_gpio_descs() now. Remove the chipselect. Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://lore.kernel.org/r/20200727031448.31661-3-xiaoning.wang@nxp.com Signed-off-by: Mark Brown <broonie@kernel.org>
2020-07-29	spi: lpspi: Fix kernel warning dump when probe fail after calling spi_register	Clark Wang
	Calling devm_spi_register_controller() too early will cause problem. When probe failed occurs after calling devm_spi_register_controller(), the call of spi_controller_put() will trigger the following warning dump. [ 2.092138] ------------[ cut here ]------------ [ 2.096876] kernfs: can not remove 'uevent', no directory [ 2.102440] WARNING: CPU: 0 PID: 181 at fs/kernfs/dir.c:1503 kernfs_remove_by_name_ns+0xa0/0xb0 [ 2.111142] Modules linked in: [ 2.114207] CPU: 0 PID: 181 Comm: kworker/0:7 Not tainted 5.4.24-05024-g775c6e8a738c-dirty #1314 [ 2.122991] Hardware name: Freescale i.MX8DXL EVK (DT) [ 2.128141] Workqueue: events deferred_probe_work_func [ 2.133281] pstate: 60000005 (nZCv daif -PAN -UAO) [ 2.138076] pc : kernfs_remove_by_name_ns+0xa0/0xb0 [ 2.142958] lr : kernfs_remove_by_name_ns+0xa0/0xb0 [ 2.147837] sp : ffff8000122bba70 [ 2.151145] x29: ffff8000122bba70 x28: ffff8000119d6000 [ 2.156462] x27: 0000000000000000 x26: ffff800011edbce8 [ 2.161779] x25: 0000000000000000 x24: ffff00003ae4f700 [ 2.167096] x23: ffff000010184c10 x22: ffff00003a3d6200 [ 2.172412] x21: ffff800011a464a8 x20: ffff000010126a68 [ 2.177729] x19: ffff00003ae5c800 x18: 000000000000000e [ 2.183046] x17: 0000000000000001 x16: 0000000000000019 [ 2.188362] x15: 0000000000000004 x14: 000000000000004c [ 2.193679] x13: 0000000000000000 x12: 0000000000000001 [ 2.198996] x11: 0000000000000000 x10: 00000000000009c0 [ 2.204313] x9 : ffff8000122bb7a0 x8 : ffff00003a3d6c20 [ 2.209630] x7 : ffff00003a3d6380 x6 : 0000000000000001 [ 2.214946] x5 : 0000000000000001 x4 : ffff00003a05eb18 [ 2.220263] x3 : 0000000000000005 x2 : ffff8000119f1c48 [ 2.225580] x1 : 2bcbda323bf5a800 x0 : 0000000000000000 [ 2.230898] Call trace: [ 2.233345] kernfs_remove_by_name_ns+0xa0/0xb0 [ 2.237879] sysfs_remove_file_ns+0x14/0x20 [ 2.242065] device_del+0x12c/0x348 [ 2.245555] device_unregister+0x14/0x30 [ 2.249492] spi_unregister_controller+0xac/0x120 [ 2.254201] devm_spi_unregister+0x10/0x18 [ 2.258304] release_nodes+0x1a8/0x220 [ 2.262055] devres_release_all+0x34/0x58 [ 2.266069] really_probe+0x1b8/0x318 [ 2.269733] driver_probe_device+0x54/0xe8 [ 2.273833] __device_attach_driver+0x80/0xb8 [ 2.278194] bus_for_each_drv+0x74/0xc0 [ 2.282034] __device_attach+0xdc/0x138 [ 2.285876] device_initial_probe+0x10/0x18 [ 2.290063] bus_probe_device+0x90/0x98 [ 2.293901] deferred_probe_work_func+0x64/0x98 [ 2.298442] process_one_work+0x198/0x320 [ 2.302451] worker_thread+0x1f0/0x420 [ 2.306208] kthread+0xf0/0x120 [ 2.309352] ret_from_fork+0x10/0x18 [ 2.312927] ---[ end trace 58abcdfae01bd3c7 ]--- So put this function at the end of the probe sequence. Signed-off-by: Clark Wang <xiaoning.wang@nxp.com> Link: https://lore.kernel.org/r/20200727031448.31661-2-xiaoning.wang@nxp.com Signed-off-by: Mark Brown <broonie@kernel.org>
2020-07-29	usb: typec: tcpm: Add WARN_ON ensure we are not trying to send 2 VDM packets ↵	Hans de Goede
	at the same time The tcpm.c code for sending VDMs assumes that there will only be one VDM in flight at the time. The "queue" used by tcpm_queue_vdm is only 1 entry deep. This assumes that the higher layers (tcpm state-machine and alt-mode drivers) ensure that queuing a new VDM before the old one has been completely send (or it timed out) add a WARN_ON to check for this. Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20200724174702.61754-6-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	usb: typec: tcpm: Fix AB BA lock inversion between tcpm code and the ↵	Hans de Goede
	alt-mode drivers When we receive a PD data packet which ends up being for the alt-mode driver we have the following lock order: 1. tcpm_pd_rx_handler take the tcpm-port lock 2. We call into the alt-mode driver which takes the alt-mode's lock And when the alt-mode driver initiates communication we have the following lock order: 3. alt-mode driver takes the alt-mode's lock 4. alt-mode driver calls tcpm_altmode_enter which takes the tcpm-port lock This is a classic AB BA lock inversion issue. With the refactoring of tcpm_handle_vdm_request() done before this patch, we don't rely on, or need to make changes to the tcpm-port data by the time we make call 2. from above. All data to be passed to the alt-mode driver sits on our stack at this point, and thus does not need locking. So after the refactoring we can simply fix this by releasing the tcpm-port lock before calling into the alt-mode driver. This fixes the following lockdep warning: [ 191.454238] ====================================================== [ 191.454240] WARNING: possible circular locking dependency detected [ 191.454244] 5.8.0-rc5+ #1 Not tainted [ 191.454246] ------------------------------------------------------ [ 191.454248] kworker/u8:5/794 is trying to acquire lock: [ 191.454251] ffff9bac8e30d4a8 (&dp->lock){+.+.}-{3:3}, at: dp_altmode_vdm+0x30/0xf0 [typec_displayport] [ 191.454263] but task is already holding lock: [ 191.454264] ffff9bac9dc240a0 (&port->lock#2){+.+.}-{3:3}, at: tcpm_pd_rx_handler+0x43/0x12c0 [tcpm] [ 191.454273] which lock already depends on the new lock. [ 191.454275] the existing dependency chain (in reverse order) is: [ 191.454277] -> #1 (&port->lock#2){+.+.}-{3:3}: [ 191.454286] __mutex_lock+0x7b/0x820 [ 191.454290] tcpm_altmode_enter+0x23/0x90 [tcpm] [ 191.454293] dp_altmode_work+0xca/0xe0 [typec_displayport] [ 191.454299] process_one_work+0x23f/0x570 [ 191.454302] worker_thread+0x55/0x3c0 [ 191.454305] kthread+0x138/0x160 [ 191.454309] ret_from_fork+0x22/0x30 [ 191.454311] -> #0 (&dp->lock){+.+.}-{3:3}: [ 191.454317] __lock_acquire+0x1241/0x2090 [ 191.454320] lock_acquire+0xa4/0x3d0 [ 191.454323] __mutex_lock+0x7b/0x820 [ 191.454326] dp_altmode_vdm+0x30/0xf0 [typec_displayport] [ 191.454330] tcpm_pd_rx_handler+0x11ae/0x12c0 [tcpm] [ 191.454333] process_one_work+0x23f/0x570 [ 191.454336] worker_thread+0x55/0x3c0 [ 191.454338] kthread+0x138/0x160 [ 191.454341] ret_from_fork+0x22/0x30 [ 191.454343] other info that might help us debug this: [ 191.454345] Possible unsafe locking scenario: [ 191.454347] CPU0 CPU1 [ 191.454348] ---- ---- [ 191.454350] lock(&port->lock#2); [ 191.454353] lock(&dp->lock); [ 191.454355] lock(&port->lock#2); [ 191.454357] lock(&dp->lock); [ 191.454360] * DEADLOCK * Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20200724174702.61754-5-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	usb: typec: tcpm: Refactor tcpm_handle_vdm_request	Hans de Goede
	Refactor tcpm_handle_vdm_request and its tcpm_pd_svdm helper function so that reporting the results of the vdm to the altmode-driver is separated out into a clear separate step inside tcpm_handle_vdm_request, instead of being scattered over various places inside the tcpm_pd_svdm helper. This is a preparation patch for fixing an AB BA lock inversion between the tcpm code and some altmode drivers. Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20200724174702.61754-4-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	usb: typec: tcpm: Refactor tcpm_handle_vdm_request payload handling	Hans de Goede
	Refactor the tcpm_handle_vdm_request payload handling by doing the endianness conversion only once directly inside tcpm_handle_vdm_request itself instead of doing it multiple times inside various helper functions called by tcpm_handle_vdm_request. This is a preparation patch for some further refactoring to fix an AB BA lock inversion between the tcpm code and some altmode drivers. Reviewed-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20200724174702.61754-3-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	usb: typec: tcpm: Add tcpm_queue_vdm_unlocked() helper	Hans de Goede
	Various callers (all the typec_altmode_ops) take the port-lock just for the tcpm_queue_vdm() call. Add a new tcpm_queue_vdm_unlocked() helper which takes the lock, so that its callers don't have to do this themselves. This is a preparation patch for fixing an AB BA lock inversion between the tcpm code and some altmode drivers. Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20200724174702.61754-2-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	usb: typec: tcpm: Move mod_delayed_work(&port->vdm_state_machine) call into ↵	Hans de Goede
	tcpm_queue_vdm() All callers of tcpm_queue_vdm() immediately follow the tcpm_queue_vdm() vdm call with a: mod_delayed_work(port->wq, &port->vdm_state_machine, 0); Call, fold this into tcpm_queue_vdm() itself. Reviewed-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20200724174702.61754-1-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	Merge tag 'usb-ci-v5.9-rc1' of ↵	Greg Kroah-Hartman
	git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb into usb-next Peter writes: ENDIAN issue fix and one query controller role API is introduced. * tag 'usb-ci-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb: usb: chipidea: imx: get available runtime dr mode for wakeup setting usb: chipidea: add query_available_role interface Documentation: ABI: usb: chipidea: Update Li Jun's e-mail usb: chipidea: udc: fix the ENDIAN issue
2020-07-29	Documentation/sysctl: Document uclamp sysctl knobs	Qais Yousef
	Uclamp exposes 3 sysctl knobs: * sched_util_clamp_min * sched_util_clamp_max * sched_util_clamp_min_rt_default Document them in sysctl/kernel.rst. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-3-qais.yousef@arm.com
2020-07-29	sched/uclamp: Add a new sysctl to control RT default boost value	Qais Yousef
	RT tasks by default run at the highest capacity/performance level. When uclamp is selected this default behavior is retained by enforcing the requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum value. This is also referred to as 'the default boost value of RT tasks'. See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks"). On battery powered devices, it is desired to control this default (currently hardcoded) behavior at runtime to reduce energy consumed by RT tasks. For example, a mobile device manufacturer where big.LITTLE architecture is dominant, the performance of the little cores varies across SoCs, and on high end ones the big cores could be too power hungry. Given the diversity of SoCs, the new knob allows manufactures to tune the best performance/power for RT tasks for the particular hardware they run on. They could opt to further tune the value when the user selects a different power saving mode or when the device is actively charging. The runtime aspect of it further helps in creating a single kernel image that can be run on multiple devices that require different tuning. Keep in mind that a lot of RT tasks in the system are created by the kernel. On Android for instance I can see over 50 RT tasks, only a handful of which created by the Android framework. To control the default behavior globally by system admins and device integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default to change the default boost value of the RT tasks. I anticipate this to be mostly in the form of modifying the init script of a particular device. To avoid polluting the fast path with unnecessary code, the approach taken is to synchronously do the update by traversing all the existing tasks in the system. This could race with a concurrent fork(), which is dealt with by introducing sched_post_fork() function which will ensure the racy fork will get the right update applied. Tested on Juno-r2 in combination with the RT capacity awareness [1]. By default an RT task will go to the highest capacity CPU and run at the maximum frequency, which is particularly energy inefficient on high end mobile devices because the biggest core[s] are 'huge' and power hungry. With this patch the RT task can be controlled to run anywhere by default, and doesn't cause the frequency to be maximum all the time. Yet any task that really needs to be boosted can easily escape this default behavior by modifying its requested uclamp.min value (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall. [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-29	sched/uclamp: Fix a deadlock when enabling uclamp static key	Qais Yousef
	The following splat was caught when setting uclamp value of a task: BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49 cpus_read_lock+0x68/0x130 static_key_enable+0x1c/0x38 __sched_setscheduler+0x900/0xad8 Fix by ensuring we enable the key outside of the critical section in __sched_setscheduler() Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
2020-07-29	ALSA: hda: fix NULL pointer dereference during suspend	Ranjani Sridharan
	When the ASoC card registration fails and the codec component driver never probes, the codec device is not initialized and therefore memory for codec->wcaps is not allocated. This results in a NULL pointer dereference when the codec driver suspend callback is invoked during system suspend. Fix this by returning without performing any actions during codec suspend/resume if the card was not registered successfully. Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Link: https://lore.kernel.org/r/20200728231011.1454066-1-ranjani.sridharan@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2020-07-29	habanalabs: goya_ctx_init() can be static	kernel test robot
	Signed-off-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20200729000313.GA14680@e442e3f624c4 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	habanalabs: fix up absolute include instructions	Greg Kroah-Hartman
	There's no need to try to be cute with the include file locations in the Makefile, so just specify exactly where the files are. Bonus is this fixes the problem of building with O= as well as trying to just build the subdirectory alone. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Omer Shpigelman <oshpigelman@habana.ai> Cc: Tomer Tayar <ttayar@habana.ai> Cc: Moti Haimovski <mhaimovski@habana.ai> Cc: Ofir Bitton <obitton@habana.ai> Cc: Ben Segal <bpsegal20@gmail.com> Cc: Christine Gharzuzi <cgharzuzi@habana.ai> Cc: Pawel Piskorski <ppiskorski@habana.ai> Link: https://lore.kernel.org/r/20200728171851.55842-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29	nvme: add a Identify Namespace Identification Descriptor list quirk	Christoph Hellwig
	Add a quirk for a device that does not support the Identify Namespace Identification Descriptor list despite claiming 1.3 compliance. Fixes: ea43d9709f72 ("nvme: fix identify error status silent ignore") Reported-by: Ingo Brunberg <ingo_brunberg@web.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ingo Brunberg <ingo_brunberg@web.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2020-07-29	nvme-loop: remove extra variable in create ctrl	Chaitanya Kulkarni
	We can call the nvme_change_ctrl_state() directly and have WARN_ON_ONCE(1) call instead of having to use an extra variable which matches the name of the function. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-loop: set ctrl state connecting after init	Chaitanya Kulkarni
	When creating a loop controller (ctrl) in nvme_loop_create_ctrl() -> nvme_init_ctrl() we set the ctrl state to NVME_CTRL_NEW. Prior to [1] NVME_CTRL_NEW state was allowed in nvmf_check_ready() for fabrics command type connect. Now, this fails in the following code path for fabrics connect command when creating admin queue :- nvme_loop_create_ctrl() nvme_loo_configure_admin_queue() nvmf_connect_admin_queue() __nvme_submit_sync_cmd() blk_execute_rq() nvme_loop_queue_rq() nvmf_check_ready() # echo "transport=loop,nqn=fs" > /dev/nvme-fabrics [ 6047.741327] nvmet: adding nsid 1 to subsystem fs [ 6048.756430] nvme nvme1: Connect command failed, error wo/DNR bit: 880 We need to set the ctrl state to NVME_CTRL_CONNECTING after :- nvme_loop_create_ctrl() nvme_init_ctrl() so that the above mentioned check for nvmf_check_ready() will return true. This patch sets the ctrl state to connecting after we init the ctrl in nvme_loop_create_ctrl() nvme_init_ctrl() . [1] commit aa63fa6776a7 ("nvme-fabrics: allow to queue requests for live queues") Fixes: aa63fa6776a7 ("nvme-fabrics: allow to queue requests for live queues") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths	Hannes Reinecke
	When nvme_round_robin_path() finds a valid namespace we should be using it; falling back to __nvme_find_path() for non-optimized paths will cause the result from nvme_round_robin_path() to be ignored for non-optimized paths. Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy") Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-multipath: fix logic for non-optimized paths	Martin Wilck
	Handle the special case where we have exactly one optimized path, which we should keep using in this case. Fixes: 75c10e732724 ("nvme-multipath: round-robin I/O policy") Signed off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-rdma: fix controller reset hang during traffic	Sagi Grimberg
	commit fe35ec58f0d3 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-tcp: fix controller reset hang during traffic	Sagi Grimberg
	commit fe35ec58f0d3 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: fe35ec58f0d3 ("block: update hctx map when use multiple maps") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet: introduce the passthru Kconfig option	Chaitanya Kulkarni
	This patch updates KConfig file for the NVMeOF target where we add new option so that user can selectively enable/disable passthru code. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [logang@deltatee.com: fixed some of the wording in the help message] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet: introduce the passthru configfs interface	Logan Gunthorpe
	When CONFIG_NVME_TARGET_PASSTHRU as 'passthru' directory will be added to each subsystem. The directory is similar to a namespace and has two attributes: device_path and enable. The user must set the path to the nvme controller's char device and write '1' to enable the subsystem to use passthru. Any given subsystem is prevented from enabling both a regular namespace and the passthru device. If one is enabled, enabling the other will produce an error. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet: Add passthru enable/disable helpers	Logan Gunthorpe
	This patch adds helper functions which are used in the NVMeOF configfs when the user is configuring the passthru subsystem. Here we ensure that only one subsys is assigned to each nvme_ctrl by using an xarray on the cntlid. The subsystem's version number is overridden by the passed through controller's version. However, if that version is less than 1.2.1, then we bump the advertised version to that and print a warning in dmesg. Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet: add passthru code to process commands	Logan Gunthorpe
	Add passthru command handling capability for the NVMeOF target and export passthru APIs which are used to integrate passthru code with nvmet-core. The new file passthru.c handles passthru cmd parsing and execution. In the passthru mode, we create a block layer request from the nvmet request and map the data on to the block layer request. Admin commands and features are on an allow list as there are a number of each that don't make too much sense with passthrough. We use an allow list such that new commands can be considered before being blindly passed through. In both cases, vendor specific commands are always allowed. We also reject reservation IO commands as the underlying device cannot differentiate between multiple hosts behind a fabric. Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: export nvme_find_get_ns() and nvme_put_ns()	Logan Gunthorpe
	nvme_find_get_ns() and nvme_put_ns() are required by the target passthru code and are exported under the NVME_TARGET_PASSTHRU namespace. Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: introduce nvme_ctrl_get_by_path()	Logan Gunthorpe
	nvme_ctrl_get_by_path() is analogous to blkdev_get_by_path() except it gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0). It makes use of filp_open() to open the file and uses the private data to obtain a pointer to the struct nvme_ctrl. If the fops of the file do not match, -EINVAL is returned. The purpose of this function is to support NVMe-OF target passthru and is exported under the NVME_TARGET_PASSTHRU namespace. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: introduce nvme_execute_passthru_rq to call nvme_passthru_[start\|end]()	Logan Gunthorpe
	Introduce a new nvme_execute_passthru_rq() helper which calls nvme_passthru_[start\|end]() around blk_execute_rq(). This ensures all passthru calls (including nvme_submit_io()) will be wrapped appropriately. nvme_execute_passthru_rq() will also be useful for the nvmet passthru code and is exported in the NVME_TARGET_PASSTHRU namespace. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: create helper function to obtain command effects	Logan Gunthorpe
	Separate the code to obtain command effects from the code to start a passthru request and move the nvme_passthru_start() and nvme_passthru_end() functions up above nvme_submit_user_cmd() in order that they may be used in a new helper a subsequent patch. The new helper function will be necessary for nvmet passthru code to determine if we need to change out of interrupt context to handle the effects. It is exported in the NVME_TARGET_PASSTHRU namespace. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: clear any SGL flags in passthru commands	Logan Gunthorpe
	The host driver should decide whether to use SGLs or PRPs and they currently assume the flags are cleared after the call to nvme_setup_cmd(). However, passed-through commands may erroneously set these bits; so clear them for all cases. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet-fc: remove redundant del_work_active flag	James Smart
	The transport has a del_work_active flag to avoid duplicate scheduling of the del_work item. This is redundant with the checks that schedule_work() makes. Remove the del_work_active flag. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet-fc: check successful reference in nvmet_fc_find_target_assoc	James Smart
	When searching for an association based on an association id, when there is a match, the code takes a reference. However, it is not validating that the reference taking was successful. Check the status of the reference. If unsuccessful, the device is being deleted and should be ignored. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-fc: set max_segments to lldd max value	James Smart
	Currently the FC transport is set max_hw_sectors based on the lldds max sgl segment count. However, the block queue max segments is set based on the controller's max_segments count, which the transport does not set. As such, the lldd is receiving sgl lists that are exceeding its max segment count. Set the controller max segment count and derive max_hw_sectors from the max segment count. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-fc: drop a duplicated word in a comment	Randy Dunlap
	Drop the repeated word "a" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-hwmon: log the controller device name	Sagi Grimberg
	Stay consistent with the rest of the driver Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: fix deadlock in disconnect during scan_work and/or ana_work	Sagi Grimberg
	A deadlock happens in the following scenario with multipath: 1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it, path nvme1 happens to be inaccessible. 2) Before scan_work is complete nvme0 disconnect is initiated nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING 3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path IO is requeued and scan_work hangs. -- Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 -- 4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces(). Similiarly a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs. [ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core] [ 605.552087] Call Trace: [ 605.552683] __schedule+0x2b9/0x6c0 [ 605.553507] schedule+0x42/0xb0 [ 605.554201] io_schedule+0x16/0x40 [ 605.555012] do_read_cache_page+0x438/0x830 [ 605.556925] read_cache_page+0x12/0x20 [ 605.557757] read_dev_sector+0x27/0xc0 [ 605.558587] amiga_partition+0x4d/0x4c5 [ 605.561278] check_partition+0x154/0x244 [ 605.562138] rescan_partitions+0xae/0x280 [ 605.563076] __blkdev_get+0x40f/0x560 [ 605.563830] blkdev_get+0x3d/0x140 [ 605.564500] __device_add_disk+0x388/0x480 [ 605.565316] device_add_disk+0x13/0x20 [ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core] [ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] [ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core] [ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core] [ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core] [ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core] [ 605.573330] process_one_work+0x1db/0x380 [ 605.574144] worker_thread+0x4d/0x400 [ 605.574896] kthread+0x104/0x140 [ 605.577205] ret_from_fork+0x35/0x40 [ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds. [ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830 [ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.582320] nvme D 0 14044 14043 0x00000000 [ 605.583424] Call Trace: [ 605.583935] __schedule+0x2b9/0x6c0 [ 605.584625] schedule+0x42/0xb0 [ 605.585290] schedule_timeout+0x203/0x2f0 [ 605.588493] wait_for_completion+0xb1/0x120 [ 605.590066] __flush_work+0x123/0x1d0 [ 605.591758] __cancel_work_timer+0x10e/0x190 [ 605.593542] cancel_work_sync+0x10/0x20 [ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core] [ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core] [ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core] [ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core] [ 605.598320] dev_attr_store+0x17/0x30 Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will indicate the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine. Fixes: 0d0b660f214d ("nvme: add ANA support") Reported-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme: document nvme controller states	Sagi Grimberg
	We are starting to see some non-trivial states so lets start documenting them. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet: use xarray for ctrl ns storing	Chaitanya Kulkarni
	This patch replaces the ctrl->namespaces tracking from linked list to xarray and improves the performance when accessing one namespce :- XArray vs Default:- IOPS and BW (more the better) increase BW (~1.8%):- --------------------------------------------------- XArray :- read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) read: IOPS=162k, BW=631MiB/s (662MB/s)(18.5GiB/30001msec) Default:- read: IOPS=156k, BW=609MiB/s (639MB/s)(17.8GiB/30001msec) read: IOPS=157k, BW=613MiB/s (643MB/s)(17.0GiB/30001msec) read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) Submission latency (less the better) decrease (~8.3%):- ------------------------------------------------------- XArray:- slat (usec): min=7, max=8386, avg=11.19, stdev=5.96 slat (usec): min=7, max=441, avg=11.09, stdev=4.48 slat (usec): min=7, max=1088, avg=11.21, stdev=4.54 Default :- slat (usec): min=8, max=2826.5k, avg=23.96, stdev=3911.50 slat (usec): min=8, max=503, avg=12.52, stdev=5.07 slat (usec): min=8, max=2384, avg=12.50, stdev=5.28 CPU Usage (less the better) decrease (~5.2%):- ---------------------------------------------- XArray:- cpu : usr=1.84%, sys=18.61%, ctx=949471, majf=0, minf=250 cpu : usr=1.83%, sys=18.41%, ctx=950262, majf=0, minf=237 cpu : usr=1.82%, sys=18.82%, ctx=957224, majf=0, minf=234 Default:- cpu : usr=1.70%, sys=19.21%, ctx=858196, majf=0, minf=251 cpu : usr=1.82%, sys=19.98%, ctx=929720, majf=0, minf=227 cpu : usr=1.83%, sys=20.33%, ctx=947208, majf=0, minf=235. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvmet-rdma: use new shared CQ mechanism	Yamin Friedman
	Has the driver use shared CQs providing ~10%-20% improvement when multiple disks are used. Instead of opening a CQ for each QP per controller, a CQ for each core will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: Yamin Friedman <yaminf@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-rdma: use new shared CQ mechanism	Yamin Friedman
	Has the driver use shared CQs providing ~10%-20% improvement as seen in the patch introducing shared CQs. Instead of opening a CQ for each QP per controller connected, a CQ for each QP will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: Yamin Friedman <yaminf@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-pci: add support for ACPI StorageD3Enable property	David E. Box
	This patch implements a solution for a BIOS hack used on some currently shipping Intel systems to change driver power management policy for PCIe NVMe drives. Some newer Intel platforms, like some Comet Lake systems, require that PCIe devices use D3 when doing suspend-to-idle in order to allow the platform to realize maximum power savings. This is particularly needed to support ATX power supply shutdown on desktop systems. In order to ensure this happens for root ports with storage devices, Microsoft apparently created this ACPI _DSD property as a way to influence their driver policy. To my knowledge this property has not been discussed with the NVME specification body. Though the solution is not ideal, it addresses a problem that also affects Linux since the NVMe driver's default policy of using NVMe APST during suspend-to-idle prevents the PCI root port from going to D3 and leads to higher power consumption for these platforms. The power consumption difference may be negligible on laptop systems, but many watts on desktop systems when the ATX power supply is blocked from powering down. The patch creates a new nvme_acpi_storage_d3 function to check for the StorageD3Enable property during probe and enables D3 as a quirk if set. It also provides a 'noacpi' module parameter to allow skipping the quirk if needed. Tested with: - PM961 NVMe SED Samsung 512GB - INTEL SSDPEKKF512G8 Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-pci: use max of PRP or SGL for iod size	Chaitanya Kulkarni
	>From the initial implementation of NVMe SGL kernel support commit a7a7cbe353a5 ("nvme-pci: add SGL support") with addition of the commit 943e942e6266 ("nvme-pci: limit max IO size and segments to avoid high order allocations") now there is only caller left for nvme_pci_iod_alloc_size() which statically passes true for last parameter that calculates allocation size based on SGL since we need size of biggest command supported for mempool allocation. This patch modifies the helper functions nvme_pci_iod_alloc_size() such that it is now uses maximum of PRP and SGL size for iod allocation size calculation. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29	nvme-core: replace ctrl page size with a macro	Chaitanya Kulkarni
	Saving the nvme controller's page size was from a time when the driver tried to use different sized pages, but this value is always set to a constant, and has been this way for some time. Remove the 'page_size' field and replace its usage with the constant value. This also lets the compiler make some micro-optimizations in the io path, and that's always a good thing. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>