Age | Commit message (Collapse) | Author |
|
When running iperf in a two host configuration the following trace can
occur:
[ 319.728730] NETDEV WATCHDOG: ib0 (hfi1): transmit queue 0 timed out
The issue happens because the current implementation relies on the netif
txq being stopped to control the flushing of the tx list.
There are two resources that the transmit logic can wait on and stop the
txq:
- SDMA descriptors
- Ring space to hold completions
The ring space is tested on the sending side and relieved when the ring is
consumed in the napi tx reaping.
Unfortunately, that reaping can run conncurrently with the workqueue
flushing of the txlist. If the txq is started just before the workitem
executes, the txlist will never be flushed, leading to the txq being
stuck.
Fix by:
- Adding sleep/wakeup wrappers
* Use an atomic to control the call to the netif routines inside the
wrappers
- Use another atomic to record ring space exhaustion
* Only wakeup when the a ring space exhaustion has happened and it
relieved
Add additional wrappers to clarify the ring space resource handling.
Fixes: d99dc602e2a5 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
Link: https://lore.kernel.org/r/20200623204327.108092.4024.stgit@awfm-01.aw.intel.com
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The current code mishandles -EBUSY in two ways:
- The flow change doesn't test the return from the flush and runs on to
process the current packet racing with the wakeup processing
- The -EBUSY handling for a single packet inserts the tx into the txlist
after the submit call, racing with the same wakeup processing
Fix the first by dropping the skb and returning NETDEV_TX_OK.
Fix the second by insuring the the list entry within the txreq is inited
when allocated. This enables the sleep routine to detect that the txreq
has used the non-list api and queue the packet to the txlist.
Both flaws can lead to having the flushing thread executing in causing two
threads to manipulate the txlist.
Fixes: d99dc602e2a5 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
Link: https://lore.kernel.org/r/20200623204321.108092.83898.stgit@awfm-01.aw.intel.com
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When the try_module_get calls were removed from opening and closing of the
i2c debugfs file, the corresponding module_put calls were missed. This
results in an inaccurate module use count that requires a power cycle to
fix.
Fixes: 09fbca8e6240 ("IB/hfi1: No need to use try_module_get for debugfs")
Link: https://lore.kernel.org/r/20200623203230.106975.76240.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
We need to do some rework on the dummy netdev. Calling the free_netdev()
would normally make sense, and that will be addressed in an upcoming
patch. For now just revert the behavior to what it was before keeping the
unused variable removal part of the patch.
The dd->dumm_netdev is mainly used for packet receiving through
alloc_netdev_mqs() for typical net devices. A a result, it should be freed
with kfree instead of free_netdev() that leads to a crash when unloading
the hfi1 module:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 8000000855b54067 P4D 8000000855b54067 PUD 84a4f5067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 73 PID: 10299 Comm: modprobe Not tainted 5.6.0-rc5+ #1
Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
RIP: 0010:__hw_addr_flush+0x12/0x80
Code: 40 00 48 83 c4 08 4c 89 e7 5b 5d 41 5c e9 76 77 18 00 66 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 1f 48 39 df <48> 8b 2b 75 08 eb 4a 48 89 eb 48 89 c5 48 89 df e8 99 bf d0 ff 84
RSP: 0018:ffffb40e08783db8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
RDX: ffffb40e00000000 RSI: 0000000000000246 RDI: ffff88ab13662298
RBP: ffff88ab13662000 R08: 0000000000001549 R09: 0000000000001549
R10: 0000000000000001 R11: 0000000000aaaaaa R12: ffff88ab13662298
R13: ffff88ab1b259e20 R14: ffff88ab1b259e42 R15: 0000000000000000
FS: 00007fb39b534740(0000) GS:ffff88b31f940000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000084d3ea004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
dev_addr_flush+0x15/0x30
free_netdev+0x7e/0x130
hfi1_netdev_free+0x59/0x70 [hfi1]
remove_one+0x65/0x110 [hfi1]
pci_device_remove+0x3b/0xc0
device_release_driver_internal+0xec/0x1b0
driver_detach+0x46/0x90
bus_remove_driver+0x58/0xd0
pci_unregister_driver+0x26/0xa0
hfi1_mod_cleanup+0xc/0xd54 [hfi1]
__x64_sys_delete_module+0x16c/0x260
? exit_to_usermode_loop+0xa4/0xc0
do_syscall_64+0x5b/0x200
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 193ba03141bb ("IB/hfi1: Use free_netdev() in hfi1_netdev_free()")
Link: https://lore.kernel.org/r/20200623203224.106975.16926.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Per the datasheet for max6697, OVERT mask and ALERT mask are different.
For example, the 7th bit of OVERT is the local channel but for alert
mask, the 6th bit is the local channel. Therefore, we can't apply the
same mask for both registers. In addition to that, the max6697 driver
is supposed to be compatibale with different models. I manually went over
all the listed chips and made sure all chip types have the same layout.
Testing;
mask value of 0x9 should map to 0x44 for ALERT and 0x84 for OVERT.
I used iotool to read the reg value back to verify. I only tested this
change on max6581.
Reference:
https://datasheets.maximintegrated.com/en/ds/MAX6581.pdf
https://datasheets.maximintegrated.com/en/ds/MAX6697.pdf
https://datasheets.maximintegrated.com/en/ds/MAX6699.pdf
Signed-off-by: Chu Lin <linchuyuan@google.com>
Fixes: 5372d2d71c46e ("hwmon: Driver for Maxim MAX6697 and compatibles")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
The mpath disk node takes a reference on the request mpath
request queue when adding live path to the mpath gendisk.
However if we connected to an inaccessible path device_add_disk
is not called, so if we disconnect and remove the mpath gendisk
we endup putting an reference on the request queue that was
never taken [1].
Fix that to check if we ever added a live path (using
NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue
reference.
[1]:
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
CPU: 1 PID: 1372 Comm: nvme Tainted: G O 5.7.0-rc2+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xa6/0xf0
RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
FS: 00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
disk_release+0xa2/0xc0
device_release+0x28/0x80
kobject_put+0xa5/0x1b0
nvme_put_ns_head+0x26/0x70 [nvme_core]
nvme_put_ns+0x30/0x60 [nvme_core]
nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
kernfs_fop_write+0xc1/0x1a0
vfs_write+0xb6/0x1a0
ksys_write+0x5f/0xe0
do_syscall_64+0x52/0x1a0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Tested-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
In the following scenario scan_work and ana_work will deadlock:
When scan_work calls nvme_mpath_add_disk() this holds ana_lock
and invokes nvme_parse_ana_log(), which may issue IO
in device_add_disk() and hang waiting for an accessible path.
While nvme_mpath_set_live() only called when nvme_state_is_live(),
a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on
ANY ctrl will not be able to complete nvme_mpath_set_live()
on the same ns->head, which is required in order to update
the new accessible path and remove NVME_NS_ANA_PENDING..
Therefore IO never completes: deadlock [1].
Fix:
Move device_add_disk out of the head->lock and protect it with an
atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit.
[1]:
kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds.
kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u8:2 D 0 160 2 0x80004000
kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: schedule_preempt_disabled+0xe/0x10
kernel: __mutex_lock.isra.0+0x182/0x4f0
kernel: __mutex_lock_slowpath+0x13/0x20
kernel: mutex_lock+0x2e/0x40
kernel: nvme_update_ns_ana_state+0x22/0x60 [nvme_core]
kernel: nvme_update_ana_state+0xca/0xe0 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: nvme_read_ana_log+0x76/0x100 [nvme_core]
kernel: nvme_ana_work+0x15/0x20 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x4d/0x400
kernel: kthread+0x104/0x140
kernel: ret_from_fork+0x35/0x40
kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds.
kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u8:4 D 0 439 2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: efi_partition+0x1e6/0x708
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_mpath_add_disk+0xbe/0x100 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: nvme_scan_work+0x256/0x390 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x4d/0x400
kernel: kthread+0x104/0x140
kernel: ret_from_fork+0x35/0x40
Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Right now ns->head->lock is protecting namespace mutation
which is wrong and unneeded. Move it to only protect
against head mutations. While we're at it, remove unnecessary
ns->head reference as we already have head pointer.
The problem with this is that the head->lock spans
mpath disk node I/O that may block under some conditions (if
for example the controller is disconnecting or the path
became inaccessible), The locking scheme does not allow any
other path to enable itself, preventing blocked I/O to complete
and forward-progress from there.
This is a preparation patch for the fix in a subsequent patch
where the disk I/O will also be done outside the head->lock.
Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When scan_work calls nvme_mpath_add_disk() this holds ana_lock
and invokes nvme_parse_ana_log(), which may issue IO
in device_add_disk() and hang waiting for an accessible path.
While nvme_mpath_set_live() only called when nvme_state_is_live(),
a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
In order to recover and complete the IO ana_work on the same ctrl
should be able to update the path state and remove NVME_NS_ANA_PENDING.
The deadlock occurs because scan_work keeps holding ana_lock,
so ana_work hangs [1].
Fix:
Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy
of the ANA group desc, and then calls nvme_update_ns_ana_state() without
holding ana_lock.
[1]:
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: efi_partition+0x1e6/0x708
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: schedule_preempt_disabled+0xe/0x10
kernel: __mutex_lock.isra.0+0x182/0x4f0
kernel: ? __switch_to_asm+0x34/0x70
kernel: ? select_task_rq_fair+0x1aa/0x5c0
kernel: ? kvm_sched_clock_read+0x11/0x20
kernel: ? sched_clock+0x9/0x10
kernel: __mutex_lock_slowpath+0x13/0x20
kernel: mutex_lock+0x2e/0x40
kernel: nvme_read_ana_log+0x3a/0x100 [nvme_core]
kernel: nvme_ana_work+0x15/0x20 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x4d/0x400
kernel: kthread+0x104/0x140
kernel: ? process_one_work+0x380/0x380
kernel: ? kthread_park+0x80/0x80
kernel: ret_from_fork+0x35/0x40
Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")
When adding a new namespace to the head disk (via nvme_mpath_set_live)
we will see partition scan which triggers I/O on the mpath device node.
This process will usually be triggered from the scan_work which holds
the scan_lock. If I/O blocks (if we got ana change currently have only
available paths but none are accessible) this can deadlock on the head
disk bd_mutex as both partition scan I/O takes it, and head disk revalidation
takes it to check for resize (also triggered from scan_work on a different
path). See trace [1].
The mpath disk revalidation was originally added to detect online disk
size change, but this is no longer needed since commit cb224c3af4df
("nvme: Convert to use set_capacity_revalidate_and_notify") which already
updates resize info without unnecessarily revalidating the disk (the
mpath disk doesn't even implement .revalidate_disk fop).
[1]:
--
kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:9 D 0 494 2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: schedule_preempt_disabled+0xe/0x10
kernel: __mutex_lock.isra.0+0x182/0x4f0
kernel: __mutex_lock_slowpath+0x13/0x20
kernel: mutex_lock+0x2e/0x40
kernel: revalidate_disk+0x63/0xa0
kernel: __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
kernel: nvme_revalidate_disk+0xa4/0x160 [nvme_core]
kernel: ? evict+0x14c/0x1b0
kernel: revalidate_disk+0x2b/0xa0
kernel: nvme_validate_ns+0x49/0x940 [nvme_core]
kernel: ? blk_mq_free_request+0xd2/0x100
kernel: ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
kernel: ? process_one_work+0x380/0x380
kernel: ? kthread_park+0x80/0x80
kernel: ret_from_fork+0x1f/0x40
...
kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
kernel: Tainted: G OE 5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:1 D 0 2630 2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel: __schedule+0x2b9/0x6c0
kernel: schedule+0x42/0xb0
kernel: io_schedule+0x16/0x40
kernel: do_read_cache_page+0x438/0x830
kernel: ? __switch_to_asm+0x34/0x70
kernel: ? file_fdatawait_range+0x30/0x30
kernel: read_cache_page+0x12/0x20
kernel: read_dev_sector+0x27/0xc0
kernel: read_lba+0xc1/0x220
kernel: ? kmem_cache_alloc_trace+0x19c/0x230
kernel: efi_partition+0x1e6/0x708
kernel: ? vsnprintf+0x39e/0x4e0
kernel: ? snprintf+0x49/0x60
kernel: check_partition+0x154/0x244
kernel: rescan_partitions+0xae/0x280
kernel: __blkdev_get+0x40f/0x560
kernel: blkdev_get+0x3d/0x140
kernel: __device_add_disk+0x388/0x480
kernel: device_add_disk+0x13/0x20
kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel: ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel: nvme_validate_ns+0x396/0x940 [nvme_core]
kernel: ? blk_mq_free_request+0xd2/0x100
kernel: nvme_scan_work+0x24f/0x380 [nvme_core]
kernel: process_one_work+0x1db/0x380
kernel: worker_thread+0x249/0x400
kernel: kthread+0x104/0x140
kernel: ? process_one_work+0x380/0x380
kernel: ? kthread_park+0x80/0x80
kernel: ret_from_fork+0x1f/0x40
--
Fixes: fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The completion vector index that is given during CQ creation can't
exceed the number of support vectors by the underlying RDMA device. This
violation currently can accure, for example, in case one will try to
connect with N regular read/write queues and M poll queues and the sum
of N + M > num_supported_vectors. This will lead to failure in establish
a connection to remote target. Instead, in that case, share a completion
vector between queues.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Both admin's and drive's tagsets should be set according the numa
node of the controller.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Both admin's and drive's tagsets should be set according the numa
node of the controller.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Both admin's and drive's tagsets should be set according the numa node
of the controller.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Set the node value according to the PCI device numa node.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Initialize the node to NUMA_NO_NODE value. Transports that are aware of
numa node affinity can override it (e.g. RDMA transport set the affinity
according to the RDMA HCA).
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Move the struct block_device definition together with most of the
block layer definitions, as it has nothing to do with the rest of fs.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Large part of bio.h, blkdev.h and genhd.h are under ifdef CONFIG_BLOCK
for no good reason. Only stub out function that are called from
code that is not dependent on CONFIG_BLOCK and leave the harmless
other declarations around.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the !CONFIG_BLOCK stub to the same place as the non-stub
declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move most of the block related definition out of fs.h into more suitable
headers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just use IS_ENABLED instead of providing a stub for !CONFIG_BLOCK.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No one calls these functions without CONFIG_BLOCK, so don't bother
stubbing them out.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These are not defined anywhere, and contrary to the comments we really
do not care about out of tree code at all.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We can also thaw non-block file systems. Remove the CONFIG_BLOCK in
sysrq.c after making the prototype available unconditionally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Revert and incorret transformation that caused requests using remote
invalidation to never complete.
Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
This code was detected with the help of Coccinelle and, audited and
fixed manually.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
This code was detected with the help of Coccinelle and, audited and
fixed manually.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.
Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We make an assumption that a debugfs directory exists, but since
this can fail ensure it exists before allowing blktrace setup to
complete. Otherwise we end up stuffing blktrace files on the debugfs
root directory. In the worst case scenario this *in theory* can create
an eventual panic *iff* in the future a similarly named file is created
prior on the debugfs root directory. This theoretical crash can happen
due to a recursive removal followed by a specific dentry removal.
This doesn't fix any known crash, however I have seen the files
go into the main debugfs root directory in cases where the debugfs
directory was not created due to other internal bugs with blktrace
now fixed.
blktrace is also completely useless without this directory, so
this ensures to userspace we only setup blktrace if the kernel
can stuff files where they are supposed to go into.
debugfs directory creations typically aren't checked for, and we have
maintainers doing sweep removals of these checks, but since we need this
check to ensure proper userspace blktrace functionality we make sure
to annotate the justification for the check.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.
We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce, its however just a bit more difficult.
We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
script run_0004.sh from break-blktrace [0].
We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:
[ 252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[ 252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 252.436592] #PF: supervisor write access in kernel mode
[ 252.439822] #PF: error_code(0x0002) - not-present page
[ 252.442967] PGD 0 P4D 0
[ 252.444656] Oops: 0002 [#1] SMP NOPTI
[ 252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164
[ 252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 252.456343] RIP: 0010:down_write+0x15/0x40
[ 252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
45 08 5d
[ 252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
[ 252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
[ 252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[ 252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
[ 252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
[ 252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
[ 252.473346] FS: 00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
[ 252.475225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
[ 252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 252.479866] Call Trace:
[ 252.480322] simple_recursive_removal+0x4e/0x2e0
[ 252.481078] ? debugfs_remove+0x60/0x60
[ 252.481725] ? relay_destroy_buf+0x77/0xb0
[ 252.482662] debugfs_remove+0x40/0x60
[ 252.483518] blk_remove_buf_file_callback+0x5/0x10
[ 252.484328] relay_close_buf+0x2e/0x60
[ 252.484930] relay_open+0x1ce/0x2c0
[ 252.485520] do_blk_trace_setup+0x14f/0x2b0
[ 252.486187] __blk_trace_setup+0x54/0xb0
[ 252.486803] blk_trace_ioctl+0x90/0x140
[ 252.487423] ? do_sys_openat2+0x1ab/0x2d0
[ 252.488053] blkdev_ioctl+0x4d/0x260
[ 252.488636] block_ioctl+0x39/0x40
[ 252.489139] ksys_ioctl+0x87/0xc0
[ 252.489675] __x64_sys_ioctl+0x16/0x20
[ 252.490380] do_syscall_64+0x52/0x180
[ 252.491032] entry_SYSCALL_64_after_hwframe+0x44/0xa9
And the other on the device removal:
[ 128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[ 128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[ 128.619537] #PF: supervisor write access in kernel mode
[ 128.622700] #PF: error_code(0x0002) - not-present page
[ 128.625842] PGD 0 P4D 0
[ 128.627585] Oops: 0002 [#1] SMP NOPTI
[ 128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164
[ 128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 128.640471] RIP: 0010:down_write+0x15/0x40
[ 128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
45 08 5d
[ 128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
[ 128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
[ 128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[ 128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
[ 128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
[ 128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
[ 128.660500] FS: 00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
[ 128.662204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
[ 128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 128.667282] Call Trace:
[ 128.667801] simple_recursive_removal+0x4e/0x2e0
[ 128.668663] ? debugfs_remove+0x60/0x60
[ 128.669368] debugfs_remove+0x40/0x60
[ 128.669985] blk_trace_free+0xd/0x50
[ 128.670593] __blk_trace_remove+0x27/0x40
[ 128.671274] blk_trace_shutdown+0x30/0x40
[ 128.671935] blk_release_queue+0x95/0xf0
[ 128.672589] kobject_put+0xa5/0x1b0
[ 128.673188] disk_release+0xa2/0xc0
[ 128.673786] device_release+0x28/0x80
[ 128.674376] kobject_put+0xa5/0x1b0
[ 128.674915] loop_remove+0x39/0x50 [loop]
[ 128.675511] loop_control_ioctl+0x113/0x130 [loop]
[ 128.676199] ksys_ioctl+0x87/0xc0
[ 128.676708] __x64_sys_ioctl+0x16/0x20
[ 128.677274] do_syscall_64+0x52/0x180
[ 128.677823] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The common theme here is:
debugfs: Directory 'loop0' with parent 'block' already present
This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace but we only do this for make_request
drivers (multiqueue) block drivers. When you race a removal of these
devices with a blktrace setup you end up in a situation where the
make_request recursive debugfs removal will sweep away the blktrace
files and then later blktrace will also try to remove individual
dentries which are already NULL. The inverse is also possible and hence
the two types of use after frees.
We don't create the block debugfs directory on init for these types of
block devices:
* request-based block driver block devices
* every possible partition
* scsi-generic
And so, this race should in theory only be possible with make_request
drivers.
We can fix the UAF by simply re-using the debugfs directory for
make_request drivers (multiqueue) and only creating the ephemeral
directory for the other type of block devices. The new clarifications
on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
*prior* to processing a blktrace ensures the debugfs directories are
only created if no possible directory name clashes are possible.
This goes tested with:
o nvme partitions
o ISCSI with tgt, and blktracing against scsi-generic with:
o block
o tape
o cdrom
o media changer
o blktests
This patch is part of the work which disputes the severity of
CVE-2019-19770 which shows this issue is not a core debugfs issue, but
a misuse of debugfs within blktace.
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Be pedantic on removal as well and hold the mutex.
This should prevent uses of addition while we exit.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Ensure it is clear which lock is required on do_blk_trace_setup().
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.
blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d99523560 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.
The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.
We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.
Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Suggested-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().
We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This adds documentation for the gendisk / request_queue refcount
helpers.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the new blk_mq_complete_request_remote helper to avoid an indirect
function call in the completion fast path.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Factor a small sniplet of duplicated code into a new helper in
preparation for making this sniplet a little bit less trivial.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is a variant of blk_mq_complete_request_remote that only completes
the request if it needs to be bounced to another CPU or a softirq. If
the request can be completed locally the function returns false and lets
the driver complete it without requring and indirect function call.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper to decide if we can complete locally or need an IPI.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We don't really care if we get migrated during the I/O completion.
In the worth case we either perform an IPI that wasn't required, or
complete the request on a CPU which we just migrated off.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Both the softirq path for single queue devices and the multi-queue
completion handler share the same logic to figure out if we need an
IPI for the completion and eventually issue it. Merge the two
versions into a single unified code path.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Let the compile optimize out the entire IPI path, given that we are
obviously not going to use it.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Even for single queue devices there is no point in offloading a polled
completion to the softirq, given that blk_mq_force_complete_rq is called
from the polling thread in that case and thus there are no starvation
issues.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
By open coding raise_blk_irq in the only caller, and replacing the
ifdef CONFIG_SMP with an IS_ENABLED check the flow in the caller
can be significantly simplified.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a helper to deduplicate the logic that raises the block softirq.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__blk_complete_request is only called from the blk-mq code, and
duplicates a lot of code from blk-mq.c. Move it there to prepare
for better code sharing and simplifications.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This driver assumed that dmaengine_tx_status() could return
the residue even if the transfer was completed. However,
this was not correct usage [1] and this caused to break getting
the residue after the commit 24461d9792c2 ("dmaengine:
virt-dma: Fix access after free in vchan_complete()") actually.
So, this is possible to get wrong received size if the usb
controller gets a short packet. For example, g_zero driver
causes "bad OUT byte" errors.
The usb-dmac driver will support the callback_result, so this
driver can use it to get residue correctly. Note that even if
the usb-dmac driver has not supported the callback_result yet,
this patch doesn't cause any side-effects.
[1]
https://lore.kernel.org/dmaengine/20200616165550.GP2324254@vkoul-mobl/
Reported-by: Hien Dang <hien.dang.eb@renesas.com>
Fixes: 24461d9792c2 ("dmaengine: virt-dma: Fix access after free in vchan_complete()")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://lore.kernel.org/r/1592482277-19563-1-git-send-email-yoshihiro.shimoda.uh@renesas.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|