summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-03-15scsi: qla2xxx: Reduce false trigger to loginQuinn Tran
While a session is in the middle of a relogin, a late RSCN can be delivered from switch. RSCN trigger fabric scan where the scan logic can trigger another session login while a login is in progress. Reduce the extra trigger to prevent multiple logins to the same session. Link: https://lore.kernel.org/r/20220310092604.22950-10-njavali@marvell.com Fixes: bee8b84686c4 ("scsi: qla2xxx: Reduce redundant ADISC command for RSCNs") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix laggy FC remote port session recoveryQuinn Tran
For session recovery, driver relies on the dpc thread to initiate certain operations. The dpc thread runs exclusively without the Mailbox interface being occupied. A recent code change for heartbeat check via mailbox cmd 0 is preventing the dpc thread from carrying out its operation. This patch allows the higher priority error recovery to run first before running the lower priority heartbeat check. Link: https://lore.kernel.org/r/20220310092604.22950-9-njavali@marvell.com Fixes: d94d8158e184 ("scsi: qla2xxx: Add heartbeat check") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix hang due to session stuckQuinn Tran
User experienced device lost. The log shows Get port data base command was queued up, failed, and requeued again. Every time it is requeued, it set the FCF_ASYNC_ACTIVE. This prevents any recovery code from occurring because driver thinks a recovery is in progress for this session. In essence, this session is hung. The reason it gets into this place is the session deletion got in front of this call due to link perturbation. Break the requeue cycle and exit. The session deletion code will trigger a session relogin. Link: https://lore.kernel.org/r/20220310092604.22950-8-njavali@marvell.com Fixes: 726b85487067 ("qla2xxx: Add framework for async fabric discovery") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix N2N inconsistent PLOGIQuinn Tran
For N2N topology, ELS Passthrough is used to send PLOGI. On failure of ELS pass through PLOGI, driver flipped over to using LLIOCB PLOGI for N2N. This is not consistent. Delete the session to restart the connection where ELS pass through PLOGI would be used consistently. Link: https://lore.kernel.org/r/20220310092604.22950-7-njavali@marvell.com Fixes: c76ae845ea83 ("scsi: qla2xxx: Add error handling for PLOGI ELS passthrough") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix crash during module load unload testArun Easi
During purex packet handling the driver was incorrectly freeing a pre-allocated structure. Fix this by skipping that entry. System crashed with the following stack during a module unload test. Call Trace: sbitmap_init_node+0x7f/0x1e0 sbitmap_queue_init_node+0x24/0x150 blk_mq_init_bitmaps+0x3d/0xa0 blk_mq_init_tags+0x68/0x90 blk_mq_alloc_map_and_rqs+0x44/0x120 blk_mq_alloc_set_map_and_rqs+0x63/0x150 blk_mq_alloc_tag_set+0x11b/0x230 scsi_add_host_with_dma.cold+0x3f/0x245 qla2x00_probe_one+0xd5a/0x1b80 [qla2xxx] Call Trace with slub_debug and debug kernel: kasan_report_invalid_free+0x50/0x80 __kasan_slab_free+0x137/0x150 slab_free_freelist_hook+0xc6/0x190 kfree+0xe8/0x2e0 qla2x00_free_device+0x3bb/0x5d0 [qla2xxx] qla2x00_remove_one+0x668/0xcf0 [qla2xxx] Link: https://lore.kernel.org/r/20220310092604.22950-6-njavali@marvell.com Fixes: 62e9dd177732 ("scsi: qla2xxx: Change in PUREX to handle FPIN ELS requests") Cc: stable@vger.kernel.org Reported-by: Marco Patalano <mpatalan@redhat.com> Tested-by: Marco Patalano <mpatalan@redhat.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix missed DMA unmap for NVMe ls requestsArun Easi
At NVMe ELS request time, request structure is DMA mapped and never unmapped. Fix this by calling the unmap on ELS completion. Link: https://lore.kernel.org/r/20220310092604.22950-5-njavali@marvell.com Fixes: e84067d74301 ("scsi: qla2xxx: Add FC-NVMe F/W initialization and transport registration") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix loss of NVMe namespaces after driver reload testArun Easi
Driver registration of localport can race when it happens at the remote port discovery time. Fix this by calling the registration under a mutex. Link: https://lore.kernel.org/r/20220310092604.22950-4-njavali@marvell.com Fixes: e84067d74301 ("scsi: qla2xxx: Add FC-NVMe F/W initialization and transport registration") Cc: stable@vger.kernel.org Reported-by: Marco Patalano <mpatalan@redhat.com> Tested-by: Marco Patalano <mpatalan@redhat.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Arun Easi <aeasi@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix disk failure to rediscoverQuinn Tran
User experienced some of the LUN failed to get rediscovered after long cable pull test. The issue is triggered by a race condition between driver setting session online state vs starting the LUN scan process at the same time. Current code set the online state after notifying the session is available. In this case, trigger to start the LUN scan process happened before driver could set the session in online state. LUN scan ends up with failure due to the session online check was failing. Set the online state before reporting of the availability of the session. Link: https://lore.kernel.org/r/20220310092604.22950-3-njavali@marvell.com Fixes: aecf043443d3 ("scsi: qla2xxx: Fix Remote port registration") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: qla2xxx: Fix incorrect reporting of task management failureQuinn Tran
User experienced no task management error while target device is responding with error. The RSP_CODE field in the status IOCB is in little endian. Driver assumes it's big endian and it picked up erroneous data. Convert the data back to big endian as is on the wire. Link: https://lore.kernel.org/r/20220310092604.22950-2-njavali@marvell.com Fixes: faef62d13463 ("[SCSI] qla2xxx: Fix Task Management command asynchronous handling") Cc: stable@vger.kernel.org Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: libiscsi: Teardown iscsi_cls_conn gracefullyWenchao Hao
Commit 1b8d0300a3e9 ("scsi: libiscsi: Fix UAF in iscsi_conn_get_param()/iscsi_conn_teardown()") fixed an UAF in iscsi_conn_get_param() and introduced 2 tmp_xxx varibles. We can gracefully fix this UAF with the help of device_del(). Calling iscsi_remove_conn() at the beginning of iscsi_conn_teardown would make userspace unable to see iscsi_cls_conn. This way we we can free memory safely. Remove iscsi_destroy_conn() since it is no longer used. Link: https://lore.kernel.org/r/20220310015759.3296841-4-haowenchao@huawei.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Wenchao Hao <haowenchao@huawei.com> Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: libiscsi: Add iscsi_cls_conn to sysfs after initializationWenchao Hao
iscsi_create_conn() exposed iscsi_cls_conn to sysfs prior to initialization of iscsi_conn's dd_data. When userspace tried to access an attribute such as the connect address, a NULL pointer dereference was observed. Do not add iscsi_cls_conn to sysfs until it has been initialized. Remove iscsi_create_conn() since it is no longer used. Link: https://lore.kernel.org/r/20220310015759.3296841-3-haowenchao@huawei.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Wenchao Hao <haowenchao@huawei.com> Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: iscsi: Add helper functions to manage iscsi_cls_connWenchao Hao
- iscsi_alloc_conn(): Allocate and initialize iscsi_cls_conn - iscsi_add_conn(): Expose iscsi_cls_conn to userspace via sysfs - iscsi_remove_conn(): Remove iscsi_cls_conn from sysfs Link: https://lore.kernel.org/r/20220310015759.3296841-2-haowenchao@huawei.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Wenchao Hao <haowenchao@huawei.com> Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: core: Remove unreachable code warningYang Li
The smatch tool reported the following warning: drivers/scsi/scsi_error.c:1988 scsi_decide_disposition() warn: ignoring unreachable code. Remove the "default:return FAILED;" instead of "return FAILED;" reported by smatch, because compilers can provide more useful diagnostics about switch/case statements that do not have a default statement, especially if the "switch" applies to a value with enumeration type. Link: https://lore.kernel.org/r/20220301080448.112813-1-yang.lee@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-15scsi: megasas: Clean up some inconsistent indentingYang Li
Eliminate the following smatch warning: drivers/scsi/megaraid/megaraid_sas_fusion.c:5104 megasas_reset_fusion() warn: inconsistent indenting Link: https://lore.kernel.org/r/20220225011605.130927-1-yang.lee@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: aacraid: Clean up some inconsistent indentingJiapeng Chong
Eliminate the following smatch warning: drivers/scsi/aacraid/linit.c:867 aac_eh_tmf_hard_reset_fib() warn: inconsistent indenting. Link: https://lore.kernel.org/r/20220309005031.126504-1-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: mpt3sas: Page fault in reply q processingMatt Lupfer
A page fault was encountered in mpt3sas on a LUN reset error path: [ 145.763216] mpt3sas_cm1: Task abort tm failed: handle(0x0002),timeout(30) tr_method(0x0) smid(3) msix_index(0) [ 145.778932] scsi 1:0:0:0: task abort: FAILED scmd(0x0000000024ba29a2) [ 145.817307] scsi 1:0:0:0: attempting device reset! scmd(0x0000000024ba29a2) [ 145.827253] scsi 1:0:0:0: [sg1] tag#2 CDB: Receive Diagnostic 1c 01 01 ff fc 00 [ 145.837617] scsi target1:0:0: handle(0x0002), sas_address(0x500605b0000272b9), phy(0) [ 145.848598] scsi target1:0:0: enclosure logical id(0x500605b0000272b8), slot(0) [ 149.858378] mpt3sas_cm1: Poll ReplyDescriptor queues for completion of smid(0), task_type(0x05), handle(0x0002) [ 149.875202] BUG: unable to handle page fault for address: 00000007fffc445d [ 149.885617] #PF: supervisor read access in kernel mode [ 149.894346] #PF: error_code(0x0000) - not-present page [ 149.903123] PGD 0 P4D 0 [ 149.909387] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 149.917417] CPU: 24 PID: 3512 Comm: scsi_eh_1 Kdump: loaded Tainted: G S O 5.10.89-altav-1 #1 [ 149.934327] Hardware name: DDN 200NVX2 /200NVX2-MB , BIOS ATHG2.2.02.01 09/10/2021 [ 149.951871] RIP: 0010:_base_process_reply_queue+0x4b/0x900 [mpt3sas] [ 149.961889] Code: 0f 84 22 02 00 00 8d 48 01 49 89 fd 48 8d 57 38 f0 0f b1 4f 38 0f 85 d8 01 00 00 49 8b 45 10 45 31 e4 41 8b 55 0c 48 8d 1c d0 <0f> b6 03 83 e0 0f 3c 0f 0f 85 a2 00 00 00 e9 e6 01 00 00 0f b7 ee [ 149.991952] RSP: 0018:ffffc9000f1ebcb8 EFLAGS: 00010246 [ 150.000937] RAX: 0000000000000055 RBX: 00000007fffc445d RCX: 000000002548f071 [ 150.011841] RDX: 00000000ffff8881 RSI: 0000000000000001 RDI: ffff888125ed50d8 [ 150.022670] RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000ffff7fff [ 150.033445] R10: ffffc9000f1ebb68 R11: ffffc9000f1ebb60 R12: 0000000000000000 [ 150.044204] R13: ffff888125ed50d8 R14: 0000000000000080 R15: 34cdc00034cdea80 [ 150.054963] FS: 0000000000000000(0000) GS:ffff88dfaf200000(0000) knlGS:0000000000000000 [ 150.066715] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 150.076078] CR2: 00000007fffc445d CR3: 000000012448a006 CR4: 0000000000770ee0 [ 150.086887] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 150.097670] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 150.108323] PKRU: 55555554 [ 150.114690] Call Trace: [ 150.120497] ? printk+0x48/0x4a [ 150.127049] mpt3sas_scsih_issue_tm.cold.114+0x2e/0x2b3 [mpt3sas] [ 150.136453] mpt3sas_scsih_issue_locked_tm+0x86/0xb0 [mpt3sas] [ 150.145759] scsih_dev_reset+0xea/0x300 [mpt3sas] [ 150.153891] scsi_eh_ready_devs+0x541/0x9e0 [scsi_mod] [ 150.162206] ? __scsi_host_match+0x20/0x20 [scsi_mod] [ 150.170406] ? scsi_try_target_reset+0x90/0x90 [scsi_mod] [ 150.178925] ? blk_mq_tagset_busy_iter+0x45/0x60 [ 150.186638] ? scsi_try_target_reset+0x90/0x90 [scsi_mod] [ 150.195087] scsi_error_handler+0x3a5/0x4a0 [scsi_mod] [ 150.203206] ? __schedule+0x1e9/0x610 [ 150.209783] ? scsi_eh_get_sense+0x210/0x210 [scsi_mod] [ 150.217924] kthread+0x12e/0x150 [ 150.224041] ? kthread_worker_fn+0x130/0x130 [ 150.231206] ret_from_fork+0x1f/0x30 This is caused by mpt3sas_base_sync_reply_irqs() using an invalid reply_q pointer outside of the list_for_each_entry() loop. At the end of the full list traversal the pointer is invalid. Move the _base_process_reply_queue() call inside of the loop. Link: https://lore.kernel.org/r/d625deae-a958-0ace-2ba3-0888dd0a415b@ddn.com Fixes: 711a923c14d9 ("scsi: mpt3sas: Postprocessing of target and LUN reset") Cc: stable@vger.kernel.org Acked-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Matt Lupfer <mlupfer@ddn.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: target: Add iscsi/cpus_allowed_list in configfsMingzhe Zou
The RX/TX threads for iSCSI connection can be scheduled to any online CPUs, and will not be rescheduled. When binding other heavy load threads along with iSCSI connection RX/TX thread to the same CPU, the iSCSI performance will be worse. Add iscsi/cpus_allowed_list in configfs. The available CPU set of iSCSI connection RX/TX threads is allowed_cpus & online_cpus. If it is modified, all RX/TX threads will be rescheduled. Link: https://lore.kernel.org/r/20220301075500.14266-1-mingzhe.zou@easystack.cn Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: hisi_sas: Use libsas internal abort supportJohn Garry
Use the common libsas internal abort functionality. In addition, this driver has special handling for internal abort timeouts - specifically whether to reset the controller in that instance, so extend the API for that. Timeout is now increased to 20 * Hz from 6 * Hz. We also retry for failure now, but this should not make a difference. Link: https://lore.kernel.org/r/1647001432-239276-5-git-send-email-john.garry@huawei.com Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: pm8001: Use libsas internal abort supportJohn Garry
New special handling is added for SAS_PROTOCOL_INTERNAL_ABORT proto so that we may use the common queue command API. Link: https://lore.kernel.org/r/1647001432-239276-4-git-send-email-john.garry@huawei.com Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: libsas: Add sas_execute_internal_abort_dev()John Garry
Add support for a "device" variant of internal abort, which will abort all pending I/Os for a specific device. Link: https://lore.kernel.org/r/1647001432-239276-3-git-send-email-john.garry@huawei.com Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: libsas: Add sas_execute_internal_abort_single()John Garry
The internal abort feature is common to hisi_sas and pm8001 HBAs, and the driver support is similar also, so add a common handler. Two modes of operation will be supported: - single: Abort a single tagged command - device: Abort all commands associated with a specific domain device A new protocol is added, SAS_PROTOCOL_INTERNAL_ABORT, so the common queue command API may be re-used. Only add "single" support as a first step. Link: https://lore.kernel.org/r/1647001432-239276-2-git-send-email-john.garry@huawei.com Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14scsi: lpfc: Remove failing soft_wwn supportJames Smart
The soft_wwpn/soft_wwn functionality, which allows the driver to modify service parameters in an attempt to override the adapter-assigned WWN, was originally attempted to be removed roughly 6 yrs ago as new fabric features were being introduced that clashed with the implementation. In the end, the feature was left in with the user being responsible if things went south. We've reached a point where soft_wwn is no longer functional and is failing in almost all production use cases. Use of Fabric features such as Fabric Assigned WWPN and Automatic DPORT is now prevalent and the features require coordination between the adapter and driver that can't be solved by the simplistic update of the service parameters. As it is no longer functional, the feature is to be removed. There are still ways to override the adapter-assigned WWN but they require the admin to invoke bios/efi level menus. Link: https://lore.kernel.org/r/20220310154845.11125-1-jsmart2021@gmail.com Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14Input: aiptek - properly check endpoint typePavel Skripkin
Syzbot reported warning in usb_submit_urb() which is caused by wrong endpoint type. There was a check for the number of endpoints, but not for the type of endpoint. Fix it by replacing old desc.bNumEndpoints check with usb_find_common_endpoints() helper for finding endpoints Fail log: usb 5-1: BOGUS urb xfer, pipe 1 != type 3 WARNING: CPU: 2 PID: 48 at drivers/usb/core/urb.c:502 usb_submit_urb+0xed2/0x18a0 drivers/usb/core/urb.c:502 Modules linked in: CPU: 2 PID: 48 Comm: kworker/2:2 Not tainted 5.17.0-rc6-syzkaller-00226-g07ebd38a0da2 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Workqueue: usb_hub_wq hub_event ... Call Trace: <TASK> aiptek_open+0xd5/0x130 drivers/input/tablet/aiptek.c:830 input_open_device+0x1bb/0x320 drivers/input/input.c:629 kbd_connect+0xfe/0x160 drivers/tty/vt/keyboard.c:1593 Fixes: 8e20cf2bce12 ("Input: aiptek - fix crash on detecting device without endpoints") Reported-and-tested-by: syzbot+75cccf2b7da87fb6f84b@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20220308194328.26220-1-paskripkin@gmail.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2022-03-14block: don't merge across cgroup boundaries if blkcg is enabledTejun Heo
blk-iocost and iolatency are cgroup aware rq-qos policies but they didn't disable merges across different cgroups. This obviously can lead to accounting and control errors but more importantly to priority inversions - e.g. an IO which belongs to a higher priority cgroup or IO class may end up getting throttled incorrectly because it gets merged to an IO issued from a low priority cgroup. Fix it by adding blk_cgroup_mergeable() which is called from merge paths and rejects cross-cgroup and cross-issue_as_root merges. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Cc: Josef Bacik <jbacik@fb.com> Link: https://lore.kernel.org/r/Yi/eE/6zFNyWJ+qd@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14ice: use ice_is_vf_trusted helper functionJacob Keller
The ice_vc_cfg_promiscuous_mode_msg function directly checks ICE_VIRTCHNL_VF_CAP_PRIVILEGE, instead of using the existing helper function ice_is_vf_trusted. Switch this to use the helper function so that all trusted checks are consistent. This aids in any potential future refactor by ensuring consistent code. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: log an error message when eswitch fails to configureJacob Keller
When ice_eswitch_configure fails, print an error message to make it more obvious why VF initialization did not succeed. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: cleanup error logging for ice_ena_vfsJacob Keller
The ice_ena_vfs function and some of its sub-functions like ice_set_per_vf_res use a "if (<function>) { <print error> ; <exit> }" flow. This flow discards specialized errors reported by the called function. This style is generally not preferred as it makes tracing error sources more difficult. It also means we cannot log the actual error received properly. Refactor several calls in the ice_ena_vfs function that do this to catch the error in the 'ret' variable. Report this in the messages, and then return the more precise error value. Doing this reveals that ice_set_per_vf_res returns -EINVAL or -EIO in places where -ENOSPC makes more sense. Fix these calls up to return the more appropriate value. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: move ice_set_vf_port_vlan near other .ndo opsJacob Keller
The ice_set_vf_port_vlan function is located in ice_sriov.c very far away from the other .ndo operations that it is similar to. Move this so that its located near the other .ndo operation definitions. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: refactor spoofchk control code in ice_sriov.cJacob Keller
The API to control the VSI spoof checking for a VF VSI has three functions: enable, disable, and set. The set function takes the VSI and the VF and decides whether to call enable or disable based on the vf->spoofchk field. In some flows, vf->spoofchk is not yet set, such as the function used to control the setting for a VF. (vf->spoofchk is only updated after a success). Simplify this API by refactoring ice_vf_set_spoofchk_cfg to be "ice_vsi_apply_spoofchk" which takes the boolean and allows all callers to avoid having to determine whether to call enable or disable themselves. This matches the expected callers better, and will prevent the need to export more than one function when this code must be called from another file. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: rename ICE_MAX_VF_COUNT to avoid confusionJacob Keller
The ICE_MAX_VF_COUNT field is defined in ice_sriov.h. This count is true for SR-IOV but will not be true for all VF implementations, such as when the ice driver supports Scalable IOV. Rename this definition to clearly indicate ICE_MAX_SRIOV_VFS. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: remove unused definitions from ice_sriov.hJacob Keller
A few more macros exist in ice_sriov.h which are not used anywhere. These can be safely removed. Note that ICE_VIRTCHNL_VF_CAP_L2 capability is set but never checked anywhere in the driver. Thus it is also safe to remove. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: convert vf->vc_ops to a const pointerJacob Keller
The vc_ops structure is used to allow different handlers for virtchnl commands when the driver is in representor mode. The current implementation uses a copy of the ops table in each VF, and modifies this copy dynamically. The usual practice in kernel code is to store the ops table in a constant structure and point to different versions. This has a number of advantages: 1. Reduced memory usage. Each VF merely points to the correct table, so they're able to re-use the same constant lookup table in memory. 2. Consistency. It becomes more difficult to accidentally update or edit only one op call. Instead, the code switches to the correct able by a single pointer write. In general this is atomic, either the pointer is updated or its not. 3. Code Layout. The VF structure can store a pointer to the table without needing to have the full structure definition defined prior to the VF structure definition. This will aid in future refactoring of code by allowing the VF pointer to be kept in ice_vf_lib.h while the virtchnl ops table can be maintained in ice_virtchnl.h There is one major downside in the case of the vc_ops structure. Most of the operations in the table are the same between the two current implementations. This can appear to lead to duplication since each implementation must now fill in the complete table. It could make spotting the differences in the representor mode more challenging. Unfortunately, methods to make this less error prone either add complexity overhead (macros using CPP token concatenation) or don't work on all compilers we support (constant initializer from another constant structure). The cost of maintaining two structures does not out weigh the benefits of the constant table model. While we're making these changes, go ahead and rename the structure and implementations with "virtchnl" instead of "vc_vf_". This will more closely align with the planned file renaming, and avoid similar names when we later introduce a "vf ops" table for separating Scalable IOV and Single Root IOV implementations. Leave the accessor/assignment functions in order to avoid issues with compiling with options disabled. The interface makes it easier to handle when CONFIG_PCI_IOV is disabled in the kernel. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: remove circular header dependencies on ice.hJacob Keller
Several headers in the ice driver include ice.h even though they are themselves included by that header. The most notable of these is ice_common.h, but several other headers also do this. Such a recursive inclusion is problematic as it forces headers to be included in a strict order, otherwise compilation errors can result. The circular inclusions do not trigger an endless loop due to standard header inclusion guards, however other errors can occur. For example, ice_flow.h defines ice_rss_hash_cfg, which is used by ice_sriov.h as part of the definition of ice_vf_hash_ip_ctx. ice_flow.h includes ice_acl.h, which includes ice_common.h, and which finally includes ice.h. Since ice.h itself includes ice_sriov.h, this creates a circular dependency. The definition in ice_sriov.h requires things from ice_flow.h, but ice_flow.h itself will lead to trying to load ice_sriov.h as part of its process for expanding ice.h. The current code avoids this issue by having an implicit dependency without the include of ice_flow.h. If we were to fix that so that ice_sriov.h explicitly depends on ice_flow.h the following pattern would occur: ice_flow.h -> ice_acl.h -> ice_common.h -> ice.h -> ice_sriov.h At this point, during the expansion of, the header guard for ice_flow.h is already set, so when ice_sriov.h attempts to load the ice_flow.h header it is skipped. Then, we go on to begin including the rest of ice_sriov.h, including structure definitions which depend on ice_rss_hash_cfg. This produces a compiler warning because ice_rss_hash_cfg hasn't yet been included. Remember, we're just at the start of ice_flow.h! If the order of headers is incorrect (ice_flow.h is not implicitly loaded first in all files which include ice_sriov.h) then we get the same failure. Removing this recursive inclusion requires fixing a few cases where some headers depended on the header inclusions from ice.h. In addition, a few other changes are also required. Most notably, ice_hw_to_dev is implemented as a macro in ice_osdep.h, which is the likely reason that ice_common.h includes ice.h at all. This macro implementation requires the full definition of ice_pf in order to properly compile. Fix this by moving it to a function declared in ice_main.c, so that we do not require all files to depend on the layout of the ice_pf structure. Note that this change only fixes circular dependencies, but it does not fully resolve all implicit dependencies where one header may depend on the inclusion of another. I tried to fix as many of the implicit dependencies as I noticed, but fixing them all requires a somewhat tedious analysis of each header and attempting to compile it separately. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: rename ice_virtchnl_pf.c to ice_sriov.cJacob Keller
The ice_virtchnl_pf.c and ice_virtchnl_pf.h files are where most of the code for implementing Single Root IOV virtualization resides. This code includes support for bringing up and tearing down VFs, hooks into the kernel SR-IOV netdev operations, and for handling virtchnl messages from VFs. In the future, we plan to support Scalable IOV in addition to Single Root IOV as an alternative virtualization scheme. This implementation will re-use some but not all of the code in ice_virtchnl_pf.c To prepare for this future, we want to refactor and split up the code in ice_virtchnl_pf.c into the following scheme: * ice_vf_lib.[ch] Basic VF structures and accessors. This is where scheme-independent code will reside. * ice_virtchnl.[ch] Virtchnl message handling. This is where the bulk of the logic for processing messages from VFs using the virtchnl messaging scheme will reside. This is separated from ice_vf_lib.c because it is distinct and has a bulk of the processing code. * ice_sriov.[ch] Single Root IOV implementation, including initialization and the routines for interacting with SR-IOV based netdev operations. * (future) ice_siov.[ch] Scalable IOV implementation. As a first step, lets assume that all of the code in ice_virtchnl_pf.[ch] is for Single Root IOV. Rename this file to ice_sriov.c and its header to ice_sriov.h Future changes will further split out the code in these files following the plan outlined here. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14ice: rename ice_sriov.c to ice_vf_mbx.cJacob Keller
The ice_sriov.c file primarily contains code which handles the logic for mailbox overflow detection and some other utility functions related to the virtualization mailbox. The bulk of the SR-IOV implementation is actually found in ice_virtchnl_pf.c, and this file isn't strictly SR-IOV specific. In the future, the ice driver will support an additional virtualization scheme known as Scalable IOV, and the code in this file will be used for this alternative implementation. Rename this file (and its associated header) to ice_vf_mbx.c, so that we can later re-use the ice_sriov.c file as the SR-IOV specific file. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-03-14RDMA/qib: Fix typos in commentsJulia Lawall
Various spelling mistakes in comments. Detected with the help of Coccinelle. Link: https://lore.kernel.org/r/20220314115354.144023-23-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-03-14RDMA/mlx5: Fix memory leak in error flow for subscribe event routineYongzhi Liu
In case the second xa_insert() fails, the obj_event is not released. Fix the error unwind flow to free that memory to avoid a memory leak. Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Link: https://lore.kernel.org/r/1647018361-18266-1-git-send-email-lyz_cs@pku.edu.cn Signed-off-by: Yongzhi Liu <lyz_cs@pku.edu.cn> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-03-14Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error"Leon Romanovsky
This reverts commit 7c4a539ec38f4ce400a0f3fcb5ff6c940fcd67bb. which causes to the following error in mlx4. Destroy of kernel CQ shouldn't fail WARNING: CPU: 4 PID: 18064 at include/rdma/ib_verbs.h:3936 mlx4_ib_dealloc_xrcd+0x12e/0x1b0 [mlx4_ib] Modules linked in: bonding ib_ipoib ip_gre ipip tunnel4 geneve rdma_ucm nf_tables ib_umad mlx4_en mlx4_ib ib_uverbs mlx4_core ip6_gre gre ip6_tunnel tunnel6 iptable_raw openvswitch nsh rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay fuse [last unloaded: mlx4_core] CPU: 4 PID: 18064 Comm: ibv_xsrq_pingpo Not tainted 5.17.0-rc7_master_62c6ecb #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx4_ib_dealloc_xrcd+0x12e/0x1b0 [mlx4_ib] Code: 1e 93 08 00 40 80 fd 01 0f 87 fa f1 04 00 83 e5 01 0f 85 2b ff ff ff 48 c7 c7 20 4f b6 a0 c6 05 fd 92 08 00 01 e8 47 c9 82 e2 <0f> 0b e9 11 ff ff ff 0f b6 2d eb 92 08 00 40 80 fd 01 0f 87 b1 f1 RSP: 0018:ffff8881a4957750 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8881ac4b6800 RCX: 0000000000000000 RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed103492aedc RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8884d2e378eb R10: ffffed109a5c6f1d R11: 0000000000000001 R12: ffff888132620000 R13: ffff8881a4957a90 R14: ffff8881aa2d4000 R15: ffff8881a4957ad0 FS: 00007f0401747740(0000) GS:ffff8884d2e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f8ae036118 CR3: 000000012fe94005 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ib_dealloc_xrcd_user+0xce/0x120 [ib_core] ib_uverbs_dealloc_xrcd+0xad/0x210 [ib_uverbs] uverbs_free_xrcd+0xe8/0x190 [ib_uverbs] destroy_hw_idr_uobject+0x7a/0x130 [ib_uverbs] uverbs_destroy_uobject+0x164/0x730 [ib_uverbs] uobj_destroy+0x72/0xf0 [ib_uverbs] ib_uverbs_cmd_verbs+0x19fb/0x3110 [ib_uverbs] ib_uverbs_ioctl+0x169/0x260 [ib_uverbs] __x64_sys_ioctl+0x856/0x1550 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 7c4a539ec38f ("RDMA/core: Fix ib_qp_usecnt_dec() called when error") Link: https://lore.kernel.org/r/74c11029adaf449b3b9228a77cc82f39e9e892c8.1646851220.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-03-14RDMA/rxe: Remove useless argument for update_state()Chengguang Xu
The argument 'payload' is not used in update_state(), so just remove it. Link: https://lore.kernel.org/r/20220307145047.3235675-2-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-03-14RDMA/rxe: Change variable and function argument to proper typeChengguang Xu
The type of wqe length is u32 so in order to avoid overflow and shadow casting change variable and relevant function argument to proper type. Link: https://lore.kernel.org/r/20220307145047.3235675-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-03-14RDMA/irdma: Prevent some integer underflowsDan Carpenter
My static checker complains that: drivers/infiniband/hw/irdma/ctrl.c:3605 irdma_sc_ceq_init() warn: can subtract underflow 'info->dev->hmc_fpm_misc.max_ceqs'? It appears that "info->dev->hmc_fpm_misc.max_ceqs" comes from the firmware in irdma_sc_parse_fpm_query_buf() so, yes, there is a chance that it could be zero. Even if we trust the firmware, it's easy enough to change the condition just as a hardenning measure. Fixes: 3f49d6842569 ("RDMA/irdma: Implement HW Admin Queue OPs") Link: https://lore.kernel.org/r/20220307125928.GE16710@kili Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net coming late in the 5.17-rc process: 1) Revert port remap to mitigate shadowing service ports, this is causing problems in existing setups and this mitigation can be achieved with explicit ruleset, eg. ... tcp sport < 16386 tcp dport >= 32768 masquerade random This patches provided a built-in policy similar to the one described above. 2) Disable register tracking infrastructure in nf_tables. Florian reported two issues: - Existing expressions with no implemented .reduce interface that causes data-store on register should cancel the tracking. - Register clobbering might be possible storing data on registers that are larger than 32-bits. This might lead to generating incorrect ruleset bytecode. These two issues are scheduled to be addressed in the next release cycle. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: disable register tracking Revert "netfilter: conntrack: tag conntracks picked up in local out hook" Revert "netfilter: nat: force port remap to prevent shadowing well-known ports" ==================== Link: https://lore.kernel.org/r/20220312220315.64531-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-14nfp: flower: avoid newline at the end of message in NL_SET_ERR_MSG_MODNiklas Söderlund
Fix the following coccicheck warning: drivers/net/ethernet/netronome/nfp/flower/action.c:959:7-69: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20220312095823.2425775-1-niklas.soderlund@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-14net: phy: marvell: Fix invalid comparison in the resume and suspend functionsKurt Cancemi
This bug resulted in only the current mode being resumed and suspended when the PHY supported both fiber and copper modes and when the PHY only supported copper mode the fiber mode would incorrectly be attempted to be resumed and suspended. Fixes: 3758be3dc162 ("Marvell phy: add functions to suspend and resume both interfaces: fiber and copper links.") Signed-off-by: Kurt Cancemi <kurt@x64architecture.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220312201512.326047-1-kurt@x64architecture.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-14net/mlx5e: Fix use-after-free in mlx5e_stats_grp_sw_update_statsSaeed Mahameed
We need to sync page pool stats only for active channels. Reading ethtool stats on a down netdev or a netdev with modified number of channels will result in a user-after-free, trying to access page pools that are freed already. BUG: KASAN: use-after-free in mlx5e_stats_grp_sw_update_stats+0x465/0xf80 Read of size 8 at addr ffff888004835e40 by task ethtool/720 Fixes: cc10e84b2ec3 ("mlx5: add support for page_pool_get_stats") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Joe Damato <jdamato@fastly.com> Link: https://lore.kernel.org/r/20220312005353.786255-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-14block: fix rq-qos breakage from skipping rq_qos_done_bio()Tejun Heo
a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set. While this fixed a potential oops, it also broke blk-iocost by skipping the done_bio callback for merged bios. Before, whether a bio goes through rq_qos_throttle() or rq_qos_merge(), rq_qos_done_bio() would be called on the bio on completion with BIO_TRACKED distinguishing the former from the latter. rq_qos_done_bio() is not called for bios which wenth through rq_qos_merge(). This royally confuses blk-iocost as the merged bios never finish and are considered perpetually in-flight. One reliably reproducible failure mode is an intermediate cgroup geting stuck active preventing its children from being activated due to the leaf-only rule, leading to loss of control. The following is from resctl-bench protection scenario which emulates isolating a web server like workload from a memory bomb run on an iocost configuration which should yield a reasonable level of protection. # cat /sys/block/nvme2n1/device/model Samsung SSD 970 PRO 512GB # cat /sys/fs/cgroup/io.cost.model 259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025 # cat /sys/fs/cgroup/io.cost.qos 259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00 # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82 lat-imp% 0 0 0 0 0 4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6 Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96% The isolation result of 58.12% is close to what this device would show without any IO control. Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and calling rq_qos_done_bio() on them too. For consistency and clarity, rename BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into rq_qos_done_bio() so that it's next to the code paths that set the flags. With the patch applied, the above same benchmark shows: # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1 ... Memory Hog Summary ================== IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m Isolation and Request Latency Impact Distributions: min p01 p05 p10 p25 p50 p75 p90 p95 p99 max mean stdev isol% 84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42 2.81 lat-imp% 0 0 0 0 0 2.81 5.73 11.11 13.92 17.53 22.61 4.10 4.68 Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0% Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a647a524a467 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked") Cc: stable@vger.kernel.org # v5.15+ Cc: Ming Lei <ming.lei@redhat.com> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14net/mlx4_en: use kzallocJulia Lawall
Use kzalloc instead of kmalloc + memset. The semantic patch that makes this change is: (https://coccinelle.gitlabpages.inria.fr/website/) //<smpl> @@ expression res, size, flag; @@ - res = kmalloc(size, flag); + res = kzalloc(size, flag); ... - memset(res, 0, size); //</smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220312102705.71413-3-Julia.Lawall@inria.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-14block: release rq qos structures for queue without diskMing Lei
blkcg_init_queue() may add rq qos structures to request queue, previously blk_cleanup_queue() calls rq_qos_exit() to release them, but commit 8e141f9eb803 ("block: drain file system I/O on del_gendisk") moves rq_qos_exit() into del_gendisk(), so memory leak is caused because queues may not have disk, such as un-present scsi luns, nvme admin queue, ... Fixes the issue by adding rq_qos_exit() to blk_cleanup_queue() back. BTW, v5.18 won't need this patch any more since we move blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release handler, and patches have been in for-5.18/block. Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk") Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-14fs: Convert is_partially_uptodate to foliosMatthew Wilcox (Oracle)
Since the uptodate property is maintained on a per-folio basis, the is_partially_uptodate method should also take a folio. Fix the types at the same time so it's clear that it returns true/false and takes the count in bytes, not blocks. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14buffer: Add folio_buffers()Matthew Wilcox (Oracle)
While there is no intent to use large folios in filesystems using buffer heads, converting the filesystems to use single-page folios is still worth doing to remove legacy infrastructure and hidden calls to compound_head(). These helper functions are needed for that conversion to take place. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs