diff options
| author | Martin K. Petersen <martin.petersen@oracle.com> | 2023-05-22 17:09:51 -0400 |
|---|---|---|
| committer | Martin K. Petersen <martin.petersen@oracle.com> | 2023-05-22 17:09:51 -0400 |
| commit | 8b60e2189fcd8b10b592608256eb97aebfcff147 (patch) | |
| tree | 664f034a0f2579ccbfd59da2ae80728897b4952f /include | |
| parent | 7907ad748bdba8ac9ca47f0a650cc2e5d2ad6e24 (diff) | |
| parent | 18bd7718b5c489b3161b6c2ab4685d57c1e2da3b (diff) | |
Merge patch series "Add Command Duration Limits support"
Niklas Cassel <nks@flawful.org> says:
This series adds support for Command Duration Limits.
The series is based on linux tag: v6.4-rc1
The series can also be found in git: https://github.com/floatious/linux/commits/cdl-v7
=================
CDL in ATA / SCSI
=================
Command Duration Limits is defined in:
T13 ATA Command Set - 5 (ACS-5) and
T10 SCSI Primary Commands - 6 (SPC-6) respectively
(a simpler version of CDL is defined in T10 SPC-5).
CDL defines Duration Limits Descriptors (DLD).
7 DLDs for read commands and 7 DLDs for write commands.
Simply put, a DLD contains a limit and a policy.
A command can specify that a certain limit should be applied by setting
the DLD index field (3 bits, so 0-7) in the command itself.
The DLD index points to one of the 7 DLDs.
DLD index 0 means no descriptor, so no limit.
DLD index 1-7 means DLD 1-7.
A DLD can have a few different policies, but the two major ones are:
-Policy 0xF (abort), command will be completed with command aborted error
(ATA) or status CHECK CONDITION (SCSI), with sense data indicating that
the command timed out.
-Policy 0xD (complete-unavailable), command will be completed without
error (ATA) or status GOOD (SCSI), with sense data indicating that the
command timed out. Note that the command will not have transferred any
data to/from the device when the command timed out, even though the
command returned success.
Regardless of the CDL policy, in case of a CDL timeout, the I/O will
result in a -ETIME error to user-space.
The DLDs are defined in the CDL log page(s) and are readable and writable.
Reading and writing the CDL DLDs are outside the scope of the kernel.
If a user wants to read or write the descriptors, they can do so using a
user-space application that sends passthrough commands, such as cdl-tools:
https://github.com/westerndigitalcorporation/cdl-tools
================================
The introduction of ioprio hints
================================
What the kernel does provide, is a method to let I/O use one of the CDL DLDs
defined in the device. Note that the kernel will simply forward the DLD index
to the device, so the kernel currently does not know, nor does it need to know,
how the DLDs are defined inside the device.
The way that the CDL DLD index is supplied to the kernel is by introducing a
new 10 bit "ioprio hint" field within the existing 16 bit ioprio definition.
Currently, only 6 out of the 16 ioprio bits are in use, the remaining 10 bits
are unused, and are currently explicitly disallowed to be set by the kernel.
For now, we only add ioprio hints representing CDL DLD index 1-7. Additional
ioprio hints for other QoS features could be defined in the future.
A theoretical future work could be to make an I/O scheduler aware of these
hints. E.g. for CDL, an I/O scheduler could make use of the duration limit
in each descriptor, and take that information into account while scheduling
commands. Right now, the ioprio hints will be ignored by the I/O schedulers.
==============================
How to use CDL from user-space
==============================
Since CDL is mutually exclusive with NCQ priority
(see ncq_prio_enable and sas_ncq_prio_enable in
Documentation/ABI/testing/sysfs-block-device),
CDL has to be explicitly enabled using:
echo 1 > /sys/block/$bdev/device/cdl_enable
Since the ioprio hints are supplied through the existing I/O priority API,
it should be simple for an application to make use of the ioprio hints.
It simply has to reuse one of the new macros defined in
include/uapi/linux/ioprio.h: IOPRIO_PRIO_HINT() or IOPRIO_PRIO_VALUE_HINT(),
and supply one of the new hints defined in include/uapi/linux/ioprio.h:
IOPRIO_HINT_DEV_DURATION_LIMIT_[1-7], which indicates that the I/O should
use the corresponding CDL DLD index 1-7.
By reusing the I/O priority API, the user can both define a DLD to use per
AIO (io_uring sqe->ioprio or libaio iocb->aio_reqprio) or per-thread
(ioprio_set()).
=======
Testing
=======
With the following fio patches:
https://github.com/floatious/fio/commits/cdl
fio adds support for ioprio hints, such that CDL can be tested using e.g.:
fio --ioengine=io_uring --cmdprio_percentage=10 --cmdprio_hint=DLD_index
A simple way to test is to use a DLD with a very short duration limit,
and send large reads. Regardless of the CDL policy, in case of a CDL
timeout, the I/O will result in a -ETIME error to user-space.
We also provide a CDL test suite located in the cdl-tools repo, see:
https://github.com/westerndigitalcorporation/cdl-tools#testing-a-system-command-duration-limits-support
We have tested this patch series using:
-real hardware
-the following QEMU implementation:
https://github.com/floatious/qemu/tree/cdl
(NOTE: the QEMU implementation requires you to define the CDL policy at compile
time, so you currently need to recompile QEMU when switching between policies.)
===================
Further information
===================
For further information about CDL, see Damien's slides:
Presented at SDC 2021:
https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf
Presented at Lund Linux Con 2022:
https://drive.google.com/file/d/1I6ChFc0h4JY9qZdO1bY5oCAdYCSZVqWw/view?usp=sharing
================
Changes since V6
================
-Rebased series on v6.4-rc1.
-Picked up Reviewed-by tags from Hannes (Thank you Hannes!)
-Picked up Reviewed-by tag from Christoph (Thank you Christoph!)
-Changed KernelVersion from 6.4 to 6.5 for new sysfs attributes.
For older change logs, see previous patch series versions:
https://lore.kernel.org/linux-scsi/20230406113252.41211-1-nks@flawful.org/
https://lore.kernel.org/linux-scsi/20230404182428.715140-1-nks@flawful.org/
https://lore.kernel.org/linux-scsi/20230309215516.3800571-1-niklas.cassel@wdc.com/
https://lore.kernel.org/linux-scsi/20230124190308.127318-1-niklas.cassel@wdc.com/
https://lore.kernel.org/linux-scsi/20230112140412.667308-1-niklas.cassel@wdc.com/
https://lore.kernel.org/linux-scsi/20221208105947.2399894-1-niklas.cassel@wdc.com/
Link: https://lore.kernel.org/r/20230511011356.227789-1-nks@flawful.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Diffstat (limited to 'include')
| -rw-r--r-- | include/linux/ata.h | 11 | ||||
| -rw-r--r-- | include/linux/blk_types.h | 6 | ||||
| -rw-r--r-- | include/linux/libata.h | 42 | ||||
| -rw-r--r-- | include/scsi/scsi_cmnd.h | 5 | ||||
| -rw-r--r-- | include/scsi/scsi_device.h | 18 | ||||
| -rw-r--r-- | include/uapi/linux/ioprio.h | 68 |
6 files changed, 125 insertions, 25 deletions
diff --git a/include/linux/ata.h b/include/linux/ata.h index c224dbddb9b2..792e10a09787 100644 --- a/include/linux/ata.h +++ b/include/linux/ata.h @@ -322,15 +322,21 @@ enum { ATA_LOG_SATA_NCQ = 0x10, ATA_LOG_NCQ_NON_DATA = 0x12, ATA_LOG_NCQ_SEND_RECV = 0x13, + ATA_LOG_CDL = 0x18, + ATA_LOG_CDL_SIZE = ATA_SECT_SIZE, ATA_LOG_IDENTIFY_DEVICE = 0x30, + ATA_LOG_SENSE_NCQ = 0x0F, + ATA_LOG_SENSE_NCQ_SIZE = ATA_SECT_SIZE * 2, ATA_LOG_CONCURRENT_POSITIONING_RANGES = 0x47, /* Identify device log pages: */ + ATA_LOG_SUPPORTED_CAPABILITIES = 0x03, + ATA_LOG_CURRENT_SETTINGS = 0x04, ATA_LOG_SECURITY = 0x06, ATA_LOG_SATA_SETTINGS = 0x08, ATA_LOG_ZONED_INFORMATION = 0x09, - /* Identify device SATA settings log:*/ + /* Identify device SATA settings log: */ ATA_LOG_DEVSLP_OFFSET = 0x30, ATA_LOG_DEVSLP_SIZE = 0x08, ATA_LOG_DEVSLP_MDAT = 0x00, @@ -415,6 +421,8 @@ enum { SETFEATURES_SATA_ENABLE = 0x10, /* Enable use of SATA feature */ SETFEATURES_SATA_DISABLE = 0x90, /* Disable use of SATA feature */ + SETFEATURES_CDL = 0x0d, /* Enable/disable cmd duration limits */ + /* SETFEATURE Sector counts for SATA features */ SATA_FPDMA_OFFSET = 0x01, /* FPDMA non-zero buffer offsets */ SATA_FPDMA_AA = 0x02, /* FPDMA Setup FIS Auto-Activate */ @@ -425,6 +433,7 @@ enum { SATA_DEVSLP = 0x09, /* Device Sleep */ SETFEATURE_SENSE_DATA = 0xC3, /* Sense Data Reporting feature */ + SETFEATURE_SENSE_DATA_SUCC_NCQ = 0xC4, /* Sense Data for successful NCQ commands */ /* feature values for SET_MAX */ ATA_SET_MAX_ADDR = 0x00, diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 936e898016f8..bb573f87c8d1 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -171,6 +171,12 @@ typedef u16 blk_short_t; */ #define BLK_STS_OFFLINE ((__force blk_status_t)17) +/* + * BLK_STS_DURATION_LIMIT is returned from the driver when the target device + * aborted the command because it exceeded one of its Command Duration Limits. + */ +#define BLK_STS_DURATION_LIMIT ((__force blk_status_t)18) + /** * blk_path_error - returns true if error may be path related * @error: status the request was completed with diff --git a/include/linux/libata.h b/include/linux/libata.h index 311cd93377c7..5c8ef33b0af2 100644 --- a/include/linux/libata.h +++ b/include/linux/libata.h @@ -94,17 +94,19 @@ enum { ATA_DFLAG_DMADIR = (1 << 10), /* device requires DMADIR */ ATA_DFLAG_NCQ_SEND_RECV = (1 << 11), /* device supports NCQ SEND and RECV */ ATA_DFLAG_NCQ_PRIO = (1 << 12), /* device supports NCQ priority */ - ATA_DFLAG_CFG_MASK = (1 << 13) - 1, - - ATA_DFLAG_PIO = (1 << 13), /* device limited to PIO mode */ - ATA_DFLAG_NCQ_OFF = (1 << 14), /* device limited to non-NCQ mode */ - ATA_DFLAG_SLEEPING = (1 << 15), /* device is sleeping */ - ATA_DFLAG_DUBIOUS_XFER = (1 << 16), /* data transfer not verified */ - ATA_DFLAG_NO_UNLOAD = (1 << 17), /* device doesn't support unload */ - ATA_DFLAG_UNLOCK_HPA = (1 << 18), /* unlock HPA */ - ATA_DFLAG_INIT_MASK = (1 << 19) - 1, - - ATA_DFLAG_NCQ_PRIO_ENABLED = (1 << 19), /* Priority cmds sent to dev */ + ATA_DFLAG_CDL = (1 << 13), /* supports cmd duration limits */ + ATA_DFLAG_CFG_MASK = (1 << 14) - 1, + + ATA_DFLAG_PIO = (1 << 14), /* device limited to PIO mode */ + ATA_DFLAG_NCQ_OFF = (1 << 15), /* device limited to non-NCQ mode */ + ATA_DFLAG_SLEEPING = (1 << 16), /* device is sleeping */ + ATA_DFLAG_DUBIOUS_XFER = (1 << 17), /* data transfer not verified */ + ATA_DFLAG_NO_UNLOAD = (1 << 18), /* device doesn't support unload */ + ATA_DFLAG_UNLOCK_HPA = (1 << 19), /* unlock HPA */ + ATA_DFLAG_INIT_MASK = (1 << 20) - 1, + + ATA_DFLAG_NCQ_PRIO_ENABLED = (1 << 20), /* Priority cmds sent to dev */ + ATA_DFLAG_CDL_ENABLED = (1 << 21), /* cmd duration limits is enabled */ ATA_DFLAG_DETACH = (1 << 24), ATA_DFLAG_DETACHED = (1 << 25), ATA_DFLAG_DA = (1 << 26), /* device supports Device Attention */ @@ -115,7 +117,8 @@ enum { ATA_DFLAG_FEATURES_MASK = (ATA_DFLAG_TRUSTED | ATA_DFLAG_DA | \ ATA_DFLAG_DEVSLP | ATA_DFLAG_NCQ_SEND_RECV | \ - ATA_DFLAG_NCQ_PRIO | ATA_DFLAG_FUA), + ATA_DFLAG_NCQ_PRIO | ATA_DFLAG_FUA | \ + ATA_DFLAG_CDL), ATA_DEV_UNKNOWN = 0, /* unknown device */ ATA_DEV_ATA = 1, /* ATA device */ @@ -206,10 +209,12 @@ enum { ATA_QCFLAG_CLEAR_EXCL = (1 << 5), /* clear excl_link on completion */ ATA_QCFLAG_QUIET = (1 << 6), /* don't report device error */ ATA_QCFLAG_RETRY = (1 << 7), /* retry after failure */ + ATA_QCFLAG_HAS_CDL = (1 << 8), /* qc has CDL a descriptor set */ ATA_QCFLAG_EH = (1 << 16), /* cmd aborted and owned by EH */ ATA_QCFLAG_SENSE_VALID = (1 << 17), /* sense data valid */ ATA_QCFLAG_EH_SCHEDULED = (1 << 18), /* EH scheduled (obsolete) */ + ATA_QCFLAG_EH_SUCCESS_CMD = (1 << 19), /* EH should fetch sense for this successful cmd */ /* host set flags */ ATA_HOST_SIMPLEX = (1 << 0), /* Host is simplex, one DMA channel per host only */ @@ -308,8 +313,10 @@ enum { ATA_EH_RESET = ATA_EH_SOFTRESET | ATA_EH_HARDRESET, ATA_EH_ENABLE_LINK = (1 << 3), ATA_EH_PARK = (1 << 5), /* unload heads and stop I/O */ + ATA_EH_GET_SUCCESS_SENSE = (1 << 6), /* Get sense data for successful cmd */ - ATA_EH_PERDEV_MASK = ATA_EH_REVALIDATE | ATA_EH_PARK, + ATA_EH_PERDEV_MASK = ATA_EH_REVALIDATE | ATA_EH_PARK | + ATA_EH_GET_SUCCESS_SENSE, ATA_EH_ALL_ACTIONS = ATA_EH_REVALIDATE | ATA_EH_RESET | ATA_EH_ENABLE_LINK, @@ -709,6 +716,9 @@ struct ata_device { /* Concurrent positioning ranges */ struct ata_cpr_log *cpr_log; + /* Command Duration Limits log support */ + u8 cdl[ATA_LOG_CDL_SIZE]; + /* error history */ int spdn_cnt; /* ering is CLEAR_END, read comment above CLEAR_END */ @@ -860,6 +870,7 @@ struct ata_port { struct ata_acpi_gtm __acpi_init_gtm; /* use ata_acpi_init_gtm() */ #endif /* owned by EH */ + u8 *ncq_sense_buf; u8 sector_buf[ATA_SECT_SIZE] ____cacheline_aligned; }; @@ -1178,6 +1189,7 @@ extern int sata_link_hardreset(struct ata_link *link, bool *online, int (*check_ready)(struct ata_link *)); extern int sata_link_resume(struct ata_link *link, const unsigned long *params, unsigned long deadline); +extern int ata_eh_read_sense_success_ncq_log(struct ata_link *link); extern void ata_eh_analyze_ncq_error(struct ata_link *link); #else static inline const unsigned long * @@ -1215,6 +1227,10 @@ static inline int sata_link_resume(struct ata_link *link, { return -EOPNOTSUPP; } +static inline int ata_eh_read_sense_success_ncq_log(struct ata_link *link) +{ + return -EOPNOTSUPP; +} static inline void ata_eh_analyze_ncq_error(struct ata_link *link) { } #endif extern int sata_link_debounce(struct ata_link *link, diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h index c2cb5f69635c..526def14e7fb 100644 --- a/include/scsi/scsi_cmnd.h +++ b/include/scsi/scsi_cmnd.h @@ -52,6 +52,11 @@ struct scsi_pointer { #define SCMD_TAGGED (1 << 0) #define SCMD_INITIALIZED (1 << 1) #define SCMD_LAST (1 << 2) +/* + * libata uses SCSI EH to fetch sense data for successful commands. + * SCSI EH should not overwrite scmd->result when SCMD_FORCE_EH_SUCCESS is set. + */ +#define SCMD_FORCE_EH_SUCCESS (1 << 3) #define SCMD_FAIL_IF_RECOVERING (1 << 4) /* flags preserved across unprep / reprep */ #define SCMD_PRESERVED_FLAGS (SCMD_INITIALIZED | SCMD_FAIL_IF_RECOVERING) diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index f10a008e5bfa..b2cdb078b7bd 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -218,6 +218,9 @@ struct scsi_device { unsigned silence_suspend:1; /* Do not print runtime PM related messages */ unsigned no_vpd_size:1; /* No VPD size reported in header */ + unsigned cdl_supported:1; /* Command duration limits supported */ + unsigned cdl_enable:1; /* Enable/disable Command duration limits */ + unsigned int queue_stopped; /* request queue is quiesced */ bool offline_already; /* Device offline message logged */ @@ -364,6 +367,8 @@ extern int scsi_register_device_handler(struct scsi_device_handler *scsi_dh); extern void scsi_remove_device(struct scsi_device *); extern int scsi_unregister_device_handler(struct scsi_device_handler *scsi_dh); void scsi_attach_vpd(struct scsi_device *sdev); +void scsi_cdl_check(struct scsi_device *sdev); +int scsi_cdl_enable(struct scsi_device *sdev, bool enable); extern struct scsi_device *scsi_device_from_queue(struct request_queue *q); extern int __must_check scsi_device_get(struct scsi_device *); @@ -421,10 +426,10 @@ extern int scsi_track_queue_full(struct scsi_device *, int); extern int scsi_set_medium_removal(struct scsi_device *, char); -extern int scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage, - unsigned char *buffer, int len, int timeout, - int retries, struct scsi_mode_data *data, - struct scsi_sense_hdr *); +int scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage, + int subpage, unsigned char *buffer, int len, int timeout, + int retries, struct scsi_mode_data *data, + struct scsi_sense_hdr *); extern int scsi_mode_select(struct scsi_device *sdev, int pf, int sp, unsigned char *buffer, int len, int timeout, int retries, struct scsi_mode_data *data, @@ -433,8 +438,9 @@ extern int scsi_test_unit_ready(struct scsi_device *sdev, int timeout, int retries, struct scsi_sense_hdr *sshdr); extern int scsi_get_vpd_page(struct scsi_device *, u8 page, unsigned char *buf, int buf_len); -extern int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, - unsigned int len, unsigned char opcode); +int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer, + unsigned int len, unsigned char opcode, + unsigned short sa); extern int scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state); extern struct scsi_event *sdev_evt_alloc(enum scsi_device_event evt_type, diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h index f70f2596a6bf..4c4806e8230b 100644 --- a/include/uapi/linux/ioprio.h +++ b/include/uapi/linux/ioprio.h @@ -17,7 +17,7 @@ ((data) & IOPRIO_PRIO_MASK)) /* - * These are the io priority groups as implemented by the BFQ and mq-deadline + * These are the io priority classes as implemented by the BFQ and mq-deadline * schedulers. RT is the realtime class, it always gets premium service. For * ATA disks supporting NCQ IO priority, RT class IOs will be processed using * high priority NCQ commands. BE is the best-effort scheduling class, the @@ -32,11 +32,20 @@ enum { }; /* - * The RT and BE priority classes both support up to 8 priority levels. + * The RT and BE priority classes both support up to 8 priority levels that + * can be specified using the lower 3-bits of the priority data. */ -#define IOPRIO_NR_LEVELS 8 -#define IOPRIO_BE_NR IOPRIO_NR_LEVELS +#define IOPRIO_LEVEL_NR_BITS 3 +#define IOPRIO_NR_LEVELS (1 << IOPRIO_LEVEL_NR_BITS) +#define IOPRIO_LEVEL_MASK (IOPRIO_NR_LEVELS - 1) +#define IOPRIO_PRIO_LEVEL(ioprio) ((ioprio) & IOPRIO_LEVEL_MASK) +#define IOPRIO_BE_NR IOPRIO_NR_LEVELS + +/* + * Possible values for the "which" argument of the ioprio_get() and + * ioprio_set() system calls (see "man ioprio_set"). + */ enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, @@ -44,9 +53,58 @@ enum { }; /* - * Fallback BE priority level. + * Fallback BE class priority level. */ #define IOPRIO_NORM 4 #define IOPRIO_BE_NORM IOPRIO_NORM +/* + * The 10 bits between the priority class and the priority level are used to + * optionally define I/O hints for any combination of I/O priority class and + * level. Depending on the kernel configuration, I/O scheduler being used and + * the target I/O device being used, hints can influence how I/Os are processed + * without affecting the I/O scheduling ordering defined by the I/O priority + * class and level. + */ +#define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS +#define IOPRIO_HINT_NR_BITS 10 +#define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS) +#define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1) +#define IOPRIO_PRIO_HINT(ioprio) \ + (((ioprio) >> IOPRIO_HINT_SHIFT) & IOPRIO_HINT_MASK) + +/* + * Alternate macro for IOPRIO_PRIO_VALUE() to define an I/O priority with + * a class, level and hint. + */ +#define IOPRIO_PRIO_VALUE_HINT(class, level, hint) \ + ((((class) & IOPRIO_CLASS_MASK) << IOPRIO_CLASS_SHIFT) | \ + (((hint) & IOPRIO_HINT_MASK) << IOPRIO_HINT_SHIFT) | \ + ((level) & IOPRIO_LEVEL_MASK)) + +/* + * I/O hints. + */ +enum { + /* No hint */ + IOPRIO_HINT_NONE = 0, + + /* + * Device command duration limits: indicate to the device a desired + * duration limit for the commands that will be used to process an I/O. + * These will currently only be effective for SCSI and ATA devices that + * support the command duration limits feature. If this feature is + * enabled, then the commands issued to the device to process an I/O with + * one of these hints set will have the duration limit index (dld field) + * set to the value of the hint. + */ + IOPRIO_HINT_DEV_DURATION_LIMIT_1 = 1, + IOPRIO_HINT_DEV_DURATION_LIMIT_2 = 2, + IOPRIO_HINT_DEV_DURATION_LIMIT_3 = 3, + IOPRIO_HINT_DEV_DURATION_LIMIT_4 = 4, + IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5, + IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6, + IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, +}; + #endif /* _UAPI_LINUX_IOPRIO_H */ |
