From 4419470191386456e0b8ed4eb06a70b0021798a6 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:26:07 -0700 Subject: Documentation: Add documentation for Processor MMIO Stale Data Add the admin guide for Processor MMIO stale data vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- Documentation/admin-guide/hw-vuln/index.rst | 1 + .../hw-vuln/processor_mmio_stale_data.rst | 246 +++++++++++++++++++++ 2 files changed, 247 insertions(+) create mode 100644 Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 8cbc711cda93..4df436e7c417 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -17,3 +17,4 @@ are configurable at compile, boot or run time. special-register-buffer-data-sampling.rst core-scheduling.rst l1d_flush.rst + processor_mmio_stale_data.rst diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst new file mode 100644 index 000000000000..9393c50b5afc --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -0,0 +1,246 @@ +========================================= +Processor MMIO Stale Data Vulnerabilities +========================================= + +Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O +(MMIO) vulnerabilities that can expose data. The sequences of operations for +exposing data range from simple to very complex. Because most of the +vulnerabilities require the attacker to have access to MMIO, many environments +are not affected. System environments using virtualization where MMIO access is +provided to untrusted guests may need mitigation. These vulnerabilities are +not transient execution attacks. However, these vulnerabilities may propagate +stale data into core fill buffers where the data can subsequently be inferred +by an unmitigated transient execution attack. Mitigation for these +vulnerabilities includes a combination of microcode update and software +changes, depending on the platform and usage model. Some of these mitigations +are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or +those used to mitigate Special Register Buffer Data Sampling (SRBDS). + +Data Propagators +================ +Propagators are operations that result in stale data being copied or moved from +one microarchitectural buffer or register to another. Processor MMIO Stale Data +Vulnerabilities are operations that may result in stale data being directly +read into an architectural, software-visible state or sampled from a buffer or +register. + +Fill Buffer Stale Data Propagator (FBSDP) +----------------------------------------- +Stale data may propagate from fill buffers (FB) into the non-coherent portion +of the uncore on some non-coherent writes. Fill buffer propagation by itself +does not make stale data architecturally visible. Stale data must be propagated +to a location where it is subject to reading or sampling. + +Sideband Stale Data Propagator (SSDP) +------------------------------------- +The sideband stale data propagator (SSDP) is limited to the client (including +Intel Xeon server E3) uncore implementation. The sideband response buffer is +shared by all client cores. For non-coherent reads that go to sideband +destinations, the uncore logic returns 64 bytes of data to the core, including +both requested data and unrequested stale data, from a transaction buffer and +the sideband response buffer. As a result, stale data from the sideband +response and transaction buffers may now reside in a core fill buffer. + +Primary Stale Data Propagator (PSDP) +------------------------------------ +The primary stale data propagator (PSDP) is limited to the client (including +Intel Xeon server E3) uncore implementation. Similar to the sideband response +buffer, the primary response buffer is shared by all client cores. For some +processors, MMIO primary reads will return 64 bytes of data to the core fill +buffer including both requested data and unrequested stale data. This is +similar to the sideband stale data propagator. + +Vulnerabilities +=============== +Device Register Partial Write (DRPW) (CVE-2022-21166) +----------------------------------------------------- +Some endpoint MMIO registers incorrectly handle writes that are smaller than +the register size. Instead of aborting the write or only copying the correct +subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than +specified by the write transaction may be written to the register. On +processors affected by FBSDP, this may expose stale data from the fill buffers +of the core that created the write transaction. + +Shared Buffers Data Sampling (SBDS) (CVE-2022-21125) +---------------------------------------------------- +After propagators may have moved data around the uncore and copied stale data +into client core fill buffers, processors affected by MFBDS can leak data from +the fill buffer. It is limited to the client (including Intel Xeon server E3) +uncore implementation. + +Shared Buffers Data Read (SBDR) (CVE-2022-21123) +------------------------------------------------ +It is similar to Shared Buffer Data Sampling (SBDS) except that the data is +directly read into the architectural software-visible state. It is limited to +the client (including Intel Xeon server E3) uncore implementation. + +Affected Processors +=================== +Not all the CPUs are affected by all the variants. For instance, most +processors for the server market (excluding Intel Xeon E3 processors) are +impacted by only Device Register Partial Write (DRPW). + +Below is the list of affected Intel processors [#f1]_: + + =================== ============ ========= + Common name Family_Model Steppings + =================== ============ ========= + HASWELL_X 06_3FH 2,4 + SKYLAKE_L 06_4EH 3 + BROADWELL_X 06_4FH All + SKYLAKE_X 06_55H 3,4,6,7,11 + BROADWELL_D 06_56H 3,4,5 + SKYLAKE 06_5EH 3 + ICELAKE_X 06_6AH 4,5,6 + ICELAKE_D 06_6CH 1 + ICELAKE_L 06_7EH 5 + ATOM_TREMONT_D 06_86H All + LAKEFIELD 06_8AH 1 + KABYLAKE_L 06_8EH 9 to 12 + ATOM_TREMONT 06_96H 1 + ATOM_TREMONT_L 06_9CH 0 + KABYLAKE 06_9EH 9 to 13 + COMETLAKE 06_A5H 2,3,5 + COMETLAKE_L 06_A6H 0,1 + ROCKETLAKE 06_A7H 1 + =================== ============ ========= + +If a CPU is in the affected processor list, but not affected by a variant, it +is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later +section, mitigation largely remains the same for all the variants, i.e. to +clear the CPU fill buffers via VERW instruction. + +New bits in MSRs +================ +Newer processors and microcode update on existing affected processors added new +bits to IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate +specific variants of Processor MMIO Stale Data vulnerabilities and mitigation +capability. + +MSR IA32_ARCH_CAPABILITIES +-------------------------- +Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the + Shared Buffers Data Read (SBDR) vulnerability or the sideband stale + data propagator (SSDP). +Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer + Stale Data Propagator (FBSDP). +Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data + Propagator (PSDP). +Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer + values as part of MD_CLEAR operations. Processors that do not + enumerate MDS_NO (meaning they are affected by MDS) but that do + enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate + FB_CLEAR as part of their MD_CLEAR support. +Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR + IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS + bit can be set to cause the VERW instruction to not perform the + FB_CLEAR action. Not all processors that support FB_CLEAR will support + FB_CLEAR_CTRL. + +MSR IA32_MCU_OPT_CTRL +--------------------- +Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR +action. This may be useful to reduce the performance impact of FB_CLEAR in +cases where system software deems it warranted (for example, when performance +is more critical, or the untrusted software has no MMIO access). Note that +FB_CLEAR_DIS has no impact on enumeration (for example, it does not change +FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors +that enumerate FB_CLEAR. + +Mitigation +========== +Like MDS, all variants of Processor MMIO Stale Data vulnerabilities have the +same mitigation strategy to force the CPU to clear the affected buffers before +an attacker can extract the secrets. + +This is achieved by using the otherwise unused and obsolete VERW instruction in +combination with a microcode update. The microcode clears the affected CPU +buffers when the VERW instruction is executed. + +Kernel reuses the MDS function to invoke the buffer clearing: + + mds_clear_cpu_buffers() + +On MDS affected CPUs, the kernel already invokes CPU buffer clear on +kernel/userspace, hypervisor/guest and C-state (idle) transitions. No +additional mitigation is needed on such CPUs. + +For CPUs not affected by MDS or TAA, mitigation is needed only for the attacker +with MMIO capability. Therefore, VERW is not required for kernel/userspace. For +virtualization case, VERW is only needed at VMENTER for a guest with MMIO +capability. + +Mitigation points +----------------- +Return to user space +^^^^^^^^^^^^^^^^^^^^ +Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation +needed. + +C-State transition +^^^^^^^^^^^^^^^^^^ +Control register writes by CPU during C-state transition can propagate data +from fill buffer to uncore buffers. Execute VERW before C-state transition to +clear CPU fill buffers. + +Guest entry point +^^^^^^^^^^^^^^^^^ +Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise +execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by +MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO +Stale Data vulnerabilities, so there is no need to execute VERW for such guests. + +Mitigation control on the kernel command line +--------------------------------------------- +The kernel command line allows to control the Processor MMIO Stale Data +mitigations at boot time with the option "mmio_stale_data=". The valid +arguments for this option are: + + ========== ================================================================= + full If the CPU is vulnerable, enable mitigation; CPU buffer clearing + on exit to userspace and when entering a VM. Idle transitions are + protected as well. It does not automatically disable SMT. + full,nosmt Same as full, with SMT disabled on vulnerable CPUs. This is the + complete mitigation. + off Disables mitigation completely. + ========== ================================================================= + +If the CPU is affected and mmio_stale_data=off is not supplied on the kernel +command line, then the kernel selects the appropriate mitigation. + +Mitigation status information +----------------------------- +The Linux kernel provides a sysfs interface to enumerate the current +vulnerability status of the system: whether the system is vulnerable, and +which mitigations are active. The relevant sysfs file is: + + /sys/devices/system/cpu/vulnerabilities/mmio_stale_data + +The possible values in this file are: + + .. list-table:: + + * - 'Not affected' + - The processor is not vulnerable + * - 'Vulnerable' + - The processor is vulnerable, but no mitigation enabled + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The processor is vulnerable, but microcode is not updated. The + mitigation is enabled on a best effort basis. + * - 'Mitigation: Clear CPU buffers' + - The processor is vulnerable and the CPU buffer clearing mitigation is + enabled. + +If the processor is vulnerable then the following information is appended to +the above information: + + ======================== =========================================== + 'SMT vulnerable' SMT is enabled + 'SMT disabled' SMT is disabled + 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown + ======================== =========================================== + +References +---------- +.. [#f1] Affected Processors + https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html -- cgit From 51802186158c74a0304f51ab963e7c2b3a2b046f Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:27:08 -0700 Subject: x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For more details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst Add the Processor MMIO Stale Data bug enumeration. A microcode update adds new bits to the MSR IA32_ARCH_CAPABILITIES, define them. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/msr-index.h | 19 ++++++++++++++ arch/x86/kernel/cpu/common.c | 43 ++++++++++++++++++++++++++++++-- tools/arch/x86/include/asm/cpufeatures.h | 1 + tools/arch/x86/include/asm/msr-index.h | 19 ++++++++++++++ 5 files changed, 81 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 73e643ae94b6..e17de69faa54 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -443,5 +443,6 @@ #define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */ #define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */ #define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */ +#define X86_BUG_MMIO_STALE_DATA X86_BUG(25) /* CPU is affected by Processor MMIO Stale Data vulnerabilities */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index ee15311b6be1..12976405441b 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -114,6 +114,25 @@ * Not susceptible to * TSX Async Abort (TAA) vulnerabilities. */ +#define ARCH_CAP_SBDR_SSDP_NO BIT(13) /* + * Not susceptible to SBDR and SSDP + * variants of Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FBSDP_NO BIT(14) /* + * Not susceptible to FBSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_PSDP_NO BIT(15) /* + * Not susceptible to PSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FB_CLEAR BIT(17) /* + * VERW clears CPU fill buffer + * even on MDS_NO CPUs. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index e342ae4db3c4..f7757409e133 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1237,18 +1237,39 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = { X86_FEATURE_ANY, issues) #define SRBDS BIT(0) +/* CPU is affected by X86_BUG_MMIO_STALE_DATA */ +#define MMIO BIT(1) static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = { VULNBL_INTEL_STEPPINGS(IVYBRIDGE, X86_STEPPING_ANY, SRBDS), VULNBL_INTEL_STEPPINGS(HASWELL, X86_STEPPING_ANY, SRBDS), VULNBL_INTEL_STEPPINGS(HASWELL_L, X86_STEPPING_ANY, SRBDS), VULNBL_INTEL_STEPPINGS(HASWELL_G, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(HASWELL_X, BIT(2) | BIT(4), MMIO), + VULNBL_INTEL_STEPPINGS(BROADWELL_D, X86_STEPPINGS(0x3, 0x5), MMIO), VULNBL_INTEL_STEPPINGS(BROADWELL_G, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(BROADWELL_X, X86_STEPPING_ANY, MMIO), VULNBL_INTEL_STEPPINGS(BROADWELL, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(SKYLAKE_L, X86_STEPPINGS(0x3, 0x3), SRBDS | MMIO), VULNBL_INTEL_STEPPINGS(SKYLAKE_L, X86_STEPPING_ANY, SRBDS), + VULNBL_INTEL_STEPPINGS(SKYLAKE_X, BIT(3) | BIT(4) | BIT(6) | + BIT(7) | BIT(0xB), MMIO), + VULNBL_INTEL_STEPPINGS(SKYLAKE, X86_STEPPINGS(0x3, 0x3), SRBDS | MMIO), VULNBL_INTEL_STEPPINGS(SKYLAKE, X86_STEPPING_ANY, SRBDS), - VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x0, 0xC), SRBDS), - VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x0, 0xD), SRBDS), + VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x9, 0xC), SRBDS | MMIO), + VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x0, 0x8), SRBDS), + VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x9, 0xD), SRBDS | MMIO), + VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x0, 0x8), SRBDS), + VULNBL_INTEL_STEPPINGS(ICELAKE_L, X86_STEPPINGS(0x5, 0x5), MMIO), + VULNBL_INTEL_STEPPINGS(ICELAKE_D, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ICELAKE_X, X86_STEPPINGS(0x4, 0x6), MMIO), + VULNBL_INTEL_STEPPINGS(COMETLAKE, BIT(2) | BIT(3) | BIT(5), MMIO), + VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x0, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(LAKEFIELD, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ROCKETLAKE, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_D, X86_STEPPING_ANY, MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_L, X86_STEPPINGS(0x0, 0x0), MMIO), {} }; @@ -1269,6 +1290,13 @@ u64 x86_read_arch_cap_msr(void) return ia32_cap; } +static bool arch_cap_mmio_immune(u64 ia32_cap) +{ + return (ia32_cap & ARCH_CAP_FBSDP_NO && + ia32_cap & ARCH_CAP_PSDP_NO && + ia32_cap & ARCH_CAP_SBDR_SSDP_NO); +} + static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) { u64 ia32_cap = x86_read_arch_cap_msr(); @@ -1328,6 +1356,17 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) cpu_matches(cpu_vuln_blacklist, SRBDS)) setup_force_cpu_bug(X86_BUG_SRBDS); + /* + * Processor MMIO Stale Data bug enumeration + * + * Affected CPU list is generally enough to enumerate the vulnerability, + * but for virtualization case check for ARCH_CAP MSR bits also, VMM may + * not want the guest to enumerate the bug. + */ + if (cpu_matches(cpu_vuln_blacklist, MMIO) && + !arch_cap_mmio_immune(ia32_cap)) + setup_force_cpu_bug(X86_BUG_MMIO_STALE_DATA); + if (cpu_matches(cpu_vuln_whitelist, NO_MELTDOWN)) return; diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h index 73e643ae94b6..e17de69faa54 100644 --- a/tools/arch/x86/include/asm/cpufeatures.h +++ b/tools/arch/x86/include/asm/cpufeatures.h @@ -443,5 +443,6 @@ #define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */ #define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */ #define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */ +#define X86_BUG_MMIO_STALE_DATA X86_BUG(25) /* CPU is affected by Processor MMIO Stale Data vulnerabilities */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h index ee15311b6be1..12976405441b 100644 --- a/tools/arch/x86/include/asm/msr-index.h +++ b/tools/arch/x86/include/asm/msr-index.h @@ -114,6 +114,25 @@ * Not susceptible to * TSX Async Abort (TAA) vulnerabilities. */ +#define ARCH_CAP_SBDR_SSDP_NO BIT(13) /* + * Not susceptible to SBDR and SSDP + * variants of Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FBSDP_NO BIT(14) /* + * Not susceptible to FBSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_PSDP_NO BIT(15) /* + * Not susceptible to PSDP variant of + * Processor MMIO stale data + * vulnerabilities. + */ +#define ARCH_CAP_FB_CLEAR BIT(17) /* + * VERW clears CPU fill buffer + * even on MDS_NO CPUs. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* -- cgit From f52ea6c26953fed339aa4eae717ee5c2133c7ff2 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:28:10 -0700 Subject: x86/speculation: Add a common function for MD_CLEAR mitigation update Processor MMIO Stale Data mitigation uses similar mitigation as MDS and TAA. In preparation for adding its mitigation, add a common function to update all mitigations that depend on MD_CLEAR. [ bp: Add a newline in md_clear_update_mitigation() to separate statements better. ] Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 59 ++++++++++++++++++++++++++-------------------- 1 file changed, 33 insertions(+), 26 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 6296e1ebed1d..e05d207e7ec9 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -41,7 +41,7 @@ static void __init spectre_v2_select_mitigation(void); static void __init ssb_select_mitigation(void); static void __init l1tf_select_mitigation(void); static void __init mds_select_mitigation(void); -static void __init mds_print_mitigation(void); +static void __init md_clear_update_mitigation(void); static void __init taa_select_mitigation(void); static void __init srbds_select_mitigation(void); static void __init l1d_flush_select_mitigation(void); @@ -123,10 +123,10 @@ void __init check_bugs(void) l1d_flush_select_mitigation(); /* - * As MDS and TAA mitigations are inter-related, print MDS - * mitigation until after TAA mitigation selection is done. + * As MDS and TAA mitigations are inter-related, update and print their + * mitigation after TAA mitigation selection is done. */ - mds_print_mitigation(); + md_clear_update_mitigation(); arch_smt_update(); @@ -267,14 +267,6 @@ static void __init mds_select_mitigation(void) } } -static void __init mds_print_mitigation(void) -{ - if (!boot_cpu_has_bug(X86_BUG_MDS) || cpu_mitigations_off()) - return; - - pr_info("%s\n", mds_strings[mds_mitigation]); -} - static int __init mds_cmdline(char *str) { if (!boot_cpu_has_bug(X86_BUG_MDS)) @@ -329,7 +321,7 @@ static void __init taa_select_mitigation(void) /* TSX previously disabled by tsx=off */ if (!boot_cpu_has(X86_FEATURE_RTM)) { taa_mitigation = TAA_MITIGATION_TSX_DISABLED; - goto out; + return; } if (cpu_mitigations_off()) { @@ -343,7 +335,7 @@ static void __init taa_select_mitigation(void) */ if (taa_mitigation == TAA_MITIGATION_OFF && mds_mitigation == MDS_MITIGATION_OFF) - goto out; + return; if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) taa_mitigation = TAA_MITIGATION_VERW; @@ -375,18 +367,6 @@ static void __init taa_select_mitigation(void) if (taa_nosmt || cpu_mitigations_auto_nosmt()) cpu_smt_disable(false); - - /* - * Update MDS mitigation, if necessary, as the mds_user_clear is - * now enabled for TAA mitigation. - */ - if (mds_mitigation == MDS_MITIGATION_OFF && - boot_cpu_has_bug(X86_BUG_MDS)) { - mds_mitigation = MDS_MITIGATION_FULL; - mds_select_mitigation(); - } -out: - pr_info("%s\n", taa_strings[taa_mitigation]); } static int __init tsx_async_abort_parse_cmdline(char *str) @@ -410,6 +390,33 @@ static int __init tsx_async_abort_parse_cmdline(char *str) } early_param("tsx_async_abort", tsx_async_abort_parse_cmdline); +#undef pr_fmt +#define pr_fmt(fmt) "" fmt + +static void __init md_clear_update_mitigation(void) +{ + if (cpu_mitigations_off()) + return; + + if (!static_key_enabled(&mds_user_clear)) + goto out; + + /* + * mds_user_clear is now enabled. Update MDS mitigation, if + * necessary. + */ + if (mds_mitigation == MDS_MITIGATION_OFF && + boot_cpu_has_bug(X86_BUG_MDS)) { + mds_mitigation = MDS_MITIGATION_FULL; + mds_select_mitigation(); + } +out: + if (boot_cpu_has_bug(X86_BUG_MDS)) + pr_info("MDS: %s\n", mds_strings[mds_mitigation]); + if (boot_cpu_has_bug(X86_BUG_TAA)) + pr_info("TAA: %s\n", taa_strings[taa_mitigation]); +} + #undef pr_fmt #define pr_fmt(fmt) "SRBDS: " fmt -- cgit From 8cb861e9e3c9a55099ad3d08e1a3b653d29c33ca Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:29:11 -0700 Subject: x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- Documentation/admin-guide/kernel-parameters.txt | 36 ++++++++ arch/x86/include/asm/nospec-branch.h | 2 + arch/x86/kernel/cpu/bugs.c | 111 +++++++++++++++++++++++- arch/x86/kvm/vmx/vmx.c | 3 + 4 files changed, 148 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3f1cc5e317ed..c4893782055b 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3105,6 +3105,7 @@ kvm.nx_huge_pages=off [X86] no_entry_flush [PPC] no_uaccess_flush [PPC] + mmio_stale_data=off [X86] Exceptions: This does not have any effect on @@ -3126,6 +3127,7 @@ Equivalent to: l1tf=flush,nosmt [X86] mds=full,nosmt [X86] tsx_async_abort=full,nosmt [X86] + mmio_stale_data=full,nosmt [X86] mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this @@ -3135,6 +3137,40 @@ log everything. Information is printed at KERN_DEBUG so loglevel=8 may also need to be specified. + mmio_stale_data= + [X86,INTEL] Control mitigation for the Processor + MMIO Stale Data vulnerabilities. + + Processor MMIO Stale Data is a class of + vulnerabilities that may expose data after an MMIO + operation. Exposed data could originate or end in + the same CPU buffers as affected by MDS and TAA. + Therefore, similar to MDS and TAA, the mitigation + is to clear the affected CPU buffers. + + This parameter controls the mitigation. The + options are: + + full - Enable mitigation on vulnerable CPUs + + full,nosmt - Enable mitigation and disable SMT on + vulnerable CPUs. + + off - Unconditionally disable mitigation + + On MDS or TAA affected machines, + mmio_stale_data=off can be prevented by an active + MDS or TAA mitigation as these vulnerabilities are + mitigated with the same mechanism so in order to + disable this mitigation, you need to specify + mds=off and tsx_async_abort=off too. + + Not specifying this option is equivalent to + mmio_stale_data=full. + + For details see: + Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + module.sig_enforce [KNL] When CONFIG_MODULE_SIG is set, this means that modules without (valid) signatures will fail to load. diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index acbaeaf83b61..da251a5645b0 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -269,6 +269,8 @@ DECLARE_STATIC_KEY_FALSE(mds_idle_clear); DECLARE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush); +DECLARE_STATIC_KEY_FALSE(mmio_stale_data_clear); + #include /** diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index e05d207e7ec9..7b01ba9bc701 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -43,6 +43,7 @@ static void __init l1tf_select_mitigation(void); static void __init mds_select_mitigation(void); static void __init md_clear_update_mitigation(void); static void __init taa_select_mitigation(void); +static void __init mmio_select_mitigation(void); static void __init srbds_select_mitigation(void); static void __init l1d_flush_select_mitigation(void); @@ -85,6 +86,10 @@ EXPORT_SYMBOL_GPL(mds_idle_clear); */ DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush); +/* Controls CPU Fill buffer clear before KVM guest MMIO accesses */ +DEFINE_STATIC_KEY_FALSE(mmio_stale_data_clear); +EXPORT_SYMBOL_GPL(mmio_stale_data_clear); + void __init check_bugs(void) { identify_boot_cpu(); @@ -119,12 +124,14 @@ void __init check_bugs(void) l1tf_select_mitigation(); mds_select_mitigation(); taa_select_mitigation(); + mmio_select_mitigation(); srbds_select_mitigation(); l1d_flush_select_mitigation(); /* - * As MDS and TAA mitigations are inter-related, update and print their - * mitigation after TAA mitigation selection is done. + * As MDS, TAA and MMIO Stale Data mitigations are inter-related, update + * and print their mitigation after MDS, TAA and MMIO Stale Data + * mitigation selection is done. */ md_clear_update_mitigation(); @@ -390,6 +397,90 @@ static int __init tsx_async_abort_parse_cmdline(char *str) } early_param("tsx_async_abort", tsx_async_abort_parse_cmdline); +#undef pr_fmt +#define pr_fmt(fmt) "MMIO Stale Data: " fmt + +enum mmio_mitigations { + MMIO_MITIGATION_OFF, + MMIO_MITIGATION_UCODE_NEEDED, + MMIO_MITIGATION_VERW, +}; + +/* Default mitigation for Processor MMIO Stale Data vulnerabilities */ +static enum mmio_mitigations mmio_mitigation __ro_after_init = MMIO_MITIGATION_VERW; +static bool mmio_nosmt __ro_after_init = false; + +static const char * const mmio_strings[] = { + [MMIO_MITIGATION_OFF] = "Vulnerable", + [MMIO_MITIGATION_UCODE_NEEDED] = "Vulnerable: Clear CPU buffers attempted, no microcode", + [MMIO_MITIGATION_VERW] = "Mitigation: Clear CPU buffers", +}; + +static void __init mmio_select_mitigation(void) +{ + u64 ia32_cap; + + if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA) || + cpu_mitigations_off()) { + mmio_mitigation = MMIO_MITIGATION_OFF; + return; + } + + if (mmio_mitigation == MMIO_MITIGATION_OFF) + return; + + ia32_cap = x86_read_arch_cap_msr(); + + /* + * Enable CPU buffer clear mitigation for host and VMM, if also affected + * by MDS or TAA. Otherwise, enable mitigation for VMM only. + */ + if (boot_cpu_has_bug(X86_BUG_MDS) || (boot_cpu_has_bug(X86_BUG_TAA) && + boot_cpu_has(X86_FEATURE_RTM))) + static_branch_enable(&mds_user_clear); + else + static_branch_enable(&mmio_stale_data_clear); + + /* + * Check if the system has the right microcode. + * + * CPU Fill buffer clear mitigation is enumerated by either an explicit + * FB_CLEAR or by the presence of both MD_CLEAR and L1D_FLUSH on MDS + * affected systems. + */ + if ((ia32_cap & ARCH_CAP_FB_CLEAR) || + (boot_cpu_has(X86_FEATURE_MD_CLEAR) && + boot_cpu_has(X86_FEATURE_FLUSH_L1D) && + !(ia32_cap & ARCH_CAP_MDS_NO))) + mmio_mitigation = MMIO_MITIGATION_VERW; + else + mmio_mitigation = MMIO_MITIGATION_UCODE_NEEDED; + + if (mmio_nosmt || cpu_mitigations_auto_nosmt()) + cpu_smt_disable(false); +} + +static int __init mmio_stale_data_parse_cmdline(char *str) +{ + if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) + return 0; + + if (!str) + return -EINVAL; + + if (!strcmp(str, "off")) { + mmio_mitigation = MMIO_MITIGATION_OFF; + } else if (!strcmp(str, "full")) { + mmio_mitigation = MMIO_MITIGATION_VERW; + } else if (!strcmp(str, "full,nosmt")) { + mmio_mitigation = MMIO_MITIGATION_VERW; + mmio_nosmt = true; + } + + return 0; +} +early_param("mmio_stale_data", mmio_stale_data_parse_cmdline); + #undef pr_fmt #define pr_fmt(fmt) "" fmt @@ -402,19 +493,31 @@ static void __init md_clear_update_mitigation(void) goto out; /* - * mds_user_clear is now enabled. Update MDS mitigation, if - * necessary. + * mds_user_clear is now enabled. Update MDS, TAA and MMIO Stale Data + * mitigation, if necessary. */ if (mds_mitigation == MDS_MITIGATION_OFF && boot_cpu_has_bug(X86_BUG_MDS)) { mds_mitigation = MDS_MITIGATION_FULL; mds_select_mitigation(); } + if (taa_mitigation == TAA_MITIGATION_OFF && + boot_cpu_has_bug(X86_BUG_TAA)) { + taa_mitigation = TAA_MITIGATION_VERW; + taa_select_mitigation(); + } + if (mmio_mitigation == MMIO_MITIGATION_OFF && + boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) { + mmio_mitigation = MMIO_MITIGATION_VERW; + mmio_select_mitigation(); + } out: if (boot_cpu_has_bug(X86_BUG_MDS)) pr_info("MDS: %s\n", mds_strings[mds_mitigation]); if (boot_cpu_has_bug(X86_BUG_TAA)) pr_info("TAA: %s\n", taa_strings[taa_mitigation]); + if (boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) + pr_info("MMIO Stale Data: %s\n", mmio_strings[mmio_mitigation]); } #undef pr_fmt diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 610355b9ccce..4fa216acadce 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6773,6 +6773,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, vmx_l1d_flush(vcpu); else if (static_branch_unlikely(&mds_user_clear)) mds_clear_cpu_buffers(); + else if (static_branch_unlikely(&mmio_stale_data_clear) && + kvm_arch_has_assigned_device(vcpu->kvm)) + mds_clear_cpu_buffers(); if (vcpu->arch.cr2 != native_read_cr2()) native_write_cr2(vcpu->arch.cr2); -- cgit From e5925fb867290ee924fcf2fe3ca887b792714366 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:30:12 -0700 Subject: x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations MDS, TAA and Processor MMIO Stale Data mitigations rely on clearing CPU buffers. Moreover, status of these mitigations affects each other. During boot, it is important to maintain the order in which these mitigations are selected. This is especially true for md_clear_update_mitigation() that needs to be called after MDS, TAA and Processor MMIO Stale Data mitigation selection is done. Introduce md_clear_select_mitigation(), and select all these mitigations from there. This reflects relationships between these mitigations and ensures proper ordering. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 7b01ba9bc701..d2cc7dbba5e2 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -42,6 +42,7 @@ static void __init ssb_select_mitigation(void); static void __init l1tf_select_mitigation(void); static void __init mds_select_mitigation(void); static void __init md_clear_update_mitigation(void); +static void __init md_clear_select_mitigation(void); static void __init taa_select_mitigation(void); static void __init mmio_select_mitigation(void); static void __init srbds_select_mitigation(void); @@ -122,19 +123,10 @@ void __init check_bugs(void) spectre_v2_select_mitigation(); ssb_select_mitigation(); l1tf_select_mitigation(); - mds_select_mitigation(); - taa_select_mitigation(); - mmio_select_mitigation(); + md_clear_select_mitigation(); srbds_select_mitigation(); l1d_flush_select_mitigation(); - /* - * As MDS, TAA and MMIO Stale Data mitigations are inter-related, update - * and print their mitigation after MDS, TAA and MMIO Stale Data - * mitigation selection is done. - */ - md_clear_update_mitigation(); - arch_smt_update(); #ifdef CONFIG_X86_32 @@ -520,6 +512,20 @@ out: pr_info("MMIO Stale Data: %s\n", mmio_strings[mmio_mitigation]); } +static void __init md_clear_select_mitigation(void) +{ + mds_select_mitigation(); + taa_select_mitigation(); + mmio_select_mitigation(); + + /* + * As MDS, TAA and MMIO Stale Data mitigations are inter-related, update + * and print their mitigation after MDS, TAA and MMIO Stale Data + * mitigation selection is done. + */ + md_clear_update_mitigation(); +} + #undef pr_fmt #define pr_fmt(fmt) "SRBDS: " fmt -- cgit From 99a83db5a605137424e1efe29dc0573d6a5b6316 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:31:12 -0700 Subject: x86/speculation/mmio: Enable CPU Fill buffer clearing on idle When the CPU is affected by Processor MMIO Stale Data vulnerabilities, Fill Buffer Stale Data Propagator (FBSDP) can propagate stale data out of Fill buffer to uncore buffer when CPU goes idle. Stale data can then be exploited with other variants using MMIO operations. Mitigate it by clearing the Fill buffer before entering idle state. Signed-off-by: Pawan Gupta Co-developed-by: Josh Poimboeuf Signed-off-by: Josh Poimboeuf Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index d2cc7dbba5e2..56d5dea5e128 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -433,6 +433,14 @@ static void __init mmio_select_mitigation(void) else static_branch_enable(&mmio_stale_data_clear); + /* + * If Processor-MMIO-Stale-Data bug is present and Fill Buffer data can + * be propagated to uncore buffers, clearing the Fill buffers on idle + * is required irrespective of SMT state. + */ + if (!(ia32_cap & ARCH_CAP_FBSDP_NO)) + static_branch_enable(&mds_idle_clear); + /* * Check if the system has the right microcode. * @@ -1225,6 +1233,8 @@ static void update_indir_branch_cond(void) /* Update the static key controlling the MDS CPU buffer clear in idle */ static void update_mds_branch_idle(void) { + u64 ia32_cap = x86_read_arch_cap_msr(); + /* * Enable the idle clearing if SMT is active on CPUs which are * affected only by MSBDS and not any other MDS variant. @@ -1236,10 +1246,12 @@ static void update_mds_branch_idle(void) if (!boot_cpu_has_bug(X86_BUG_MSBDS_ONLY)) return; - if (sched_smt_active()) + if (sched_smt_active()) { static_branch_enable(&mds_idle_clear); - else + } else if (mmio_mitigation == MMIO_MITIGATION_OFF || + (ia32_cap & ARCH_CAP_FBSDP_NO)) { static_branch_disable(&mds_idle_clear); + } } #define MDS_MSG_SMT "MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.\n" -- cgit From 8d50cdf8b8341770bc6367bce40c0c1bb0e1d5b3 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:32:13 -0700 Subject: x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data Add the sysfs reporting file for Processor MMIO Stale Data vulnerability. It exposes the vulnerability and mitigation state similar to the existing files for the other hardware vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- Documentation/ABI/testing/sysfs-devices-system-cpu | 1 + arch/x86/kernel/cpu/bugs.c | 22 ++++++++++++++++++++++ drivers/base/cpu.c | 8 ++++++++ include/linux/cpu.h | 3 +++ 4 files changed, 34 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2ad01cad7f1c..bcc974d276dc 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -526,6 +526,7 @@ What: /sys/devices/system/cpu/vulnerabilities /sys/devices/system/cpu/vulnerabilities/srbds /sys/devices/system/cpu/vulnerabilities/tsx_async_abort /sys/devices/system/cpu/vulnerabilities/itlb_multihit + /sys/devices/system/cpu/vulnerabilities/mmio_stale_data Date: January 2018 Contact: Linux kernel mailing list Description: Information about CPU vulnerabilities diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 56d5dea5e128..38853077ca58 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1902,6 +1902,20 @@ static ssize_t tsx_async_abort_show_state(char *buf) sched_smt_active() ? "vulnerable" : "disabled"); } +static ssize_t mmio_stale_data_show_state(char *buf) +{ + if (mmio_mitigation == MMIO_MITIGATION_OFF) + return sysfs_emit(buf, "%s\n", mmio_strings[mmio_mitigation]); + + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) { + return sysfs_emit(buf, "%s; SMT Host state unknown\n", + mmio_strings[mmio_mitigation]); + } + + return sysfs_emit(buf, "%s; SMT %s\n", mmio_strings[mmio_mitigation], + sched_smt_active() ? "vulnerable" : "disabled"); +} + static char *stibp_state(void) { if (spectre_v2_in_eibrs_mode(spectre_v2_enabled)) @@ -2002,6 +2016,9 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr case X86_BUG_SRBDS: return srbds_show_state(buf); + case X86_BUG_MMIO_STALE_DATA: + return mmio_stale_data_show_state(buf); + default: break; } @@ -2053,4 +2070,9 @@ ssize_t cpu_show_srbds(struct device *dev, struct device_attribute *attr, char * { return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS); } + +ssize_t cpu_show_mmio_stale_data(struct device *dev, struct device_attribute *attr, char *buf) +{ + return cpu_show_common(dev, attr, buf, X86_BUG_MMIO_STALE_DATA); +} #endif diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 2ef23fce0860..a97776ea9d99 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -564,6 +564,12 @@ ssize_t __weak cpu_show_srbds(struct device *dev, return sysfs_emit(buf, "Not affected\n"); } +ssize_t __weak cpu_show_mmio_stale_data(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "Not affected\n"); +} + static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL); static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL); static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL); @@ -573,6 +579,7 @@ static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL); static DEVICE_ATTR(tsx_async_abort, 0444, cpu_show_tsx_async_abort, NULL); static DEVICE_ATTR(itlb_multihit, 0444, cpu_show_itlb_multihit, NULL); static DEVICE_ATTR(srbds, 0444, cpu_show_srbds, NULL); +static DEVICE_ATTR(mmio_stale_data, 0444, cpu_show_mmio_stale_data, NULL); static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_meltdown.attr, @@ -584,6 +591,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = { &dev_attr_tsx_async_abort.attr, &dev_attr_itlb_multihit.attr, &dev_attr_srbds.attr, + &dev_attr_mmio_stale_data.attr, NULL }; diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 54dc2f9a2d56..2c7477354744 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -65,6 +65,9 @@ extern ssize_t cpu_show_tsx_async_abort(struct device *dev, extern ssize_t cpu_show_itlb_multihit(struct device *dev, struct device_attribute *attr, char *buf); extern ssize_t cpu_show_srbds(struct device *dev, struct device_attribute *attr, char *buf); +extern ssize_t cpu_show_mmio_stale_data(struct device *dev, + struct device_attribute *attr, + char *buf); extern __printf(4, 5) struct device *cpu_device_create(struct device *parent, void *drvdata, -- cgit From 22cac9c677c95f3ac5c9244f8ca0afdc7c8afb19 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:33:13 -0700 Subject: x86/speculation/srbds: Update SRBDS mitigation selection Currently, Linux disables SRBDS mitigation on CPUs not affected by MDS and have the TSX feature disabled. On such CPUs, secrets cannot be extracted from CPU fill buffers using MDS or TAA. Without SRBDS mitigation, Processor MMIO Stale Data vulnerabilities can be used to extract RDRAND, RDSEED, and EGETKEY data. Do not disable SRBDS mitigation by default when CPU is also affected by Processor MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/bugs.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 38853077ca58..ef4749097f42 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -595,11 +595,13 @@ static void __init srbds_select_mitigation(void) return; /* - * Check to see if this is one of the MDS_NO systems supporting - * TSX that are only exposed to SRBDS when TSX is enabled. + * Check to see if this is one of the MDS_NO systems supporting TSX that + * are only exposed to SRBDS when TSX is enabled or when CPU is affected + * by Processor MMIO Stale Data vulnerability. */ ia32_cap = x86_read_arch_cap_msr(); - if ((ia32_cap & ARCH_CAP_MDS_NO) && !boot_cpu_has(X86_FEATURE_RTM)) + if ((ia32_cap & ARCH_CAP_MDS_NO) && !boot_cpu_has(X86_FEATURE_RTM) && + !boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) srbds_mitigation = SRBDS_MITIGATION_TSX_OFF; else if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) srbds_mitigation = SRBDS_MITIGATION_HYPERVISOR; -- cgit From a992b8a4682f119ae035a01b40d4d0665c4a2875 Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:34:14 -0700 Subject: x86/speculation/mmio: Reuse SRBDS mitigation for SBDS The Shared Buffers Data Sampling (SBDS) variant of Processor MMIO Stale Data vulnerabilities may expose RDRAND, RDSEED and SGX EGETKEY data. Mitigation for this is added by a microcode update. As some of the implications of SBDS are similar to SRBDS, SRBDS mitigation infrastructure can be leveraged by SBDS. Set X86_BUG_SRBDS and use SRBDS mitigation. Mitigation is enabled by default; use srbds=off to opt-out. Mitigation status can be checked from below file: /sys/devices/system/cpu/vulnerabilities/srbds Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/kernel/cpu/common.c | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index f7757409e133..af5d0c188f7b 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1239,6 +1239,8 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = { #define SRBDS BIT(0) /* CPU is affected by X86_BUG_MMIO_STALE_DATA */ #define MMIO BIT(1) +/* CPU is affected by Shared Buffers Data Sampling (SBDS), a variant of X86_BUG_MMIO_STALE_DATA */ +#define MMIO_SBDS BIT(2) static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = { VULNBL_INTEL_STEPPINGS(IVYBRIDGE, X86_STEPPING_ANY, SRBDS), @@ -1260,16 +1262,17 @@ static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = { VULNBL_INTEL_STEPPINGS(KABYLAKE_L, X86_STEPPINGS(0x0, 0x8), SRBDS), VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x9, 0xD), SRBDS | MMIO), VULNBL_INTEL_STEPPINGS(KABYLAKE, X86_STEPPINGS(0x0, 0x8), SRBDS), - VULNBL_INTEL_STEPPINGS(ICELAKE_L, X86_STEPPINGS(0x5, 0x5), MMIO), + VULNBL_INTEL_STEPPINGS(ICELAKE_L, X86_STEPPINGS(0x5, 0x5), MMIO | MMIO_SBDS), VULNBL_INTEL_STEPPINGS(ICELAKE_D, X86_STEPPINGS(0x1, 0x1), MMIO), VULNBL_INTEL_STEPPINGS(ICELAKE_X, X86_STEPPINGS(0x4, 0x6), MMIO), - VULNBL_INTEL_STEPPINGS(COMETLAKE, BIT(2) | BIT(3) | BIT(5), MMIO), - VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x0, 0x1), MMIO), - VULNBL_INTEL_STEPPINGS(LAKEFIELD, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(COMETLAKE, BIT(2) | BIT(3) | BIT(5), MMIO | MMIO_SBDS), + VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x1, 0x1), MMIO | MMIO_SBDS), + VULNBL_INTEL_STEPPINGS(COMETLAKE_L, X86_STEPPINGS(0x0, 0x0), MMIO), + VULNBL_INTEL_STEPPINGS(LAKEFIELD, X86_STEPPINGS(0x1, 0x1), MMIO | MMIO_SBDS), VULNBL_INTEL_STEPPINGS(ROCKETLAKE, X86_STEPPINGS(0x1, 0x1), MMIO), - VULNBL_INTEL_STEPPINGS(ATOM_TREMONT, X86_STEPPINGS(0x1, 0x1), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT, X86_STEPPINGS(0x1, 0x1), MMIO | MMIO_SBDS), VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_D, X86_STEPPING_ANY, MMIO), - VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_L, X86_STEPPINGS(0x0, 0x0), MMIO), + VULNBL_INTEL_STEPPINGS(ATOM_TREMONT_L, X86_STEPPINGS(0x0, 0x0), MMIO | MMIO_SBDS), {} }; @@ -1350,10 +1353,14 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) /* * SRBDS affects CPUs which support RDRAND or RDSEED and are listed * in the vulnerability blacklist. + * + * Some of the implications and mitigation of Shared Buffers Data + * Sampling (SBDS) are similar to SRBDS. Give SBDS same treatment as + * SRBDS. */ if ((cpu_has(c, X86_FEATURE_RDRAND) || cpu_has(c, X86_FEATURE_RDSEED)) && - cpu_matches(cpu_vuln_blacklist, SRBDS)) + cpu_matches(cpu_vuln_blacklist, SRBDS | MMIO_SBDS)) setup_force_cpu_bug(X86_BUG_SRBDS); /* -- cgit From 027bbb884be006b05d9c577d6401686053aa789e Mon Sep 17 00:00:00 2001 From: Pawan Gupta Date: Thu, 19 May 2022 20:35:15 -0700 Subject: KVM: x86/speculation: Disable Fill buffer clear within guests The enumeration of MD_CLEAR in CPUID(EAX=7,ECX=0).EDX{bit 10} is not an accurate indicator on all CPUs of whether the VERW instruction will overwrite fill buffers. FB_CLEAR enumeration in IA32_ARCH_CAPABILITIES{bit 17} covers the case of CPUs that are not vulnerable to MDS/TAA, indicating that microcode does overwrite fill buffers. Guests running in VMM environments may not be aware of all the capabilities/vulnerabilities of the host CPU. Specifically, a guest may apply MDS/TAA mitigations when a virtual CPU is enumerated as vulnerable to MDS/TAA even when the physical CPU is not. On CPUs that enumerate FB_CLEAR_CTRL the VMM may set FB_CLEAR_DIS to skip overwriting of fill buffers by the VERW instruction. This is done by setting FB_CLEAR_DIS during VMENTER and resetting on VMEXIT. For guests that enumerate FB_CLEAR (explicitly asking for fill buffer clear capability) the VMM will not use FB_CLEAR_DIS. Irrespective of guest state, host overwrites CPU buffers before VMENTER to protect itself from an MMIO capable guest, as part of mitigation for MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta Signed-off-by: Borislav Petkov --- arch/x86/include/asm/msr-index.h | 6 +++ arch/x86/kvm/vmx/vmx.c | 69 ++++++++++++++++++++++++++++++++++ arch/x86/kvm/vmx/vmx.h | 2 + arch/x86/kvm/x86.c | 3 ++ tools/arch/x86/include/asm/msr-index.h | 6 +++ 5 files changed, 86 insertions(+) diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 12976405441b..4425d6773183 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -133,6 +133,11 @@ * VERW clears CPU fill buffer * even on MDS_NO CPUs. */ +#define ARCH_CAP_FB_CLEAR_CTRL BIT(18) /* + * MSR_IA32_MCU_OPT_CTRL[FB_CLEAR_DIS] + * bit available to control VERW + * behavior. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* @@ -150,6 +155,7 @@ #define MSR_IA32_MCU_OPT_CTRL 0x00000123 #define RNGDS_MITG_DIS BIT(0) /* SRBDS support */ #define RTM_ALLOW BIT(1) /* TSX development mode */ +#define FB_CLEAR_DIS BIT(3) /* CPU Fill buffer clear disable */ #define MSR_IA32_SYSENTER_CS 0x00000174 #define MSR_IA32_SYSENTER_ESP 0x00000175 diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 4fa216acadce..6e8fb36bc49a 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -229,6 +229,9 @@ static const struct { #define L1D_CACHE_ORDER 4 static void *vmx_l1d_flush_pages; +/* Control for disabling CPU Fill buffer clear */ +static bool __read_mostly vmx_fb_clear_ctrl_available; + static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf) { struct page *page; @@ -360,6 +363,60 @@ static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp) return sprintf(s, "%s\n", vmentry_l1d_param[l1tf_vmx_mitigation].option); } +static void vmx_setup_fb_clear_ctrl(void) +{ + u64 msr; + + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) && + !boot_cpu_has_bug(X86_BUG_MDS) && + !boot_cpu_has_bug(X86_BUG_TAA)) { + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr); + if (msr & ARCH_CAP_FB_CLEAR_CTRL) + vmx_fb_clear_ctrl_available = true; + } +} + +static __always_inline void vmx_disable_fb_clear(struct vcpu_vmx *vmx) +{ + u64 msr; + + if (!vmx->disable_fb_clear) + return; + + rdmsrl(MSR_IA32_MCU_OPT_CTRL, msr); + msr |= FB_CLEAR_DIS; + wrmsrl(MSR_IA32_MCU_OPT_CTRL, msr); + /* Cache the MSR value to avoid reading it later */ + vmx->msr_ia32_mcu_opt_ctrl = msr; +} + +static __always_inline void vmx_enable_fb_clear(struct vcpu_vmx *vmx) +{ + if (!vmx->disable_fb_clear) + return; + + vmx->msr_ia32_mcu_opt_ctrl &= ~FB_CLEAR_DIS; + wrmsrl(MSR_IA32_MCU_OPT_CTRL, vmx->msr_ia32_mcu_opt_ctrl); +} + +static void vmx_update_fb_clear_dis(struct kvm_vcpu *vcpu, struct vcpu_vmx *vmx) +{ + vmx->disable_fb_clear = vmx_fb_clear_ctrl_available; + + /* + * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS + * at VMEntry. Skip the MSR read/write when a guest has no use case to + * execute VERW. + */ + if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) || + ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) && + (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO))) + vmx->disable_fb_clear = false; +} + static const struct kernel_param_ops vmentry_l1d_flush_ops = { .set = vmentry_l1d_flush_set, .get = vmentry_l1d_flush_get, @@ -2252,6 +2309,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) ret = kvm_set_msr_common(vcpu, msr_info); } + /* FB_CLEAR may have changed, also update the FB_CLEAR_DIS behavior */ + if (msr_index == MSR_IA32_ARCH_CAPABILITIES) + vmx_update_fb_clear_dis(vcpu, vmx); + return ret; } @@ -4553,6 +4614,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); vpid_sync_context(vmx->vpid); + + vmx_update_fb_clear_dis(vcpu, vmx); } static void vmx_enable_irq_window(struct kvm_vcpu *vcpu) @@ -6777,6 +6840,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, kvm_arch_has_assigned_device(vcpu->kvm)) mds_clear_cpu_buffers(); + vmx_disable_fb_clear(vmx); + if (vcpu->arch.cr2 != native_read_cr2()) native_write_cr2(vcpu->arch.cr2); @@ -6785,6 +6850,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, vcpu->arch.cr2 = native_read_cr2(); + vmx_enable_fb_clear(vmx); + guest_state_exit_irqoff(); } @@ -8185,6 +8252,8 @@ static int __init vmx_init(void) return r; } + vmx_setup_fb_clear_ctrl(); + for_each_possible_cpu(cpu) { INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu)); diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index b98c7e96697a..8d2342ede0c5 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -348,6 +348,8 @@ struct vcpu_vmx { u64 msr_ia32_feature_control_valid_bits; /* SGX Launch Control public key hash */ u64 msr_ia32_sgxlepubkeyhash[4]; + u64 msr_ia32_mcu_opt_ctrl; + bool disable_fb_clear; struct pt_desc pt_desc; struct lbr_desc lbr_desc; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4790f0d7d40b..44b72caf2e0b 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1587,6 +1587,9 @@ static u64 kvm_get_arch_capabilities(void) */ } + /* Guests don't need to know "Fill buffer clear control" exists */ + data &= ~ARCH_CAP_FB_CLEAR_CTRL; + return data; } diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h index 12976405441b..4425d6773183 100644 --- a/tools/arch/x86/include/asm/msr-index.h +++ b/tools/arch/x86/include/asm/msr-index.h @@ -133,6 +133,11 @@ * VERW clears CPU fill buffer * even on MDS_NO CPUs. */ +#define ARCH_CAP_FB_CLEAR_CTRL BIT(18) /* + * MSR_IA32_MCU_OPT_CTRL[FB_CLEAR_DIS] + * bit available to control VERW + * behavior. + */ #define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* @@ -150,6 +155,7 @@ #define MSR_IA32_MCU_OPT_CTRL 0x00000123 #define RNGDS_MITG_DIS BIT(0) /* SRBDS support */ #define RTM_ALLOW BIT(1) /* TSX development mode */ +#define FB_CLEAR_DIS BIT(3) /* CPU Fill buffer clear disable */ #define MSR_IA32_SYSENTER_CS 0x00000174 #define MSR_IA32_SYSENTER_ESP 0x00000175 -- cgit From 9ff9f77f34e44a0054eadb9041e459548c955ccb Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Tue, 24 May 2022 06:31:54 -0400 Subject: MAINTAINERS: reciprocal co-maintainership for file locking and nfsd Chuck has agreed to backstop me as maintainer of the file locking code, and I'll do the same for him on knfsd. Signed-off-by: Jeff Layton Signed-off-by: Chuck Lever --- MAINTAINERS | 2 ++ 1 file changed, 2 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index d6d879cb0afd..82f89b035cce 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7572,6 +7572,7 @@ F: include/uapi/scsi/fc/ FILE LOCKING (flock() and fcntl()/lockf()) M: Jeff Layton +M: Chuck Lever L: linux-fsdevel@vger.kernel.org S: Maintained F: fs/fcntl.c @@ -10646,6 +10647,7 @@ W: http://kernelnewbies.org/KernelJanitors KERNEL NFSD, SUNRPC, AND LOCKD SERVERS M: Chuck Lever +M: Jeff Layton L: linux-nfs@vger.kernel.org S: Supported W: http://nfs.sourceforge.net/ -- cgit From 1dc6ff02c8bf77d71b9b5d11cbc9df77cfb28626 Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Mon, 23 May 2022 09:11:49 -0700 Subject: x86/speculation/mmio: Print SMT warning Similar to MDS and TAA, print a warning if SMT is enabled for the MMIO Stale Data vulnerability. Signed-off-by: Josh Poimboeuf Signed-off-by: Thomas Gleixner --- arch/x86/kernel/cpu/bugs.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index ef4749097f42..a8a9f6406331 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1258,6 +1258,7 @@ static void update_mds_branch_idle(void) #define MDS_MSG_SMT "MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.\n" #define TAA_MSG_SMT "TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.\n" +#define MMIO_MSG_SMT "MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.\n" void cpu_bugs_smt_update(void) { @@ -1302,6 +1303,16 @@ void cpu_bugs_smt_update(void) break; } + switch (mmio_mitigation) { + case MMIO_MITIGATION_VERW: + case MMIO_MITIGATION_UCODE_NEEDED: + if (sched_smt_active()) + pr_warn_once(MMIO_MSG_SMT); + break; + case MMIO_MITIGATION_OFF: + break; + } + mutex_unlock(&spec_ctrl_mutex); } -- cgit From b6c71c66b0ad8f2b59d9bc08c7a5079b110bec01 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 31 May 2022 19:49:01 -0400 Subject: NFSD: Fix potential use-after-free in nfsd_file_put() nfsd_file_put_noref() can free @nf, so don't dereference @nf immediately upon return from nfsd_file_put_noref(). Suggested-by: Trond Myklebust Fixes: 999397926ab3 ("nfsd: Clean up nfsd_file_put()") Signed-off-by: Chuck Lever --- fs/nfsd/filecache.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index d32fcd8ad457..148b25a43caf 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -308,11 +308,12 @@ nfsd_file_put(struct nfsd_file *nf) if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags) == 0) { nfsd_file_flush(nf); nfsd_file_put_noref(nf); - } else { + } else if (nf->nf_file) { nfsd_file_put_noref(nf); - if (nf->nf_file) - nfsd_file_schedule_laundrette(); - } + nfsd_file_schedule_laundrette(); + } else + nfsd_file_put_noref(nf); + if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT) nfsd_file_gc(); } -- cgit From f012e95b377c73c0283f009823c633104dedb337 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Wed, 1 Jun 2022 12:46:52 -0400 Subject: SUNRPC: Trap RDMA segment overflows Prevent svc_rdma_build_writes() from walking off the end of a Write chunk's segment array. Caught with KASAN. The test that this fix replaces is invalid, and might have been left over from an earlier prototype of the PCL work. Fixes: 7a1cbfa18059 ("svcrdma: Use parsed chunk lists to construct RDMA Writes") Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/svc_rdma_rw.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c index 5f0155fdefc7..11cf7c646644 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c @@ -478,10 +478,10 @@ svc_rdma_build_writes(struct svc_rdma_write_info *info, unsigned int write_len; u64 offset; - seg = &info->wi_chunk->ch_segments[info->wi_seg_no]; - if (!seg) + if (info->wi_seg_no >= info->wi_chunk->ch_segcount) goto out_overflow; + seg = &info->wi_chunk->ch_segments[info->wi_seg_no]; write_len = min(remaining, seg->rs_length - info->wi_seg_off); if (!write_len) goto out_overflow; -- cgit From c36ee7dab7749f7be21f7a72392744490b2a9a2b Mon Sep 17 00:00:00 2001 From: Paulo Alcantara Date: Sun, 5 Jun 2022 19:54:26 -0300 Subject: cifs: fix reconnect on smb3 mount types cifs.ko defines two file system types: cifs & smb3, and __cifs_get_super() was not including smb3 file system type when looking up superblocks, therefore failing to reconnect tcons in cifs_tree_connect(). Fix this by calling iterate_supers_type() on both file system types. Link: https://lore.kernel.org/r/CAFrh3J9soC36+BVuwHB=g9z_KB5Og2+p2_W+BBoBOZveErz14w@mail.gmail.com Cc: stable@vger.kernel.org Tested-by: Satadru Pramanik Reported-by: Satadru Pramanik Signed-off-by: Paulo Alcantara (SUSE) Signed-off-by: Steve French --- fs/cifs/cifsfs.c | 2 +- fs/cifs/cifsfs.h | 2 +- fs/cifs/misc.c | 27 ++++++++++++++++----------- 3 files changed, 18 insertions(+), 13 deletions(-) diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index 12c872800326..325423180fd2 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -1086,7 +1086,7 @@ struct file_system_type cifs_fs_type = { }; MODULE_ALIAS_FS("cifs"); -static struct file_system_type smb3_fs_type = { +struct file_system_type smb3_fs_type = { .owner = THIS_MODULE, .name = "smb3", .init_fs_context = smb3_init_fs_context, diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h index dd7e070ca243..b17be47a8e59 100644 --- a/fs/cifs/cifsfs.h +++ b/fs/cifs/cifsfs.h @@ -38,7 +38,7 @@ static inline unsigned long cifs_get_time(struct dentry *dentry) return (unsigned long) dentry->d_fsdata; } -extern struct file_system_type cifs_fs_type; +extern struct file_system_type cifs_fs_type, smb3_fs_type; extern const struct address_space_operations cifs_addr_ops; extern const struct address_space_operations cifs_addr_ops_smallbuf; diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c index 35962a1a23b9..8e67a2d406ab 100644 --- a/fs/cifs/misc.c +++ b/fs/cifs/misc.c @@ -1211,18 +1211,23 @@ static struct super_block *__cifs_get_super(void (*f)(struct super_block *, void .data = data, .sb = NULL, }; + struct file_system_type **fs_type = (struct file_system_type *[]) { + &cifs_fs_type, &smb3_fs_type, NULL, + }; - iterate_supers_type(&cifs_fs_type, f, &sd); - - if (!sd.sb) - return ERR_PTR(-EINVAL); - /* - * Grab an active reference in order to prevent automounts (DFS links) - * of expiring and then freeing up our cifs superblock pointer while - * we're doing failover. - */ - cifs_sb_active(sd.sb); - return sd.sb; + for (; *fs_type; fs_type++) { + iterate_supers_type(*fs_type, f, &sd); + if (sd.sb) { + /* + * Grab an active reference in order to prevent automounts (DFS links) + * of expiring and then freeing up our cifs superblock pointer while + * we're doing failover. + */ + cifs_sb_active(sd.sb); + return sd.sb; + } + } + return ERR_PTR(-EINVAL); } static void __cifs_put_super(struct super_block *sb) -- cgit From 386cbe7f1b152c8476a7d322d39512b1b4259ed5 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 May 2022 20:39:21 +0300 Subject: gpio: crystalcove: make irq_chip immutable Since recently, the kernel is nagging about mutable irq_chips: "not an immutable chip, please consider fixing it!" Drop the unneeded copy, flag it as IRQCHIP_IMMUTABLE, add the new helper functions and call the appropriate gpiolib functions. Signed-off-by: Andy Shevchenko --- drivers/gpio/gpio-crystalcove.c | 40 +++++++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/gpio/gpio-crystalcove.c b/drivers/gpio/gpio-crystalcove.c index b55c74a5e064..cf33041533aa 100644 --- a/drivers/gpio/gpio-crystalcove.c +++ b/drivers/gpio/gpio-crystalcove.c @@ -15,6 +15,7 @@ #include #include #include +#include #define CRYSTALCOVE_GPIO_NUM 16 #define CRYSTALCOVE_VGPIO_NUM 95 @@ -238,34 +239,43 @@ static void crystalcove_bus_sync_unlock(struct irq_data *data) static void crystalcove_irq_unmask(struct irq_data *data) { - struct crystalcove_gpio *cg = - gpiochip_get_data(irq_data_get_irq_chip_data(data)); + struct gpio_chip *gc = irq_data_get_irq_chip_data(data); + struct crystalcove_gpio *cg = gpiochip_get_data(gc); + irq_hw_number_t hwirq = irqd_to_hwirq(data); - if (data->hwirq < CRYSTALCOVE_GPIO_NUM) { - cg->set_irq_mask = false; - cg->update |= UPDATE_IRQ_MASK; - } + if (hwirq >= CRYSTALCOVE_GPIO_NUM) + return; + + gpiochip_enable_irq(gc, hwirq); + + cg->set_irq_mask = false; + cg->update |= UPDATE_IRQ_MASK; } static void crystalcove_irq_mask(struct irq_data *data) { - struct crystalcove_gpio *cg = - gpiochip_get_data(irq_data_get_irq_chip_data(data)); + struct gpio_chip *gc = irq_data_get_irq_chip_data(data); + struct crystalcove_gpio *cg = gpiochip_get_data(gc); + irq_hw_number_t hwirq = irqd_to_hwirq(data); - if (data->hwirq < CRYSTALCOVE_GPIO_NUM) { - cg->set_irq_mask = true; - cg->update |= UPDATE_IRQ_MASK; - } + if (hwirq >= CRYSTALCOVE_GPIO_NUM) + return; + + cg->set_irq_mask = true; + cg->update |= UPDATE_IRQ_MASK; + + gpiochip_disable_irq(gc, hwirq); } -static struct irq_chip crystalcove_irqchip = { +static const struct irq_chip crystalcove_irqchip = { .name = "Crystal Cove", .irq_mask = crystalcove_irq_mask, .irq_unmask = crystalcove_irq_unmask, .irq_set_type = crystalcove_irq_type, .irq_bus_lock = crystalcove_bus_lock, .irq_bus_sync_unlock = crystalcove_bus_sync_unlock, - .flags = IRQCHIP_SKIP_SET_WAKE, + .flags = IRQCHIP_SKIP_SET_WAKE | IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS, }; static irqreturn_t crystalcove_gpio_irq_handler(int irq, void *data) @@ -353,7 +363,7 @@ static int crystalcove_gpio_probe(struct platform_device *pdev) cg->regmap = pmic->regmap; girq = &cg->chip.irq; - girq->chip = &crystalcove_irqchip; + gpio_irq_chip_set_chip(girq, &crystalcove_irqchip); /* This will let us handle the parent IRQ in the driver */ girq->parent_handler = NULL; girq->num_parents = 0; -- cgit From b34d2ad73af3c58dbaf8aa71b7308f17d9863780 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 1 Jun 2022 17:18:02 +0300 Subject: gpio: crystalcove: Use specific type and API for IRQ number Use specific type and API for IRQ number in the callbacks. Signed-off-by: Andy Shevchenko --- drivers/gpio/gpio-crystalcove.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/gpio/gpio-crystalcove.c b/drivers/gpio/gpio-crystalcove.c index cf33041533aa..36870d14323f 100644 --- a/drivers/gpio/gpio-crystalcove.c +++ b/drivers/gpio/gpio-crystalcove.c @@ -188,8 +188,9 @@ static int crystalcove_irq_type(struct irq_data *data, unsigned int type) { struct crystalcove_gpio *cg = gpiochip_get_data(irq_data_get_irq_chip_data(data)); + irq_hw_number_t hwirq = irqd_to_hwirq(data); - if (data->hwirq >= CRYSTALCOVE_GPIO_NUM) + if (hwirq >= CRYSTALCOVE_GPIO_NUM) return 0; switch (type) { @@ -226,12 +227,12 @@ static void crystalcove_bus_sync_unlock(struct irq_data *data) { struct crystalcove_gpio *cg = gpiochip_get_data(irq_data_get_irq_chip_data(data)); - int gpio = data->hwirq; + irq_hw_number_t hwirq = irqd_to_hwirq(data); if (cg->update & UPDATE_IRQ_TYPE) - crystalcove_update_irq_ctrl(cg, gpio); + crystalcove_update_irq_ctrl(cg, hwirq); if (cg->update & UPDATE_IRQ_MASK) - crystalcove_update_irq_mask(cg, gpio); + crystalcove_update_irq_mask(cg, hwirq); cg->update = 0; mutex_unlock(&cg->buslock); -- cgit From 68a12c19e1cb0f3332d3f59e1d5447f2aff97cd7 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 1 Jun 2022 17:22:04 +0300 Subject: gpio: crystalcove: Join function declarations and long lines There is no more hard limit of 80 characters for long lines, so join a few of them for better readability. Signed-off-by: Andy Shevchenko --- drivers/gpio/gpio-crystalcove.c | 21 +++++++-------------- 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/drivers/gpio/gpio-crystalcove.c b/drivers/gpio/gpio-crystalcove.c index 36870d14323f..1ee62cd58582 100644 --- a/drivers/gpio/gpio-crystalcove.c +++ b/drivers/gpio/gpio-crystalcove.c @@ -111,8 +111,7 @@ static inline int to_reg(int gpio, enum ctrl_register reg_type) return reg + gpio % 8; } -static void crystalcove_update_irq_mask(struct crystalcove_gpio *cg, - int gpio) +static void crystalcove_update_irq_mask(struct crystalcove_gpio *cg, int gpio) { u8 mirqs0 = gpio < 8 ? MGPIO0IRQS0 : MGPIO1IRQS0; int mask = BIT(gpio % 8); @@ -141,8 +140,7 @@ static int crystalcove_gpio_dir_in(struct gpio_chip *chip, unsigned int gpio) return regmap_write(cg->regmap, reg, CTLO_INPUT_SET); } -static int crystalcove_gpio_dir_out(struct gpio_chip *chip, unsigned int gpio, - int value) +static int crystalcove_gpio_dir_out(struct gpio_chip *chip, unsigned int gpio, int value) { struct crystalcove_gpio *cg = gpiochip_get_data(chip); int reg = to_reg(gpio, CTRL_OUT); @@ -169,8 +167,7 @@ static int crystalcove_gpio_get(struct gpio_chip *chip, unsigned int gpio) return val & 0x1; } -static void crystalcove_gpio_set(struct gpio_chip *chip, - unsigned int gpio, int value) +static void crystalcove_gpio_set(struct gpio_chip *chip, unsigned int gpio, int value) { struct crystalcove_gpio *cg = gpiochip_get_data(chip); int reg = to_reg(gpio, CTRL_OUT); @@ -186,8 +183,7 @@ static void crystalcove_gpio_set(struct gpio_chip *chip, static int crystalcove_irq_type(struct irq_data *data, unsigned int type) { - struct crystalcove_gpio *cg = - gpiochip_get_data(irq_data_get_irq_chip_data(data)); + struct crystalcove_gpio *cg = gpiochip_get_data(irq_data_get_irq_chip_data(data)); irq_hw_number_t hwirq = irqd_to_hwirq(data); if (hwirq >= CRYSTALCOVE_GPIO_NUM) @@ -217,16 +213,14 @@ static int crystalcove_irq_type(struct irq_data *data, unsigned int type) static void crystalcove_bus_lock(struct irq_data *data) { - struct crystalcove_gpio *cg = - gpiochip_get_data(irq_data_get_irq_chip_data(data)); + struct crystalcove_gpio *cg = gpiochip_get_data(irq_data_get_irq_chip_data(data)); mutex_lock(&cg->buslock); } static void crystalcove_bus_sync_unlock(struct irq_data *data) { - struct crystalcove_gpio *cg = - gpiochip_get_data(irq_data_get_irq_chip_data(data)); + struct crystalcove_gpio *cg = gpiochip_get_data(irq_data_get_irq_chip_data(data)); irq_hw_number_t hwirq = irqd_to_hwirq(data); if (cg->update & UPDATE_IRQ_TYPE) @@ -304,8 +298,7 @@ static irqreturn_t crystalcove_gpio_irq_handler(int irq, void *data) return IRQ_HANDLED; } -static void crystalcove_gpio_dbg_show(struct seq_file *s, - struct gpio_chip *chip) +static void crystalcove_gpio_dbg_show(struct seq_file *s, struct gpio_chip *chip) { struct crystalcove_gpio *cg = gpiochip_get_data(chip); int gpio, offset; -- cgit From 41a18c4918dcd57a49b0d046d9f2d587878de739 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 May 2022 20:39:21 +0300 Subject: gpio: wcove: make irq_chip immutable Since recently, the kernel is nagging about mutable irq_chips: "not an immutable chip, please consider fixing it!" Drop the unneeded copy, flag it as IRQCHIP_IMMUTABLE, add the new helper functions and call the appropriate gpiolib functions. Signed-off-by: Andy Shevchenko Reviewed-by: Kuppuswamy Sathyanarayanan --- drivers/gpio/gpio-wcove.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpio/gpio-wcove.c b/drivers/gpio/gpio-wcove.c index 16a0fae1e32e..c18b6b47384f 100644 --- a/drivers/gpio/gpio-wcove.c +++ b/drivers/gpio/gpio-wcove.c @@ -299,6 +299,8 @@ static void wcove_irq_unmask(struct irq_data *data) if (gpio >= WCOVE_GPIO_NUM) return; + gpiochip_enable_irq(chip, gpio); + wg->set_irq_mask = false; wg->update |= UPDATE_IRQ_MASK; } @@ -314,15 +316,19 @@ static void wcove_irq_mask(struct irq_data *data) wg->set_irq_mask = true; wg->update |= UPDATE_IRQ_MASK; + + gpiochip_disable_irq(chip, gpio); } -static struct irq_chip wcove_irqchip = { +static const struct irq_chip wcove_irqchip = { .name = "Whiskey Cove", .irq_mask = wcove_irq_mask, .irq_unmask = wcove_irq_unmask, .irq_set_type = wcove_irq_type, .irq_bus_lock = wcove_bus_lock, .irq_bus_sync_unlock = wcove_bus_sync_unlock, + .flags = IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS, }; static irqreturn_t wcove_gpio_irq_handler(int irq, void *data) @@ -452,7 +458,7 @@ static int wcove_gpio_probe(struct platform_device *pdev) } girq = &wg->chip.irq; - girq->chip = &wcove_irqchip; + gpio_irq_chip_set_chip(girq, &wcove_irqchip); /* This will let us handle the parent IRQ in the driver */ girq->parent_handler = NULL; girq->num_parents = 0; -- cgit From a80fed9fb643175832e2fb8481d38f5d92cbcd34 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 May 2022 20:39:21 +0300 Subject: gpio: merrifield: make irq_chip immutable Since recently, the kernel is nagging about mutable irq_chips: "not an immutable chip, please consider fixing it!" Drop the unneeded copy, flag it as IRQCHIP_IMMUTABLE, add the new helper functions and call the appropriate gpiolib functions. Signed-off-by: Andy Shevchenko --- drivers/gpio/gpio-merrifield.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/gpio/gpio-merrifield.c b/drivers/gpio/gpio-merrifield.c index f3d1baeacbe9..72ac09a59702 100644 --- a/drivers/gpio/gpio-merrifield.c +++ b/drivers/gpio/gpio-merrifield.c @@ -220,10 +220,8 @@ static void mrfld_irq_ack(struct irq_data *d) raw_spin_unlock_irqrestore(&priv->lock, flags); } -static void mrfld_irq_unmask_mask(struct irq_data *d, bool unmask) +static void mrfld_irq_unmask_mask(struct mrfld_gpio *priv, u32 gpio, bool unmask) { - struct mrfld_gpio *priv = irq_data_get_irq_chip_data(d); - u32 gpio = irqd_to_hwirq(d); void __iomem *gimr = gpio_reg(&priv->chip, gpio, GIMR); unsigned long flags; u32 value; @@ -241,12 +239,20 @@ static void mrfld_irq_unmask_mask(struct irq_data *d, bool unmask) static void mrfld_irq_mask(struct irq_data *d) { - mrfld_irq_unmask_mask(d, false); + struct mrfld_gpio *priv = irq_data_get_irq_chip_data(d); + u32 gpio = irqd_to_hwirq(d); + + mrfld_irq_unmask_mask(priv, gpio, false); + gpiochip_disable_irq(&priv->chip, gpio); } static void mrfld_irq_unmask(struct irq_data *d) { - mrfld_irq_unmask_mask(d, true); + struct mrfld_gpio *priv = irq_data_get_irq_chip_data(d); + u32 gpio = irqd_to_hwirq(d); + + gpiochip_enable_irq(&priv->chip, gpio); + mrfld_irq_unmask_mask(priv, gpio, true); } static int mrfld_irq_set_type(struct irq_data *d, unsigned int type) @@ -329,13 +335,15 @@ static int mrfld_irq_set_wake(struct irq_data *d, unsigned int on) return 0; } -static struct irq_chip mrfld_irqchip = { +static const struct irq_chip mrfld_irqchip = { .name = "gpio-merrifield", .irq_ack = mrfld_irq_ack, .irq_mask = mrfld_irq_mask, .irq_unmask = mrfld_irq_unmask, .irq_set_type = mrfld_irq_set_type, .irq_set_wake = mrfld_irq_set_wake, + .flags = IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS, }; static void mrfld_irq_handler(struct irq_desc *desc) @@ -482,7 +490,7 @@ static int mrfld_gpio_probe(struct pci_dev *pdev, const struct pci_device_id *id return retval; girq = &priv->chip.irq; - girq->chip = &mrfld_irqchip; + gpio_irq_chip_set_chip(girq, &mrfld_irqchip); girq->init_hw = mrfld_irq_init_hw; girq->parent_handler = mrfld_irq_handler; girq->num_parents = 1; -- cgit From f1138dacb7ff5221c4a37b823e42fc0a34df8731 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 1 Jun 2022 18:36:56 +0300 Subject: gpio: sch: make irq_chip immutable Since recently, the kernel is nagging about mutable irq_chips: "not an immutable chip, please consider fixing it!" Drop the unneeded copy, flag it as IRQCHIP_IMMUTABLE, add the new helper functions and call the appropriate gpiolib functions. Signed-off-by: Andy Shevchenko Reviewed-by: Bartosz Golaszewski --- drivers/gpio/gpio-sch.c | 35 ++++++++++++++++++++++------------- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/drivers/gpio/gpio-sch.c b/drivers/gpio/gpio-sch.c index acda4c5052d3..8a83f7bf4382 100644 --- a/drivers/gpio/gpio-sch.c +++ b/drivers/gpio/gpio-sch.c @@ -38,7 +38,6 @@ struct sch_gpio { struct gpio_chip chip; - struct irq_chip irqchip; spinlock_t lock; unsigned short iobase; unsigned short resume_base; @@ -218,11 +217,9 @@ static void sch_irq_ack(struct irq_data *d) spin_unlock_irqrestore(&sch->lock, flags); } -static void sch_irq_mask_unmask(struct irq_data *d, int val) +static void sch_irq_mask_unmask(struct gpio_chip *gc, irq_hw_number_t gpio_num, int val) { - struct gpio_chip *gc = irq_data_get_irq_chip_data(d); struct sch_gpio *sch = gpiochip_get_data(gc); - irq_hw_number_t gpio_num = irqd_to_hwirq(d); unsigned long flags; spin_lock_irqsave(&sch->lock, flags); @@ -232,14 +229,32 @@ static void sch_irq_mask_unmask(struct irq_data *d, int val) static void sch_irq_mask(struct irq_data *d) { - sch_irq_mask_unmask(d, 0); + struct gpio_chip *gc = irq_data_get_irq_chip_data(d); + irq_hw_number_t gpio_num = irqd_to_hwirq(d); + + sch_irq_mask_unmask(gc, gpio_num, 0); + gpiochip_disable_irq(gc, gpio_num); } static void sch_irq_unmask(struct irq_data *d) { - sch_irq_mask_unmask(d, 1); + struct gpio_chip *gc = irq_data_get_irq_chip_data(d); + irq_hw_number_t gpio_num = irqd_to_hwirq(d); + + gpiochip_enable_irq(gc, gpio_num); + sch_irq_mask_unmask(gc, gpio_num, 1); } +static const struct irq_chip sch_irqchip = { + .name = "sch_gpio", + .irq_ack = sch_irq_ack, + .irq_mask = sch_irq_mask, + .irq_unmask = sch_irq_unmask, + .irq_set_type = sch_irq_type, + .flags = IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS, +}; + static u32 sch_gpio_gpe_handler(acpi_handle gpe_device, u32 gpe, void *context) { struct sch_gpio *sch = context; @@ -367,14 +382,8 @@ static int sch_gpio_probe(struct platform_device *pdev) platform_set_drvdata(pdev, sch); - sch->irqchip.name = "sch_gpio"; - sch->irqchip.irq_ack = sch_irq_ack; - sch->irqchip.irq_mask = sch_irq_mask; - sch->irqchip.irq_unmask = sch_irq_unmask; - sch->irqchip.irq_set_type = sch_irq_type; - girq = &sch->chip.irq; - girq->chip = &sch->irqchip; + gpio_irq_chip_set_chip(girq, &sch_irqchip); girq->num_parents = 0; girq->parents = NULL; girq->parent_handler = NULL; -- cgit From b93a8b2c5161696e732185311d309e0aaf0575be Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Thu, 12 May 2022 20:39:21 +0300 Subject: gpio: dln2: make irq_chip immutable Since recently, the kernel is nagging about mutable irq_chips: "not an immutable chip, please consider fixing it!" Drop the unneeded copy, flag it as IRQCHIP_IMMUTABLE, add the new helper functions and call the appropriate gpiolib functions. Signed-off-by: Andy Shevchenko --- drivers/gpio/gpio-dln2.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/drivers/gpio/gpio-dln2.c b/drivers/gpio/gpio-dln2.c index 08b9e2cf4f2d..71fa437b491f 100644 --- a/drivers/gpio/gpio-dln2.c +++ b/drivers/gpio/gpio-dln2.c @@ -46,7 +46,6 @@ struct dln2_gpio { struct platform_device *pdev; struct gpio_chip gpio; - struct irq_chip irqchip; /* * Cache pin direction to save us one transfer, since the hardware has @@ -306,6 +305,7 @@ static void dln2_irq_unmask(struct irq_data *irqd) struct dln2_gpio *dln2 = gpiochip_get_data(gc); int pin = irqd_to_hwirq(irqd); + gpiochip_enable_irq(gc, pin); set_bit(pin, dln2->unmasked_irqs); } @@ -316,6 +316,7 @@ static void dln2_irq_mask(struct irq_data *irqd) int pin = irqd_to_hwirq(irqd); clear_bit(pin, dln2->unmasked_irqs); + gpiochip_disable_irq(gc, pin); } static int dln2_irq_set_type(struct irq_data *irqd, unsigned type) @@ -384,6 +385,17 @@ static void dln2_irq_bus_unlock(struct irq_data *irqd) mutex_unlock(&dln2->irq_lock); } +static const struct irq_chip dln2_irqchip = { + .name = "dln2-irq", + .irq_mask = dln2_irq_mask, + .irq_unmask = dln2_irq_unmask, + .irq_set_type = dln2_irq_set_type, + .irq_bus_lock = dln2_irq_bus_lock, + .irq_bus_sync_unlock = dln2_irq_bus_unlock, + .flags = IRQCHIP_IMMUTABLE, + GPIOCHIP_IRQ_RESOURCE_HELPERS, +}; + static void dln2_gpio_event(struct platform_device *pdev, u16 echo, const void *data, int len) { @@ -465,15 +477,8 @@ static int dln2_gpio_probe(struct platform_device *pdev) dln2->gpio.direction_output = dln2_gpio_direction_output; dln2->gpio.set_config = dln2_gpio_set_config; - dln2->irqchip.name = "dln2-irq", - dln2->irqchip.irq_mask = dln2_irq_mask, - dln2->irqchip.irq_unmask = dln2_irq_unmask, - dln2->irqchip.irq_set_type = dln2_irq_set_type, - dln2->irqchip.irq_bus_lock = dln2_irq_bus_lock, - dln2->irqchip.irq_bus_sync_unlock = dln2_irq_bus_unlock, - girq = &dln2->gpio.irq; - girq->chip = &dln2->irqchip; + gpio_irq_chip_set_chip(girq, &dln2_irqchip); /* The event comes from the outside so no parent handler */ girq->parent_handler = NULL; girq->num_parents = 0; -- cgit From 8ea21823aa584b55ba4b861307093b78054b0c1b Mon Sep 17 00:00:00 2001 From: Shyam Prasad N Date: Tue, 31 May 2022 12:31:05 +0000 Subject: cifs: return errors during session setup during reconnects During reconnects, we check the return value from cifs_negotiate_protocol, and have handlers for both success and failures. But if that passes, and cifs_setup_session returns any errors other than -EACCES, we do not handle that. This fix adds a handler for that, so that we don't go ahead and try a tree_connect on a failed session. Signed-off-by: Shyam Prasad N Reviewed-by: Enzo Matsumiya Cc: stable@vger.kernel.org Signed-off-by: Steve French --- fs/cifs/smb2pdu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index 0e8c85249579..eaf975f1ad89 100644 --- a/fs/cifs/smb2pdu.c +++ b/fs/cifs/smb2pdu.c @@ -288,6 +288,9 @@ smb2_reconnect(__le16 smb2_command, struct cifs_tcon *tcon, mutex_unlock(&ses->session_mutex); rc = -EHOSTDOWN; goto failed; + } else if (rc) { + mutex_unlock(&ses->session_mutex); + goto out; } } else { mutex_unlock(&ses->session_mutex); -- cgit From 77991645952c21962a095910c51fe0f73d35bf91 Mon Sep 17 00:00:00 2001 From: Roger Knecht Date: Sat, 21 May 2022 14:47:45 +0200 Subject: crc-itu-t: fix typo in CRC ITU-T polynomial comment The code comment says that the polynomial is x^16 + x^12 + x^15 + 1, but the correct polynomial is x^16 + x^12 + x^5 + 1. Quoting from page 2 in the ITU-T V.41 specification [1]: 2 Encoding and checking process The service bits and information bits, taken in conjunction, correspond to the coefficients of a message polynomial having terms from x^(n-1) (n = total number of bits in a block or sequence) down to x^16. This polynomial is divided, modulo 2, by the generating polynomial x^16 + x^12 + x^5 + 1. The hex (truncated) polynomial 0x1021 and CRC code implementation are correct, however. [1] https://www.itu.int/rec/T-REC-V.41-198811-I/en Signed-off-by: Roger Knecht Acked-by: Randy Dunlap Signed-off-by: Jason A. Donenfeld --- include/linux/crc-itu-t.h | 2 +- lib/crc-itu-t.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/crc-itu-t.h b/include/linux/crc-itu-t.h index a4367051e192..2f991a427ade 100644 --- a/include/linux/crc-itu-t.h +++ b/include/linux/crc-itu-t.h @@ -4,7 +4,7 @@ * * Implements the standard CRC ITU-T V.41: * Width 16 - * Poly 0x1021 (x^16 + x^12 + x^15 + 1) + * Poly 0x1021 (x^16 + x^12 + x^5 + 1) * Init 0 */ diff --git a/lib/crc-itu-t.c b/lib/crc-itu-t.c index 1974b355c148..1d26a1647da5 100644 --- a/lib/crc-itu-t.c +++ b/lib/crc-itu-t.c @@ -7,7 +7,7 @@ #include #include -/** CRC table for the CRC ITU-T V.41 0x1021 (x^16 + x^12 + x^15 + 1) */ +/* CRC table for the CRC ITU-T V.41 0x1021 (x^16 + x^12 + x^5 + 1) */ const u16 crc_itu_t_table[256] = { 0x0000, 0x1021, 0x2042, 0x3063, 0x4084, 0x50a5, 0x60c6, 0x70e7, 0x8108, 0x9129, 0xa14a, 0xb16b, 0xc18c, 0xd1ad, 0xe1ce, 0xf1ef, -- cgit From d52d165d67c5aa26c8c89909003c94a66492d23d Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 28 May 2022 12:38:11 +0100 Subject: KVM: arm64: Always start with clearing SVE flag on load On each vcpu load, we set the KVM_ARM64_HOST_SVE_ENABLED flag if SVE is enabled for EL0 on the host. This is used to restore the correct state on vpcu put. However, it appears that nothing ever clears this flag. Once set, it will stick until the vcpu is destroyed, which has the potential to spuriously enable SVE for userspace. We probably never saw the issue because no VMM uses SVE, but that's still pretty bad. Unconditionally clearing the flag on vcpu load addresses the issue. Fixes: 8383741ab2e7 ("KVM: arm64: Get rid of host SVE tracking/saving") Signed-off-by: Marc Zyngier Cc: stable@vger.kernel.org Reviewed-by: Mark Brown Link: https://lore.kernel.org/r/20220528113829.1043361-2-maz@kernel.org --- arch/arm64/kvm/fpsimd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index 3d251a4d2cf7..8267ff4642d3 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -80,6 +80,7 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu) vcpu->arch.flags &= ~KVM_ARM64_FP_ENABLED; vcpu->arch.flags |= KVM_ARM64_FP_HOST; + vcpu->arch.flags &= ~KVM_ARM64_HOST_SVE_ENABLED; if (read_sysreg(cpacr_el1) & CPACR_EL1_ZEN_EL0EN) vcpu->arch.flags |= KVM_ARM64_HOST_SVE_ENABLED; -- cgit From 039f49c4cafb785504c678f28664d088e0108d35 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Sat, 28 May 2022 12:38:12 +0100 Subject: KVM: arm64: Always start with clearing SME flag on load On each vcpu load, we set the KVM_ARM64_HOST_SME_ENABLED flag if SME is enabled for EL0 on the host. This is used to restore the correct state on vpcu put. However, it appears that nothing ever clears this flag. Once set, it will stick until the vcpu is destroyed, which has the potential to spuriously enable SME for userspace. As it turns out, this is due to the SME code being more or less copied from SVE, and inheriting the same shortcomings. We never saw the issue because nothing uses SME, and the amount of testing is probably still pretty low. Fixes: 861262ab8627 ("KVM: arm64: Handle SME host state when running guests") Signed-off-by: Marc Zyngier Reviwed-by: Mark Brown Link: https://lore.kernel.org/r/20220528113829.1043361-3-maz@kernel.org --- arch/arm64/kvm/fpsimd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index 8267ff4642d3..6012b08ecb14 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -94,6 +94,7 @@ void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu) * operations. Do this for ZA as well for now for simplicity. */ if (system_supports_sme()) { + vcpu->arch.flags &= ~KVM_ARM64_HOST_SME_ENABLED; if (read_sysreg(cpacr_el1) & CPACR_EL1_SMEN_EL0EN) vcpu->arch.flags |= KVM_ARM64_HOST_SME_ENABLED; -- cgit From e3fe65e0d3671ee5ae8a2723e429ee4830a7c89c Mon Sep 17 00:00:00 2001 From: sunliming Date: Thu, 2 Jun 2022 10:48:05 +0800 Subject: KVM: arm64: Fix inconsistent indenting Fix the following smatch warnings: arch/arm64/kvm/vmid.c:62 flush_context() warn: inconsistent indenting Reported-by: kernel test robot Signed-off-by: sunliming Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220602024805.511457-1-sunliming@kylinos.cn --- arch/arm64/kvm/vmid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vmid.c b/arch/arm64/kvm/vmid.c index 8d5f0506fd87..d78ae63d7c15 100644 --- a/arch/arm64/kvm/vmid.c +++ b/arch/arm64/kvm/vmid.c @@ -66,7 +66,7 @@ static void flush_context(void) * the next context-switch, we broadcast TLB flush + I-cache * invalidation over the inner shareable domain on rollover. */ - kvm_call_hyp(__kvm_flush_vm_context); + kvm_call_hyp(__kvm_flush_vm_context); } static bool check_update_reserved_vmid(u64 vmid, u64 newvmid) -- cgit From 2cdea19a34c2340b3aa69508804efe4e3750fcec Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 7 Jun 2022 14:14:25 +0100 Subject: KVM: arm64: Don't read a HW interrupt pending state in user context Since 5bfa685e62e9 ("KVM: arm64: vgic: Read HW interrupt pending state from the HW"), we're able to source the pending bit for an interrupt that is stored either on the physical distributor or on a device. However, this state is only available when the vcpu is loaded, and is not intended to be accessed from userspace. Unfortunately, the GICv2 emulation doesn't provide specific userspace accessors, and we fallback with the ones that are intended for the guest, with fatal consequences. Add a new vgic_uaccess_read_pending() accessor for userspace to use, build on top of the existing vgic_mmio_read_pending(). Reported-by: Eric Auger Reviewed-by: Eric Auger Tested-by: Eric Auger Signed-off-by: Marc Zyngier Fixes: 5bfa685e62e9 ("KVM: arm64: vgic: Read HW interrupt pending state from the HW") Link: https://lore.kernel.org/r/20220607131427.1164881-2-maz@kernel.org Cc: stable@vger.kernel.org --- arch/arm64/kvm/vgic/vgic-mmio-v2.c | 4 ++-- arch/arm64/kvm/vgic/vgic-mmio.c | 19 ++++++++++++++++--- arch/arm64/kvm/vgic/vgic-mmio.h | 3 +++ 3 files changed, 21 insertions(+), 5 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v2.c b/arch/arm64/kvm/vgic/vgic-mmio-v2.c index 77a67e9d3d14..e070cda86e12 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio-v2.c +++ b/arch/arm64/kvm/vgic/vgic-mmio-v2.c @@ -429,11 +429,11 @@ static const struct vgic_register_region vgic_v2_dist_registers[] = { VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_SET, vgic_mmio_read_pending, vgic_mmio_write_spending, - NULL, vgic_uaccess_write_spending, 1, + vgic_uaccess_read_pending, vgic_uaccess_write_spending, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_CLEAR, vgic_mmio_read_pending, vgic_mmio_write_cpending, - NULL, vgic_uaccess_write_cpending, 1, + vgic_uaccess_read_pending, vgic_uaccess_write_cpending, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ACTIVE_SET, vgic_mmio_read_active, vgic_mmio_write_sactive, diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c index 49837d3a3ef5..dc8c52487e47 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.c +++ b/arch/arm64/kvm/vgic/vgic-mmio.c @@ -226,8 +226,9 @@ int vgic_uaccess_write_cenable(struct kvm_vcpu *vcpu, return 0; } -unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, - gpa_t addr, unsigned int len) +static unsigned long __read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len, + bool is_user) { u32 intid = VGIC_ADDR_TO_INTID(addr, 1); u32 value = 0; @@ -248,7 +249,7 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, IRQCHIP_STATE_PENDING, &val); WARN_RATELIMIT(err, "IRQ %d", irq->host_irq); - } else if (vgic_irq_is_mapped_level(irq)) { + } else if (!is_user && vgic_irq_is_mapped_level(irq)) { val = vgic_get_phys_line_level(irq); } else { val = irq_is_pending(irq); @@ -263,6 +264,18 @@ unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, return value; } +unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len) +{ + return __read_pending(vcpu, addr, len, false); +} + +unsigned long vgic_uaccess_read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len) +{ + return __read_pending(vcpu, addr, len, true); +} + static bool is_vgic_v2_sgi(struct kvm_vcpu *vcpu, struct vgic_irq *irq) { return (vgic_irq_is_sgi(irq->intid) && diff --git a/arch/arm64/kvm/vgic/vgic-mmio.h b/arch/arm64/kvm/vgic/vgic-mmio.h index 3fa696f198a3..6082d4b66d39 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.h +++ b/arch/arm64/kvm/vgic/vgic-mmio.h @@ -149,6 +149,9 @@ int vgic_uaccess_write_cenable(struct kvm_vcpu *vcpu, unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len); +unsigned long vgic_uaccess_read_pending(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len); + void vgic_mmio_write_spending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len, unsigned long val); -- cgit From 7bf179de5b2dfae54a6839eaf7caba44a888ee2e Mon Sep 17 00:00:00 2001 From: Kevin Locke Date: Mon, 6 Jun 2022 20:42:54 -0600 Subject: kbuild: avoid regex RS for POSIX awk In 22f26f21774f8 awk was added to deduplicate *.mod files. The awk invocation passes -v RS='( |\n)' to match a space or newline character as the record separator. Unfortunately, POSIX states[1] > If RS contains more than one character, the results are unspecified. Some implementations (such as the One True Awk[2] used by the BSDs) do not treat RS as a regular expression. When awk does not support regex RS, build failures such as the following are produced (first error using allmodconfig): CC [M] arch/x86/events/intel/uncore.o CC [M] arch/x86/events/intel/uncore_nhmex.o CC [M] arch/x86/events/intel/uncore_snb.o CC [M] arch/x86/events/intel/uncore_snbep.o CC [M] arch/x86/events/intel/uncore_discovery.o LD [M] arch/x86/events/intel/intel-uncore.o ld: cannot find uncore_nhmex.o: No such file or directory ld: cannot find uncore_snb.o: No such file or directory ld: cannot find uncore_snbep.o: No such file or directory ld: cannot find uncore_discovery.o: No such file or directory make[3]: *** [scripts/Makefile.build:422: arch/x86/events/intel/intel-uncore.o] Error 1 make[2]: *** [scripts/Makefile.build:487: arch/x86/events/intel] Error 2 make[1]: *** [scripts/Makefile.build:487: arch/x86/events] Error 2 make: *** [Makefile:1839: arch/x86] Error 2 To avoid this, use printf(1) to produce a newline between each object path, instead of the space produced by echo(1), so that the default RS can be used by awk. [1]: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html [2]: https://github.com/onetrueawk/awk Fixes: 22f26f21774f ("kbuild: get rid of duplication in *.mod files") Signed-off-by: Kevin Locke Signed-off-by: Masahiro Yamada --- scripts/Makefile.build | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 1f01ac65c0cd..cac070aee791 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -251,8 +251,8 @@ $(obj)/%.o: $(src)/%.c $(recordmcount_source) FORCE # To make this rule robust against "Argument list too long" error, # ensure to add $(obj)/ prefix by a shell command. -cmd_mod = echo $(call real-search, $*.o, .o, -objs -y -m) | \ - $(AWK) -v RS='( |\n)' '!x[$$0]++ { print("$(obj)/"$$0) }' > $@ +cmd_mod = printf '%s\n' $(call real-search, $*.o, .o, -objs -y -m) | \ + $(AWK) '!x[$$0]++ { print("$(obj)/"$$0) }' > $@ $(obj)/%.mod: FORCE $(call if_changed,mod) -- cgit From c4f135d643823a869becfa87539f7820ef9d5bfa Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Wed, 1 Jun 2022 16:32:47 +0900 Subject: workqueue: Wrap flush_workqueue() using a macro Since flush operation synchronously waits for completion, flushing system-wide WQs (e.g. system_wq) might introduce possibility of deadlock due to unexpected locking dependency. Tejun Heo commented at [1] that it makes no sense at all to call flush_workqueue() on the shared WQs as the caller has no idea what it's gonna end up waiting for. Although there is flush_scheduled_work() which flushes system_wq WQ with "Think twice before calling this function! It's very easy to get into trouble if you don't take great care." warning message, syzbot found a circular locking dependency caused by flushing system_wq WQ [2]. Therefore, let's change the direction to that developers had better use their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. Steps for converting system-wide WQs into local WQs are explained at [3], and a conversion to stop flushing system-wide WQs is in progress. Now we want some mechanism for preventing developers who are not aware of this conversion from again start flushing system-wide WQs. Since I found that WARN_ON() is complete but awkward approach for teaching developers about this problem, let's use __compiletime_warning() for incomplete but handy approach. For completeness, we will also insert WARN_ON() into __flush_workqueue() after all in-tree users stopped calling flush_scheduled_work(). Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1] Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2] Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3] Signed-off-by: Tetsuo Handa Signed-off-by: Tejun Heo --- include/linux/workqueue.h | 64 +++++++++++++++++++++++++++++++++++++++++------ kernel/workqueue.c | 16 +++++++++--- 2 files changed, 68 insertions(+), 12 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 7fee9b6cfede..e1f1c8b1121b 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -445,7 +445,7 @@ extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay); extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork); -extern void flush_workqueue(struct workqueue_struct *wq); +extern void __flush_workqueue(struct workqueue_struct *wq); extern void drain_workqueue(struct workqueue_struct *wq); extern int schedule_on_each_cpu(work_func_t func); @@ -563,15 +563,23 @@ static inline bool schedule_work(struct work_struct *work) return queue_work(system_wq, work); } +/* + * Detect attempt to flush system-wide workqueues at compile time when possible. + * + * See https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp + * for reasons and steps for converting system-wide workqueues into local workqueues. + */ +extern void __warn_flushing_systemwide_wq(void) + __compiletime_warning("Please avoid flushing system-wide workqueues."); + /** * flush_scheduled_work - ensure that any scheduled work has run to completion. * * Forces execution of the kernel-global workqueue and blocks until its * completion. * - * Think twice before calling this function! It's very easy to get into - * trouble if you don't take great care. Either of the following situations - * will lead to deadlock: + * It's very easy to get into trouble if you don't take great care. + * Either of the following situations will lead to deadlock: * * One of the work items currently on the workqueue needs to acquire * a lock held by your code or its caller. @@ -586,11 +594,51 @@ static inline bool schedule_work(struct work_struct *work) * need to know that a particular work item isn't queued and isn't running. * In such cases you should use cancel_delayed_work_sync() or * cancel_work_sync() instead. + * + * Please stop calling this function! A conversion to stop flushing system-wide + * workqueues is in progress. This function will be removed after all in-tree + * users stopped calling this function. */ -static inline void flush_scheduled_work(void) -{ - flush_workqueue(system_wq); -} +/* + * The background of commit 771c035372a036f8 ("deprecate the + * '__deprecated' attribute warnings entirely and for good") is that, + * since Linus builds all modules between every single pull he does, + * the standard kernel build needs to be _clean_ in order to be able to + * notice when new problems happen. Therefore, don't emit warning while + * there are in-tree users. + */ +#define flush_scheduled_work() \ +({ \ + if (0) \ + __warn_flushing_systemwide_wq(); \ + __flush_workqueue(system_wq); \ +}) + +/* + * Although there is no longer in-tree caller, for now just emit warning + * in order to give out-of-tree callers time to update. + */ +#define flush_workqueue(wq) \ +({ \ + struct workqueue_struct *_wq = (wq); \ + \ + if ((__builtin_constant_p(_wq == system_wq) && \ + _wq == system_wq) || \ + (__builtin_constant_p(_wq == system_highpri_wq) && \ + _wq == system_highpri_wq) || \ + (__builtin_constant_p(_wq == system_long_wq) && \ + _wq == system_long_wq) || \ + (__builtin_constant_p(_wq == system_unbound_wq) && \ + _wq == system_unbound_wq) || \ + (__builtin_constant_p(_wq == system_freezable_wq) && \ + _wq == system_freezable_wq) || \ + (__builtin_constant_p(_wq == system_power_efficient_wq) && \ + _wq == system_power_efficient_wq) || \ + (__builtin_constant_p(_wq == system_freezable_power_efficient_wq) && \ + _wq == system_freezable_power_efficient_wq)) \ + __warn_flushing_systemwide_wq(); \ + __flush_workqueue(_wq); \ +}) /** * schedule_delayed_work_on - queue work in global workqueue on CPU after delay diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 4056f2a3f9d5..1ea50f6be843 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2788,13 +2788,13 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq, } /** - * flush_workqueue - ensure that any scheduled work has run to completion. + * __flush_workqueue - ensure that any scheduled work has run to completion. * @wq: workqueue to flush * * This function sleeps until all work items which were queued on entry * have finished execution, but it is not livelocked by new incoming ones. */ -void flush_workqueue(struct workqueue_struct *wq) +void __flush_workqueue(struct workqueue_struct *wq) { struct wq_flusher this_flusher = { .list = LIST_HEAD_INIT(this_flusher.list), @@ -2943,7 +2943,7 @@ void flush_workqueue(struct workqueue_struct *wq) out_unlock: mutex_unlock(&wq->mutex); } -EXPORT_SYMBOL(flush_workqueue); +EXPORT_SYMBOL(__flush_workqueue); /** * drain_workqueue - drain a workqueue @@ -2971,7 +2971,7 @@ void drain_workqueue(struct workqueue_struct *wq) wq->flags |= __WQ_DRAINING; mutex_unlock(&wq->mutex); reflush: - flush_workqueue(wq); + __flush_workqueue(wq); mutex_lock(&wq->mutex); @@ -6111,3 +6111,11 @@ void __init workqueue_init(void) wq_online = true; wq_watchdog_init(); } + +/* + * Despite the naming, this is a no-op function which is here only for avoiding + * link error. Since compile-time warning may fail to catch, we will need to + * emit run-time warning from __flush_workqueue(). + */ +void __warn_flushing_systemwide_wq(void) { } +EXPORT_SYMBOL(__warn_flushing_systemwide_wq); -- cgit From 873a400938b31a1e443c4d94b560b78300787540 Mon Sep 17 00:00:00 2001 From: Wonhyuk Yang Date: Wed, 4 May 2022 11:32:03 +0900 Subject: workqueue: Fix type of cpu in trace event The trace event "workqueue_queue_work" use unsigned int type for req_cpu, cpu. This casue confusing cpu number like below log. $ cat /sys/kernel/debug/tracing/trace cat-317 [001] ...: workqueue_queue_work: ... req_cpu=8192 cpu=4294967295 So, change unsigned type to signed type in the trace event. After applying this patch, cpu number will be printed as -1 instead of 4294967295 as folllows. $ cat /sys/kernel/debug/tracing/trace cat-1338 [002] ...: workqueue_queue_work: ... req_cpu=8192 cpu=-1 Cc: Baik Song An Cc: Hong Yeon Kim Cc: Taeung Song Cc: linuxgeek@linuxgeek.io Signed-off-by: Wonhyuk Yang Signed-off-by: Tejun Heo --- include/trace/events/workqueue.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/include/trace/events/workqueue.h b/include/trace/events/workqueue.h index 6154a2e72bce..262d52021c23 100644 --- a/include/trace/events/workqueue.h +++ b/include/trace/events/workqueue.h @@ -22,7 +22,7 @@ struct pool_workqueue; */ TRACE_EVENT(workqueue_queue_work, - TP_PROTO(unsigned int req_cpu, struct pool_workqueue *pwq, + TP_PROTO(int req_cpu, struct pool_workqueue *pwq, struct work_struct *work), TP_ARGS(req_cpu, pwq, work), @@ -31,8 +31,8 @@ TRACE_EVENT(workqueue_queue_work, __field( void *, work ) __field( void *, function) __string( workqueue, pwq->wq->name) - __field( unsigned int, req_cpu ) - __field( unsigned int, cpu ) + __field( int, req_cpu ) + __field( int, cpu ) ), TP_fast_assign( @@ -43,7 +43,7 @@ TRACE_EVENT(workqueue_queue_work, __entry->cpu = pwq->pool->cpu; ), - TP_printk("work struct=%p function=%ps workqueue=%s req_cpu=%u cpu=%u", + TP_printk("work struct=%p function=%ps workqueue=%s req_cpu=%d cpu=%d", __entry->work, __entry->function, __get_str(workqueue), __entry->req_cpu, __entry->cpu) ); -- cgit From f92de9d110429e39929a49240d823251c2fe903e Mon Sep 17 00:00:00 2001 From: Tyler Erickson Date: Thu, 2 Jun 2022 16:51:13 -0600 Subject: scsi: sd: Fix interpretation of VPD B9h length Fixing the interpretation of the length of the B9h VPD page (Concurrent Positioning Ranges). Adding 4 is necessary as the first 4 bytes of the page is the header with page number and length information. Adding 3 was likely a misinterpretation of the SBC-5 specification which sets all offsets starting at zero. This fixes the error in dmesg: [ 9.014456] sd 1:0:0:0: [sda] Invalid Concurrent Positioning Ranges VPD page Link: https://lore.kernel.org/r/20220602225113.10218-4-tyler.erickson@seagate.com Fixes: e815d36548f0 ("scsi: sd: add concurrent positioning ranges support") Cc: stable@vger.kernel.org Tested-by: Michael English Reviewed-by: Muhammad Ahmad Reviewed-by: Damien Le Moal Reviewed-by: Hannes Reinecke Signed-off-by: Tyler Erickson Signed-off-by: Christoph Hellwig Signed-off-by: Martin K. Petersen --- drivers/scsi/sd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 895b56c8f25e..a1a2ac09066f 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3072,7 +3072,7 @@ static void sd_read_cpr(struct scsi_disk *sdkp) goto out; /* We must have at least a 64B header and one 32B range descriptor */ - vpd_len = get_unaligned_be16(&buffer[2]) + 3; + vpd_len = get_unaligned_be16(&buffer[2]) + 4; if (vpd_len > buf_len || vpd_len < 64 + 32 || (vpd_len & 31)) { sd_printk(KERN_ERR, sdkp, "Invalid Concurrent Positioning Ranges VPD page\n"); -- cgit From cf71d59c2eceadfcde0fb52e237990a0909880d7 Mon Sep 17 00:00:00 2001 From: Wentao Wang Date: Thu, 2 Jun 2022 08:57:00 +0000 Subject: scsi: vmw_pvscsi: Expand vcpuHint to 16 bits vcpuHint has been expanded to 16 bit on host to enable routing to more CPUs. Guest side should align with the change. This change has been tested with hosts with 8-bit and 16-bit vcpuHint, on both platforms host side can get correct value. Link: https://lore.kernel.org/r/EF35F4D5-5DCC-42C5-BCC4-29DF1729B24C@vmware.com Signed-off-by: Wentao Wang Signed-off-by: Martin K. Petersen --- drivers/scsi/vmw_pvscsi.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/vmw_pvscsi.h b/drivers/scsi/vmw_pvscsi.h index 51a82f7803d3..9d16cf925483 100644 --- a/drivers/scsi/vmw_pvscsi.h +++ b/drivers/scsi/vmw_pvscsi.h @@ -331,8 +331,8 @@ struct PVSCSIRingReqDesc { u8 tag; u8 bus; u8 target; - u8 vcpuHint; - u8 unused[59]; + u16 vcpuHint; + u8 unused[58]; } __packed; /* -- cgit From 44ba9786b67345dc4e5eabe537c9ef2bfd889888 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:21 -0700 Subject: scsi: lpfc: Correct BDE type for XMIT_SEQ64_WQE in lpfc_ct_reject_event() A previous commit assumed all XMIT_SEQ64_WQEs are prepped with the correct BDE type in word 0-2. However, lpfc_ct_reject_event() routine was missed and is still filling out the incorrect BDE type. Fix lpfc_ct_reject_event() routine so that type BUFF_TYPE_BDE_64 is set instead of BUFF_TYPE_BLP_64. Link: https://lore.kernel.org/r/20220603174329.63777-2-jsmart2021@gmail.com Fixes: 596fc8adb171 ("scsi: lpfc: Fix dmabuf ptr assignment in lpfc_ct_reject_event()") Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_ct.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/lpfc/lpfc_ct.c b/drivers/scsi/lpfc/lpfc_ct.c index 9d36b20fb878..13dfe285493d 100644 --- a/drivers/scsi/lpfc/lpfc_ct.c +++ b/drivers/scsi/lpfc/lpfc_ct.c @@ -197,7 +197,7 @@ lpfc_ct_reject_event(struct lpfc_nodelist *ndlp, memset(bpl, 0, sizeof(struct ulp_bde64)); bpl->addrHigh = le32_to_cpu(putPaddrHigh(mp->phys)); bpl->addrLow = le32_to_cpu(putPaddrLow(mp->phys)); - bpl->tus.f.bdeFlags = BUFF_TYPE_BLP_64; + bpl->tus.f.bdeFlags = BUFF_TYPE_BDE_64; bpl->tus.f.bdeSize = (LPFC_CT_PREAMBLE - 4); bpl->tus.w = le32_to_cpu(bpl->tus.w); -- cgit From 24e1f056677eefe834d5dcf61905cce857ca4b19 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:22 -0700 Subject: scsi: lpfc: Resolve some cleanup issues following abort path refactoring Refactoring and consolidation of abort paths: - lpfc_sli4_abort_fcp_cmpl() and lpfc_sli_abort_fcp_cmpl() are combined into a single generic lpfc_sli_abort_fcp_cmpl() routine. Thus, remove extraneous lpfc_sli4_abort_fcp_cmpl() prototype declaration. - lpfc_nvme_abort_fcreq_cmpl() abort completion routine is called with a mismatched argument type. This may result in misleading log message content. Update to the correct argument type of lpfc_iocbq instead of lpfc_wcqe_complete. The lpfc_wcqe_complete should be derived from the lpfc_iocbq structure. Link: https://lore.kernel.org/r/20220603174329.63777-3-jsmart2021@gmail.com Fixes: 31a59f75702f ("scsi: lpfc: SLI path split: Refactor Abort paths") Cc: # v5.18 Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_crtn.h | 4 +--- drivers/scsi/lpfc/lpfc_nvme.c | 6 ++++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_crtn.h b/drivers/scsi/lpfc/lpfc_crtn.h index b1be0dd0337a..f5d74958b664 100644 --- a/drivers/scsi/lpfc/lpfc_crtn.h +++ b/drivers/scsi/lpfc/lpfc_crtn.h @@ -420,8 +420,6 @@ int lpfc_sli_issue_iocb_wait(struct lpfc_hba *, uint32_t, uint32_t); void lpfc_sli_abort_fcp_cmpl(struct lpfc_hba *, struct lpfc_iocbq *, struct lpfc_iocbq *); -void lpfc_sli4_abort_fcp_cmpl(struct lpfc_hba *h, struct lpfc_iocbq *i, - struct lpfc_wcqe_complete *w); void lpfc_sli_free_hbq(struct lpfc_hba *, struct hbq_dmabuf *); @@ -630,7 +628,7 @@ void lpfc_nvmet_invalidate_host(struct lpfc_hba *phba, struct lpfc_nodelist *ndlp); void lpfc_nvme_abort_fcreq_cmpl(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb, - struct lpfc_wcqe_complete *abts_cmpl); + struct lpfc_iocbq *rspiocb); void lpfc_create_multixri_pools(struct lpfc_hba *phba); void lpfc_create_destroy_pools(struct lpfc_hba *phba); void lpfc_move_xri_pvt_to_pbl(struct lpfc_hba *phba, u32 hwqid); diff --git a/drivers/scsi/lpfc/lpfc_nvme.c b/drivers/scsi/lpfc/lpfc_nvme.c index 335e90633933..88fa630ab93a 100644 --- a/drivers/scsi/lpfc/lpfc_nvme.c +++ b/drivers/scsi/lpfc/lpfc_nvme.c @@ -1787,7 +1787,7 @@ lpfc_nvme_fcp_io_submit(struct nvme_fc_local_port *pnvme_lport, * lpfc_nvme_abort_fcreq_cmpl - Complete an NVME FCP abort request. * @phba: Pointer to HBA context object * @cmdiocb: Pointer to command iocb object. - * @abts_cmpl: Pointer to wcqe complete object. + * @rspiocb: Pointer to response iocb object. * * This is the callback function for any NVME FCP IO that was aborted. * @@ -1796,8 +1796,10 @@ lpfc_nvme_fcp_io_submit(struct nvme_fc_local_port *pnvme_lport, **/ void lpfc_nvme_abort_fcreq_cmpl(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb, - struct lpfc_wcqe_complete *abts_cmpl) + struct lpfc_iocbq *rspiocb) { + struct lpfc_wcqe_complete *abts_cmpl = &rspiocb->wcqe_cmpl; + lpfc_printf_log(phba, KERN_INFO, LOG_NVME, "6145 ABORT_XRI_CN completing on rpi x%x " "original iotag x%x, abort cmd iotag x%x " -- cgit From e27f05147bff21408c1b8410ad8e90cd286e7952 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:23 -0700 Subject: scsi: lpfc: Resolve some cleanup issues following SLI path refactoring Following refactoring and consolidation in SLI processing, fix up some minor issues related to SLI path: - Correct the setting of LPFC_EXCHANGE_BUSY flag in response IOCB. - Fix some typographical errors. - Fix duplicate log messages. Link: https://lore.kernel.org/r/20220603174329.63777-4-jsmart2021@gmail.com Fixes: 1b64aa9eae28 ("scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4") Cc: # v5.18 Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_init.c | 2 +- drivers/scsi/lpfc/lpfc_sli.c | 25 ++++++++++++------------- 2 files changed, 13 insertions(+), 14 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c index 93b94c64518d..750dd1e9f2cc 100644 --- a/drivers/scsi/lpfc/lpfc_init.c +++ b/drivers/scsi/lpfc/lpfc_init.c @@ -12188,7 +12188,7 @@ lpfc_sli_enable_msi(struct lpfc_hba *phba) rc = pci_enable_msi(phba->pcidev); if (!rc) lpfc_printf_log(phba, KERN_INFO, LOG_INIT, - "0462 PCI enable MSI mode success.\n"); + "0012 PCI enable MSI mode success.\n"); else { lpfc_printf_log(phba, KERN_INFO, LOG_INIT, "0471 PCI enable MSI mode failed (%d)\n", rc); diff --git a/drivers/scsi/lpfc/lpfc_sli.c b/drivers/scsi/lpfc/lpfc_sli.c index 6ed696c4602a..80ac3a051c19 100644 --- a/drivers/scsi/lpfc/lpfc_sli.c +++ b/drivers/scsi/lpfc/lpfc_sli.c @@ -1930,7 +1930,7 @@ lpfc_issue_cmf_sync_wqe(struct lpfc_hba *phba, u32 ms, u64 total) sync_buf = __lpfc_sli_get_iocbq(phba); if (!sync_buf) { lpfc_printf_log(phba, KERN_ERR, LOG_CGN_MGMT, - "6213 No available WQEs for CMF_SYNC_WQE\n"); + "6244 No available WQEs for CMF_SYNC_WQE\n"); ret_val = ENOMEM; goto out_unlock; } @@ -3805,7 +3805,7 @@ lpfc_sli_process_sol_iocb(struct lpfc_hba *phba, struct lpfc_sli_ring *pring, set_job_ulpword4(cmdiocbp, IOERR_ABORT_REQUESTED); /* - * For SLI4, irsiocb contains + * For SLI4, irspiocb contains * NO_XRI in sli_xritag, it * shall not affect releasing * sgl (xri) process. @@ -3823,7 +3823,7 @@ lpfc_sli_process_sol_iocb(struct lpfc_hba *phba, struct lpfc_sli_ring *pring, } } } - (cmdiocbp->cmd_cmpl) (phba, cmdiocbp, saveq); + cmdiocbp->cmd_cmpl(phba, cmdiocbp, saveq); } else lpfc_sli_release_iocbq(phba, cmdiocbp); } else { @@ -4063,8 +4063,7 @@ lpfc_sli_handle_fast_ring_event(struct lpfc_hba *phba, cmdiocbq->cmd_flag &= ~LPFC_DRIVER_ABORTED; if (cmdiocbq->cmd_cmpl) { spin_unlock_irqrestore(&phba->hbalock, iflag); - (cmdiocbq->cmd_cmpl)(phba, cmdiocbq, - &rspiocbq); + cmdiocbq->cmd_cmpl(phba, cmdiocbq, &rspiocbq); spin_lock_irqsave(&phba->hbalock, iflag); } break; @@ -10288,7 +10287,7 @@ __lpfc_sli_issue_iocb_s3(struct lpfc_hba *phba, uint32_t ring_number, * @flag: Flag indicating if this command can be put into txq. * * __lpfc_sli_issue_fcp_io_s3 is wrapper function to invoke lockless func to - * send an iocb command to an HBA with SLI-4 interface spec. + * send an iocb command to an HBA with SLI-3 interface spec. * * This function takes the hbalock before invoking the lockless version. * The function will return success after it successfully submit the wqe to @@ -12740,7 +12739,7 @@ lpfc_sli_wake_iocb_wait(struct lpfc_hba *phba, cmdiocbq->cmd_cmpl = cmdiocbq->wait_cmd_cmpl; cmdiocbq->wait_cmd_cmpl = NULL; if (cmdiocbq->cmd_cmpl) - (cmdiocbq->cmd_cmpl)(phba, cmdiocbq, NULL); + cmdiocbq->cmd_cmpl(phba, cmdiocbq, NULL); else lpfc_sli_release_iocbq(phba, cmdiocbq); return; @@ -12754,9 +12753,9 @@ lpfc_sli_wake_iocb_wait(struct lpfc_hba *phba, /* Set the exchange busy flag for task management commands */ if ((cmdiocbq->cmd_flag & LPFC_IO_FCP) && - !(cmdiocbq->cmd_flag & LPFC_IO_LIBDFC)) { + !(cmdiocbq->cmd_flag & LPFC_IO_LIBDFC)) { lpfc_cmd = container_of(cmdiocbq, struct lpfc_io_buf, - cur_iocbq); + cur_iocbq); if (rspiocbq && (rspiocbq->cmd_flag & LPFC_EXCHANGE_BUSY)) lpfc_cmd->flags |= LPFC_SBUF_XBUSY; else @@ -13896,7 +13895,7 @@ void lpfc_sli4_els_xri_abort_event_proc(struct lpfc_hba *phba) * @irspiocbq: Pointer to work-queue completion queue entry. * * This routine handles an ELS work-queue completion event and construct - * a pseudo response ELS IODBQ from the SLI4 ELS WCQE for the common + * a pseudo response ELS IOCBQ from the SLI4 ELS WCQE for the common * discovery engine to handle. * * Return: Pointer to the receive IOCBQ, NULL otherwise. @@ -13940,7 +13939,7 @@ lpfc_sli4_els_preprocess_rspiocbq(struct lpfc_hba *phba, if (bf_get(lpfc_wcqe_c_xb, wcqe)) { spin_lock_irqsave(&phba->hbalock, iflags); - cmdiocbq->cmd_flag |= LPFC_EXCHANGE_BUSY; + irspiocbq->cmd_flag |= LPFC_EXCHANGE_BUSY; spin_unlock_irqrestore(&phba->hbalock, iflags); } @@ -14799,7 +14798,7 @@ lpfc_sli4_fp_handle_fcp_wcqe(struct lpfc_hba *phba, struct lpfc_queue *cq, /* Pass the cmd_iocb and the wcqe to the upper layer */ memcpy(&cmdiocbq->wcqe_cmpl, wcqe, sizeof(struct lpfc_wcqe_complete)); - (cmdiocbq->cmd_cmpl)(phba, cmdiocbq, cmdiocbq); + cmdiocbq->cmd_cmpl(phba, cmdiocbq, cmdiocbq); } else { lpfc_printf_log(phba, KERN_WARNING, LOG_SLI, "0375 FCP cmdiocb not callback function " @@ -18956,7 +18955,7 @@ lpfc_sli4_send_seq_to_ulp(struct lpfc_vport *vport, /* Free iocb created in lpfc_prep_seq */ list_for_each_entry_safe(curr_iocb, next_iocb, - &iocbq->list, list) { + &iocbq->list, list) { list_del_init(&curr_iocb->list); lpfc_sli_release_iocbq(phba, curr_iocb); } -- cgit From 6f808bd78e8296b4ded813b7182988d57e1f6176 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:24 -0700 Subject: scsi: lpfc: Address NULL pointer dereference after starget_to_rport() Calls to starget_to_rport() may return NULL. Add check for NULL rport before dereference. Link: https://lore.kernel.org/r/20220603174329.63777-5-jsmart2021@gmail.com Fixes: bb21fc9911ee ("scsi: lpfc: Use fc_block_rport()") Cc: # v5.18 Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_scsi.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c index d43968203248..ba5e4016262e 100644 --- a/drivers/scsi/lpfc/lpfc_scsi.c +++ b/drivers/scsi/lpfc/lpfc_scsi.c @@ -6062,6 +6062,9 @@ lpfc_device_reset_handler(struct scsi_cmnd *cmnd) int status; u32 logit = LOG_FCP; + if (!rport) + return FAILED; + rdata = rport->dd_data; if (!rdata || !rdata->pnode) { lpfc_printf_vlog(vport, KERN_ERR, LOG_TRACE_EVENT, @@ -6140,6 +6143,9 @@ lpfc_target_reset_handler(struct scsi_cmnd *cmnd) unsigned long flags; DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq); + if (!rport) + return FAILED; + rdata = rport->dd_data; if (!rdata || !rdata->pnode) { lpfc_printf_vlog(vport, KERN_ERR, LOG_TRACE_EVENT, -- cgit From b1b3440f437b75fb2a9b0cfe58df461e40eca474 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:25 -0700 Subject: scsi: lpfc: Resolve NULL ptr dereference after an ELS LOGO is aborted A use-after-free crash can occur after an ELS LOGO is aborted. Specifically, a nodelist structure is freed and then ndlp->vport->cfg_log_verbose is dereferenced in lpfc_nlp_get() when the discovery state machine is mistakenly called a second time with NLP_EVT_DEVICE_RM argument. Rework lpfc_cmpl_els_logo() to prevent the duplicate calls to release a nodelist structure. Link: https://lore.kernel.org/r/20220603174329.63777-6-jsmart2021@gmail.com Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_els.c | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c index 07f9a6e61e10..3fababb7c181 100644 --- a/drivers/scsi/lpfc/lpfc_els.c +++ b/drivers/scsi/lpfc/lpfc_els.c @@ -2998,10 +2998,7 @@ lpfc_cmpl_els_logo(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb, ndlp->nlp_DID, ulp_status, ulp_word4); - /* Call NLP_EVT_DEVICE_RM if link is down or LOGO is aborted */ if (lpfc_error_lost_link(ulp_status, ulp_word4)) { - lpfc_disc_state_machine(vport, ndlp, cmdiocb, - NLP_EVT_DEVICE_RM); skip_recovery = 1; goto out; } @@ -3021,18 +3018,10 @@ lpfc_cmpl_els_logo(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb, spin_unlock_irq(&ndlp->lock); lpfc_disc_state_machine(vport, ndlp, cmdiocb, NLP_EVT_DEVICE_RM); - lpfc_els_free_iocb(phba, cmdiocb); - lpfc_nlp_put(ndlp); - - /* Presume the node was released. */ - return; + goto out_rsrc_free; } out: - /* Driver is done with the IO. */ - lpfc_els_free_iocb(phba, cmdiocb); - lpfc_nlp_put(ndlp); - /* At this point, the LOGO processing is complete. NOTE: For a * pt2pt topology, we are assuming the NPortID will only change * on link up processing. For a LOGO / PLOGI initiated by the @@ -3059,6 +3048,10 @@ out: ndlp->nlp_DID, ulp_status, ulp_word4, tmo, vport->num_disc_nodes); + + lpfc_els_free_iocb(phba, cmdiocb); + lpfc_nlp_put(ndlp); + lpfc_disc_start(vport); return; } @@ -3075,6 +3068,10 @@ out: lpfc_disc_state_machine(vport, ndlp, cmdiocb, NLP_EVT_DEVICE_RM); } +out_rsrc_free: + /* Driver is done with the I/O. */ + lpfc_els_free_iocb(phba, cmdiocb); + lpfc_nlp_put(ndlp); } /** -- cgit From 336d63615466b4c06b9401c987813fd19bdde39b Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:26 -0700 Subject: scsi: lpfc: Fix port stuck in bypassed state after LIP in PT2PT topology After issuing a LIP, a specific target vendor does not ACC the FLOGI that lpfc sends. However, it does send its own FLOGI that lpfc ACCs. The target then establishes the port IDs by sending a PLOGI. lpfc PLOGI_ACCs and starts the RPI registration for DID 0x000001. The target then sends a LOGO to the fabric DID. lpfc is currently treating the LOGO from the fabric DID as a link down and cleans up all the ndlps. The ndlp for DID 0x000001 is put back into NPR and discovery stops, leaving the port in stuck in bypassed mode. Change lpfc behavior such that if a LOGO is received for the fabric DID in PT2PT topology skip the lpfc_linkdown_port() routine and just move the fabric DID back to NPR. Link: https://lore.kernel.org/r/20220603174329.63777-7-jsmart2021@gmail.com Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_nportdisc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/lpfc/lpfc_nportdisc.c b/drivers/scsi/lpfc/lpfc_nportdisc.c index 639f86635127..b86ff9fcdf0c 100644 --- a/drivers/scsi/lpfc/lpfc_nportdisc.c +++ b/drivers/scsi/lpfc/lpfc_nportdisc.c @@ -834,7 +834,8 @@ lpfc_rcv_logo(struct lpfc_vport *vport, struct lpfc_nodelist *ndlp, lpfc_nvmet_invalidate_host(phba, ndlp); if (ndlp->nlp_DID == Fabric_DID) { - if (vport->port_state <= LPFC_FDISC) + if (vport->port_state <= LPFC_FDISC || + vport->fc_flag & FC_PT2PT) goto out; lpfc_linkdown_port(vport); spin_lock_irq(shost->host_lock); -- cgit From ea7bd1f393311e823716a232e9d8857fb64eb105 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:27 -0700 Subject: scsi: lpfc: Add more logging of cmd and cqe information for aborted NVMe cmds When an NVMe command is aborted or completes with an ERSP, log the opcode and command ID fields to help provide more detail on the failed command. Link: https://lore.kernel.org/r/20220603174329.63777-8-jsmart2021@gmail.com Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_nvme.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_nvme.c b/drivers/scsi/lpfc/lpfc_nvme.c index 88fa630ab93a..9cc22cefcb37 100644 --- a/drivers/scsi/lpfc/lpfc_nvme.c +++ b/drivers/scsi/lpfc/lpfc_nvme.c @@ -1065,25 +1065,37 @@ lpfc_nvme_io_cmd_cmpl(struct lpfc_hba *phba, struct lpfc_iocbq *pwqeIn, nCmd->rcv_rsplen = wcqe->parameter; nCmd->status = 0; + /* Get the NVME cmd details for this unique error. */ + cp = (struct nvme_fc_cmd_iu *)nCmd->cmdaddr; + ep = (struct nvme_fc_ersp_iu *)nCmd->rspaddr; + /* Check if this is really an ERSP */ if (nCmd->rcv_rsplen == LPFC_NVME_ERSP_LEN) { lpfc_ncmd->status = IOSTAT_SUCCESS; lpfc_ncmd->result = 0; lpfc_printf_vlog(vport, KERN_INFO, LOG_NVME, - "6084 NVME Completion ERSP: " - "xri %x placed x%x\n", - lpfc_ncmd->cur_iocbq.sli4_xritag, - wcqe->total_data_placed); + "6084 NVME FCP_ERR ERSP: " + "xri %x placed x%x opcode x%x cmd_id " + "x%x cqe_status x%x\n", + lpfc_ncmd->cur_iocbq.sli4_xritag, + wcqe->total_data_placed, + cp->sqe.common.opcode, + cp->sqe.common.command_id, + ep->cqe.status); break; } lpfc_printf_vlog(vport, KERN_ERR, LOG_TRACE_EVENT, "6081 NVME Completion Protocol Error: " "xri %x status x%x result x%x " - "placed x%x\n", + "placed x%x opcode x%x cmd_id x%x, " + "cqe_status x%x\n", lpfc_ncmd->cur_iocbq.sli4_xritag, lpfc_ncmd->status, lpfc_ncmd->result, - wcqe->total_data_placed); + wcqe->total_data_placed, + cp->sqe.common.opcode, + cp->sqe.common.command_id, + ep->cqe.status); break; case IOSTAT_LOCAL_REJECT: /* Let fall through to set command final state. */ @@ -1842,6 +1854,7 @@ lpfc_nvme_fcp_abort(struct nvme_fc_local_port *pnvme_lport, struct lpfc_nvme_fcpreq_priv *freqpriv; unsigned long flags; int ret_val; + struct nvme_fc_cmd_iu *cp; /* Validate pointers. LLDD fault handling with transport does * have timing races. @@ -1965,10 +1978,16 @@ lpfc_nvme_fcp_abort(struct nvme_fc_local_port *pnvme_lport, return; } + /* + * Get Command Id from cmd to plug into response. This + * code is not needed in the next NVME Transport drop. + */ + cp = (struct nvme_fc_cmd_iu *)lpfc_nbuf->nvmeCmd->cmdaddr; lpfc_printf_vlog(vport, KERN_INFO, LOG_NVME_ABTS, "6138 Transport Abort NVME Request Issued for " - "ox_id x%x\n", - nvmereq_wqe->sli4_xritag); + "ox_id x%x nvme opcode x%x nvme cmd_id x%x\n", + nvmereq_wqe->sli4_xritag, cp->sqe.common.opcode, + cp->sqe.common.command_id); return; out_unlock: -- cgit From 2e7e9c0c1ec05f18d320ecc8a31eec59d2af1af9 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:28 -0700 Subject: scsi: lpfc: Allow reduced polling rate for nvme_admin_async_event cmd completion NVMe Asynchronous Event Request commands have no command timeout value per specifications. Set WQE option to allow a reduced FLUSH polling rate for I/O error detection specifically for nvme_admin_async_event commands. Link: https://lore.kernel.org/r/20220603174329.63777-9-jsmart2021@gmail.com Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_hw4.h | 3 +++ drivers/scsi/lpfc/lpfc_nvme.c | 11 +++++++++-- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_hw4.h b/drivers/scsi/lpfc/lpfc_hw4.h index 8511369d2cf8..f024415731ac 100644 --- a/drivers/scsi/lpfc/lpfc_hw4.h +++ b/drivers/scsi/lpfc/lpfc_hw4.h @@ -4487,6 +4487,9 @@ struct wqe_common { #define wqe_sup_SHIFT 6 #define wqe_sup_MASK 0x00000001 #define wqe_sup_WORD word11 +#define wqe_ffrq_SHIFT 6 +#define wqe_ffrq_MASK 0x00000001 +#define wqe_ffrq_WORD word11 #define wqe_wqec_SHIFT 7 #define wqe_wqec_MASK 0x00000001 #define wqe_wqec_WORD word11 diff --git a/drivers/scsi/lpfc/lpfc_nvme.c b/drivers/scsi/lpfc/lpfc_nvme.c index 9cc22cefcb37..cd10ee6482fc 100644 --- a/drivers/scsi/lpfc/lpfc_nvme.c +++ b/drivers/scsi/lpfc/lpfc_nvme.c @@ -1207,7 +1207,8 @@ lpfc_nvme_prep_io_cmd(struct lpfc_vport *vport, { struct lpfc_hba *phba = vport->phba; struct nvmefc_fcp_req *nCmd = lpfc_ncmd->nvmeCmd; - struct lpfc_iocbq *pwqeq = &(lpfc_ncmd->cur_iocbq); + struct nvme_common_command *sqe; + struct lpfc_iocbq *pwqeq = &lpfc_ncmd->cur_iocbq; union lpfc_wqe128 *wqe = &pwqeq->wqe; uint32_t req_len; @@ -1264,8 +1265,14 @@ lpfc_nvme_prep_io_cmd(struct lpfc_vport *vport, cstat->control_requests++; } - if (pnode->nlp_nvme_info & NLP_NVME_NSLER) + if (pnode->nlp_nvme_info & NLP_NVME_NSLER) { bf_set(wqe_erp, &wqe->generic.wqe_com, 1); + sqe = &((struct nvme_fc_cmd_iu *) + nCmd->cmdaddr)->sqe.common; + if (sqe->opcode == nvme_admin_async_event) + bf_set(wqe_ffrq, &wqe->generic.wqe_com, 1); + } + /* * Finish initializing those WQE fields that are independent * of the nvme_cmnd request_buffer -- cgit From 1af48fffd7ffe280e0c225659d826fd5ae802a08 Mon Sep 17 00:00:00 2001 From: James Smart Date: Fri, 3 Jun 2022 10:43:29 -0700 Subject: scsi: lpfc: Update lpfc version to 14.2.0.4 Update lpfc version to 14.2.0.4 Link: https://lore.kernel.org/r/20220603174329.63777-10-jsmart2021@gmail.com Co-developed-by: Justin Tee Signed-off-by: Justin Tee Signed-off-by: James Smart Signed-off-by: Martin K. Petersen --- drivers/scsi/lpfc/lpfc_version.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/lpfc/lpfc_version.h b/drivers/scsi/lpfc/lpfc_version.h index 4fab79ed58ed..2ab6f7db64d8 100644 --- a/drivers/scsi/lpfc/lpfc_version.h +++ b/drivers/scsi/lpfc/lpfc_version.h @@ -20,7 +20,7 @@ * included with this package. * *******************************************************************/ -#define LPFC_DRIVER_VERSION "14.2.0.3" +#define LPFC_DRIVER_VERSION "14.2.0.4" #define LPFC_DRIVER_NAME "lpfc" /* Used for SLI 2/3 */ -- cgit From 120f1d95efb1cdb6fe023c84e38ba06d8f78cd03 Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Tue, 31 May 2022 22:09:27 +0200 Subject: scsi: mpt3sas: Fix out-of-bounds compiler warning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit I'm facing this warning when building for the parisc64 architecture: drivers/scsi/mpt3sas/mpt3sas_base.c: In function ‘_base_make_ioc_operational’: drivers/scsi/mpt3sas/mpt3sas_base.c:5396:40: warning: array subscript ‘Mpi2SasIOUnitPage1_t {aka struct _MPI2_CONFIG_PAGE_SASIOUNIT_1}[0]’ is partly outside array bounds of ‘unsigned char[20]’ [-Warray-bounds] 5396 | (le16_to_cpu(sas_iounit_pg1->SASWideMaxQueueDepth)) ? drivers/scsi/mpt3sas/mpt3sas_base.c:5382:26: note: referencing an object of size 20 allocated by ‘kzalloc’ 5382 | sas_iounit_pg1 = kzalloc(sz, GFP_KERNEL); | ^~~~~~~~~~~~~~~~~~~~~~~ The problem is, that only 20 bytes are allocated with kmalloc(), which is sufficient to hold the bytes which are needed. Nevertheless, gcc complains because the whole Mpi2SasIOUnitPage1_t struct is 32 bytes in size and thus doesn't fit into those 20 bytes. This patch simply allocates all 32 bytes (instead of 20) and thus avoids the warning. There is no functional change introduced by this patch. While touching the code I cleaned up to calculation of max_wideport_qd, max_narrowport_qd and max_sata_qd to make it easier readable. Test successfully tested on a HP C8000 PA-RISC workstation with 64-bit kernel. Link: https://lore.kernel.org/r/YpZ197iZdDZSCzrT@p100 Signed-off-by: Helge Deller Signed-off-by: Martin K. Petersen --- drivers/scsi/mpt3sas/mpt3sas_base.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c index 37d46ae5c61d..9a1ae52bb621 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_base.c +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c @@ -5369,6 +5369,7 @@ static int _base_assign_fw_reported_qd(struct MPT3SAS_ADAPTER *ioc) Mpi2ConfigReply_t mpi_reply; Mpi2SasIOUnitPage1_t *sas_iounit_pg1 = NULL; Mpi26PCIeIOUnitPage1_t pcie_iounit_pg1; + u16 depth; int sz; int rc = 0; @@ -5380,7 +5381,7 @@ static int _base_assign_fw_reported_qd(struct MPT3SAS_ADAPTER *ioc) goto out; /* sas iounit page 1 */ sz = offsetof(Mpi2SasIOUnitPage1_t, PhyData); - sas_iounit_pg1 = kzalloc(sz, GFP_KERNEL); + sas_iounit_pg1 = kzalloc(sizeof(Mpi2SasIOUnitPage1_t), GFP_KERNEL); if (!sas_iounit_pg1) { pr_err("%s: failure at %s:%d/%s()!\n", ioc->name, __FILE__, __LINE__, __func__); @@ -5393,16 +5394,16 @@ static int _base_assign_fw_reported_qd(struct MPT3SAS_ADAPTER *ioc) ioc->name, __FILE__, __LINE__, __func__); goto out; } - ioc->max_wideport_qd = - (le16_to_cpu(sas_iounit_pg1->SASWideMaxQueueDepth)) ? - le16_to_cpu(sas_iounit_pg1->SASWideMaxQueueDepth) : - MPT3SAS_SAS_QUEUE_DEPTH; - ioc->max_narrowport_qd = - (le16_to_cpu(sas_iounit_pg1->SASNarrowMaxQueueDepth)) ? - le16_to_cpu(sas_iounit_pg1->SASNarrowMaxQueueDepth) : - MPT3SAS_SAS_QUEUE_DEPTH; - ioc->max_sata_qd = (sas_iounit_pg1->SATAMaxQDepth) ? - sas_iounit_pg1->SATAMaxQDepth : MPT3SAS_SATA_QUEUE_DEPTH; + + depth = le16_to_cpu(sas_iounit_pg1->SASWideMaxQueueDepth); + ioc->max_wideport_qd = (depth ? depth : MPT3SAS_SAS_QUEUE_DEPTH); + + depth = le16_to_cpu(sas_iounit_pg1->SASNarrowMaxQueueDepth); + ioc->max_narrowport_qd = (depth ? depth : MPT3SAS_SAS_QUEUE_DEPTH); + + depth = sas_iounit_pg1->SATAMaxQDepth; + ioc->max_sata_qd = (depth ? depth : MPT3SAS_SATA_QUEUE_DEPTH); + /* pcie iounit page 1 */ rc = mpt3sas_config_get_pcie_iounit_pg1(ioc, &mpi_reply, &pcie_iounit_pg1, sizeof(Mpi26PCIeIOUnitPage1_t)); -- cgit From d64c491911322af1dcada98e5b9ee0d87e8c8fee Mon Sep 17 00:00:00 2001 From: Chengguang Xu Date: Sun, 29 May 2022 23:34:53 +0800 Subject: scsi: ipr: Fix missing/incorrect resource cleanup in error case Fix missing resource cleanup (when '(--i) == 0') for error case in ipr_alloc_mem() and skip incorrect resource cleanup (when '(--i) == 0') for error case in ipr_request_other_msi_irqs() because variable i started from 1. Link: https://lore.kernel.org/r/20220529153456.4183738-4-cgxu519@mykernel.net Reviewed-by: Dan Carpenter Acked-by: Brian King Signed-off-by: Chengguang Xu Signed-off-by: Martin K. Petersen --- drivers/scsi/ipr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c index 256ec6d08c16..9d01a3e3c26a 100644 --- a/drivers/scsi/ipr.c +++ b/drivers/scsi/ipr.c @@ -9795,7 +9795,7 @@ static int ipr_alloc_mem(struct ipr_ioa_cfg *ioa_cfg) GFP_KERNEL); if (!ioa_cfg->hrrq[i].host_rrq) { - while (--i > 0) + while (--i >= 0) dma_free_coherent(&pdev->dev, sizeof(u32) * ioa_cfg->hrrq[i].size, ioa_cfg->hrrq[i].host_rrq, @@ -10068,7 +10068,7 @@ static int ipr_request_other_msi_irqs(struct ipr_ioa_cfg *ioa_cfg, ioa_cfg->vectors_info[i].desc, &ioa_cfg->hrrq[i]); if (rc) { - while (--i >= 0) + while (--i > 0) free_irq(pci_irq_vector(pdev, i), &ioa_cfg->hrrq[i]); return rc; -- cgit From ec1e8adcbdf661c57c395bca342945f4f815add7 Mon Sep 17 00:00:00 2001 From: Chengguang Xu Date: Sun, 29 May 2022 23:34:55 +0800 Subject: scsi: pmcraid: Fix missing resource cleanup in error case Fix missing resource cleanup (when '(--i) == 0') for error case in pmcraid_register_interrupt_handler(). Link: https://lore.kernel.org/r/20220529153456.4183738-6-cgxu519@mykernel.net Reviewed-by: Dan Carpenter Signed-off-by: Chengguang Xu Signed-off-by: Martin K. Petersen --- drivers/scsi/pmcraid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c index bfce60183a6e..836ddc476764 100644 --- a/drivers/scsi/pmcraid.c +++ b/drivers/scsi/pmcraid.c @@ -4031,7 +4031,7 @@ pmcraid_register_interrupt_handler(struct pmcraid_instance *pinstance) return 0; out_unwind: - while (--i > 0) + while (--i >= 0) free_irq(pci_irq_vector(pdev, i), &pinstance->hrrq_vector[i]); pci_free_irq_vectors(pdev); return rc; -- cgit From 255b4658c809e194bc10236ac24a722ec14a83d6 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Sun, 5 Jun 2022 16:19:53 +0800 Subject: LoongArch: Fix the !CONFIG_SMP build 1, We assume arch/loongarch/include/asm/smp.h be included in include/ linux/smp.h is valid and the reverse inclusion isn't. So remove the in arch/loongarch/include/asm/smp.h. 2, arch/loongarch/include/asm/smp.h is only needed when CONFIG_SMP, and setup.c include it only because it need plat_smp_setup(). So, reorganize setup.c & smp.h, and then remove in setup.c. 3, Fix cacheinfo.c and percpu.h build error by adding the missing header files when !CONFIG_SMP. 4, Fix acpi.c build error by adding CONFIG_SMP guards. 5, Move irq_stat definition from smp.c to irq.c and fix its declaration. 6, Select CONFIG_SMP for CONFIG_NUMA, similar as other architectures do. Signed-off-by: Huacai Chen --- arch/loongarch/Kconfig | 1 + arch/loongarch/include/asm/hardirq.h | 2 +- arch/loongarch/include/asm/percpu.h | 1 + arch/loongarch/include/asm/smp.h | 23 +++++++---------------- arch/loongarch/kernel/acpi.c | 4 ++++ arch/loongarch/kernel/cacheinfo.c | 1 + arch/loongarch/kernel/irq.c | 7 ++++++- arch/loongarch/kernel/setup.c | 5 ++--- arch/loongarch/kernel/smp.c | 2 -- 9 files changed, 23 insertions(+), 23 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 80657bf83b05..1920d52653b4 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -343,6 +343,7 @@ config NR_CPUS config NUMA bool "NUMA Support" + select SMP select ACPI_NUMA if ACPI help Say Y to compile the kernel with NUMA (Non-Uniform Memory Access) diff --git a/arch/loongarch/include/asm/hardirq.h b/arch/loongarch/include/asm/hardirq.h index befe8184aa08..0ef3b18f8980 100644 --- a/arch/loongarch/include/asm/hardirq.h +++ b/arch/loongarch/include/asm/hardirq.h @@ -19,7 +19,7 @@ typedef struct { unsigned int __softirq_pending; } ____cacheline_aligned irq_cpustat_t; -DECLARE_PER_CPU_ALIGNED(irq_cpustat_t, irq_stat); +DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); #define __ARCH_IRQ_STAT diff --git a/arch/loongarch/include/asm/percpu.h b/arch/loongarch/include/asm/percpu.h index 34f15a6fb1e7..e6569f18c6dd 100644 --- a/arch/loongarch/include/asm/percpu.h +++ b/arch/loongarch/include/asm/percpu.h @@ -6,6 +6,7 @@ #define __ASM_PERCPU_H #include +#include /* Use r21 for fast access */ register unsigned long __my_cpu_offset __asm__("$r21"); diff --git a/arch/loongarch/include/asm/smp.h b/arch/loongarch/include/asm/smp.h index 551e1f37c705..71189b28bfb2 100644 --- a/arch/loongarch/include/asm/smp.h +++ b/arch/loongarch/include/asm/smp.h @@ -9,10 +9,16 @@ #include #include #include -#include #include #include +extern int smp_num_siblings; +extern int num_processors; +extern int disabled_cpus; +extern cpumask_t cpu_sibling_map[]; +extern cpumask_t cpu_core_map[]; +extern cpumask_t cpu_foreign_map[]; + void loongson3_smp_setup(void); void loongson3_prepare_cpus(unsigned int max_cpus); void loongson3_boot_secondary(int cpu, struct task_struct *idle); @@ -25,26 +31,11 @@ int loongson3_cpu_disable(void); void loongson3_cpu_die(unsigned int cpu); #endif -#ifdef CONFIG_SMP - static inline void plat_smp_setup(void) { loongson3_smp_setup(); } -#else /* !CONFIG_SMP */ - -static inline void plat_smp_setup(void) { } - -#endif /* !CONFIG_SMP */ - -extern int smp_num_siblings; -extern int num_processors; -extern int disabled_cpus; -extern cpumask_t cpu_sibling_map[]; -extern cpumask_t cpu_core_map[]; -extern cpumask_t cpu_foreign_map[]; - static inline int raw_smp_processor_id(void) { #if defined(__VDSO__) diff --git a/arch/loongarch/kernel/acpi.c b/arch/loongarch/kernel/acpi.c index b16c3dea5eeb..bb729ee8a237 100644 --- a/arch/loongarch/kernel/acpi.c +++ b/arch/loongarch/kernel/acpi.c @@ -138,6 +138,7 @@ void __init acpi_boot_table_init(void) } } +#ifdef CONFIG_SMP static int set_processor_mask(u32 id, u32 flags) { @@ -166,15 +167,18 @@ static int set_processor_mask(u32 id, u32 flags) return cpu; } +#endif static void __init acpi_process_madt(void) { +#ifdef CONFIG_SMP int i; for (i = 0; i < NR_CPUS; i++) { __cpu_number_map[i] = -1; __cpu_logical_map[i] = -1; } +#endif loongson_sysconf.nr_cpus = num_processors; } diff --git a/arch/loongarch/kernel/cacheinfo.c b/arch/loongarch/kernel/cacheinfo.c index 8c9fe29e98f0..b38f5489d094 100644 --- a/arch/loongarch/kernel/cacheinfo.c +++ b/arch/loongarch/kernel/cacheinfo.c @@ -4,6 +4,7 @@ * * Copyright (C) 2020-2022 Loongson Technology Corporation Limited */ +#include #include /* Populates leaf and increments to next leaf */ diff --git a/arch/loongarch/kernel/irq.c b/arch/loongarch/kernel/irq.c index 4b671d305ede..b34b8d792aa4 100644 --- a/arch/loongarch/kernel/irq.c +++ b/arch/loongarch/kernel/irq.c @@ -22,6 +22,8 @@ #include DEFINE_PER_CPU(unsigned long, irq_stack); +DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); +EXPORT_PER_CPU_SYMBOL(irq_stat); struct irq_domain *cpu_domain; struct irq_domain *liointc_domain; @@ -56,8 +58,11 @@ int arch_show_interrupts(struct seq_file *p, int prec) void __init init_IRQ(void) { - int i, r, ipi_irq; + int i; +#ifdef CONFIG_SMP + int r, ipi_irq; static int ipi_dummy_dev; +#endif unsigned int order = get_order(IRQ_STACK_SIZE); struct page *page; diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c index 185e4035811a..c74860b53375 100644 --- a/arch/loongarch/kernel/setup.c +++ b/arch/loongarch/kernel/setup.c @@ -39,7 +39,6 @@ #include #include #include -#include #include #define SMBIOS_BIOSSIZE_OFFSET 0x09 @@ -349,8 +348,6 @@ static void __init prefill_possible_map(void) nr_cpu_ids = possible; } -#else -static inline void prefill_possible_map(void) {} #endif void __init setup_arch(char **cmdline_p) @@ -367,8 +364,10 @@ void __init setup_arch(char **cmdline_p) arch_mem_init(cmdline_p); resource_init(); +#ifdef CONFIG_SMP plat_smp_setup(); prefill_possible_map(); +#endif paging_init(); } diff --git a/arch/loongarch/kernel/smp.c b/arch/loongarch/kernel/smp.c index b8c53b755a25..73cec62504fb 100644 --- a/arch/loongarch/kernel/smp.c +++ b/arch/loongarch/kernel/smp.c @@ -66,8 +66,6 @@ static cpumask_t cpu_core_setup_map; struct secondary_data cpuboot_data; static DEFINE_PER_CPU(int, cpu_state); -DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat); -EXPORT_PER_CPU_SYMBOL(irq_stat); enum ipi_msg_type { IPI_RESCHEDULE, -- cgit From 0626e1c9f3e5536545431538d12c762d5abf59c8 Mon Sep 17 00:00:00 2001 From: Huacai Chen Date: Sun, 5 Jun 2022 16:20:03 +0800 Subject: LoongArch: Fix copy_thread() build errors Commit c5febea0956fd387 ("fork: Pass struct kernel_clone_args into copy_thread") change the prototype of copy_thread(), while commit 5bd2e97c868a8a44 ("fork: Generalize PF_IO_WORKER handling") change the structure of kernel_clone_args. They cause build errors, so fix it. Fixes: 5bd2e97c868a8a44 ("fork: Generalize PF_IO_WORKER handling") Fixes: c5febea0956fd387 ("fork: Pass struct kernel_clone_args into copy_thread") Signed-off-by: Huacai Chen --- arch/loongarch/kernel/process.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/loongarch/kernel/process.c b/arch/loongarch/kernel/process.c index 6d944d65f600..bfa0dfe8b7d7 100644 --- a/arch/loongarch/kernel/process.c +++ b/arch/loongarch/kernel/process.c @@ -120,10 +120,12 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) /* * Copy architecture-specific thread state */ -int copy_thread(unsigned long clone_flags, unsigned long usp, - unsigned long kthread_arg, struct task_struct *p, unsigned long tls) +int copy_thread(struct task_struct *p, const struct kernel_clone_args *args) { unsigned long childksp; + unsigned long tls = args->tls; + unsigned long usp = args->stack; + unsigned long clone_flags = args->flags; struct pt_regs *childregs, *regs = current_pt_regs(); childksp = (unsigned long)task_stack_page(p) + THREAD_SIZE - 32; @@ -136,12 +138,12 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, p->thread.csr_crmd = csr_read32(LOONGARCH_CSR_CRMD); p->thread.csr_prmd = csr_read32(LOONGARCH_CSR_PRMD); p->thread.csr_ecfg = csr_read32(LOONGARCH_CSR_ECFG); - if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) { + if (unlikely(args->fn)) { /* kernel thread */ - p->thread.reg23 = usp; /* fn */ - p->thread.reg24 = kthread_arg; p->thread.reg03 = childksp; - p->thread.reg01 = (unsigned long) ret_from_kernel_thread; + p->thread.reg23 = (unsigned long)args->fn; + p->thread.reg24 = (unsigned long)args->fn_arg; + p->thread.reg01 = (unsigned long)ret_from_kernel_thread; memset(childregs, 0, sizeof(struct pt_regs)); childregs->csr_euen = p->thread.csr_euen; childregs->csr_crmd = p->thread.csr_crmd; -- cgit From 5c95fe8b02011c3b69173e0d86aff6d4c2798601 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Sun, 5 Jun 2022 16:20:08 +0800 Subject: LoongArch: Remove MIPS comment about cycle counter This comment block was taken originally from the MIPS architecture code, where indeed there are particular assumptions one can make regarding SMP and !SMP and cycle counters. On LoongArch, however, the rdtime family of functions is always available. As Xuerui wrote: The rdtime family of instructions is in fact guaranteed to be available on LoongArch; LoongArch's subsets all contain them, even the 32-bit "Primary" subset intended for university teaching -- they provide the rdtimeh.w and rdtimel.w pair of instructions that access the same 64-bit counter. So this commit simply removes the incorrect comment block. Link: https://lore.kernel.org/lkml/e78940bc-9be2-2fe7-026f-9e64a1416c9f@xen0n.name/ Fixes: b738c106f735 ("LoongArch: Add other common headers") Reviewed-by: WANG Xuerui Signed-off-by: Jason A. Donenfeld Signed-off-by: Huacai Chen --- arch/loongarch/include/asm/timex.h | 7 ------- 1 file changed, 7 deletions(-) diff --git a/arch/loongarch/include/asm/timex.h b/arch/loongarch/include/asm/timex.h index d3ed99a4fdbd..fb41e9e7a222 100644 --- a/arch/loongarch/include/asm/timex.h +++ b/arch/loongarch/include/asm/timex.h @@ -12,13 +12,6 @@ #include #include -/* - * Standard way to access the cycle counter. - * Currently only used on SMP for scheduling. - * - * We know that all SMP capable CPUs have cycle counters. - */ - typedef unsigned long cycles_t; #define get_cycles get_cycles -- cgit From 98432ccdec9f178ba041e1e5f9f32dbd71576504 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 7 Jun 2022 14:14:26 +0100 Subject: KVM: arm64: Replace vgic_v3_uaccess_read_pending with vgic_uaccess_read_pending Now that GICv2 has a proper userspace accessor for the pending state, switch GICv3 over to it, dropping the local version, moving over the specific behaviours that CGIv3 requires (such as the distinction between pending latch and line level which were never enforced with GICv2). We also gain extra locking that isn't really necessary for userspace, but that's a small price to pay for getting rid of superfluous code. Signed-off-by: Marc Zyngier Reviewed-by: Eric Auger Link: https://lore.kernel.org/r/20220607131427.1164881-3-maz@kernel.org --- arch/arm64/kvm/vgic/vgic-mmio-v3.c | 40 ++------------------------------------ arch/arm64/kvm/vgic/vgic-mmio.c | 21 +++++++++++++++++++- 2 files changed, 22 insertions(+), 39 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c b/arch/arm64/kvm/vgic/vgic-mmio-v3.c index f7aa7bcd6fb8..f15e29cc63ce 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c +++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c @@ -353,42 +353,6 @@ static unsigned long vgic_mmio_read_v3_idregs(struct kvm_vcpu *vcpu, return 0; } -static unsigned long vgic_v3_uaccess_read_pending(struct kvm_vcpu *vcpu, - gpa_t addr, unsigned int len) -{ - u32 intid = VGIC_ADDR_TO_INTID(addr, 1); - u32 value = 0; - int i; - - /* - * pending state of interrupt is latched in pending_latch variable. - * Userspace will save and restore pending state and line_level - * separately. - * Refer to Documentation/virt/kvm/devices/arm-vgic-v3.rst - * for handling of ISPENDR and ICPENDR. - */ - for (i = 0; i < len * 8; i++) { - struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); - bool state = irq->pending_latch; - - if (irq->hw && vgic_irq_is_sgi(irq->intid)) { - int err; - - err = irq_get_irqchip_state(irq->host_irq, - IRQCHIP_STATE_PENDING, - &state); - WARN_ON(err); - } - - if (state) - value |= (1U << i); - - vgic_put_irq(vcpu->kvm, irq); - } - - return value; -} - static int vgic_v3_uaccess_write_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len, unsigned long val) @@ -666,7 +630,7 @@ static const struct vgic_register_region vgic_v3_dist_registers[] = { VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ISPENDR, vgic_mmio_read_pending, vgic_mmio_write_spending, - vgic_v3_uaccess_read_pending, vgic_v3_uaccess_write_pending, 1, + vgic_uaccess_read_pending, vgic_v3_uaccess_write_pending, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ICPENDR, vgic_mmio_read_pending, vgic_mmio_write_cpending, @@ -750,7 +714,7 @@ static const struct vgic_register_region vgic_v3_rd_registers[] = { VGIC_ACCESS_32bit), REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ISPENDR0, vgic_mmio_read_pending, vgic_mmio_write_spending, - vgic_v3_uaccess_read_pending, vgic_v3_uaccess_write_pending, 4, + vgic_uaccess_read_pending, vgic_v3_uaccess_write_pending, 4, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ICPENDR0, vgic_mmio_read_pending, vgic_mmio_write_cpending, diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c index dc8c52487e47..997d0fce2088 100644 --- a/arch/arm64/kvm/vgic/vgic-mmio.c +++ b/arch/arm64/kvm/vgic/vgic-mmio.c @@ -240,6 +240,15 @@ static unsigned long __read_pending(struct kvm_vcpu *vcpu, unsigned long flags; bool val; + /* + * When used from userspace with a GICv3 model: + * + * Pending state of interrupt is latched in pending_latch + * variable. Userspace will save and restore pending state + * and line_level separately. + * Refer to Documentation/virt/kvm/devices/arm-vgic-v3.rst + * for handling of ISPENDR and ICPENDR. + */ raw_spin_lock_irqsave(&irq->irq_lock, flags); if (irq->hw && vgic_irq_is_sgi(irq->intid)) { int err; @@ -252,7 +261,17 @@ static unsigned long __read_pending(struct kvm_vcpu *vcpu, } else if (!is_user && vgic_irq_is_mapped_level(irq)) { val = vgic_get_phys_line_level(irq); } else { - val = irq_is_pending(irq); + switch (vcpu->kvm->arch.vgic.vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V3: + if (is_user) { + val = irq->pending_latch; + break; + } + fallthrough; + default: + val = irq_is_pending(irq); + break; + } } value |= ((u32)val << i); -- cgit From efedd01de475e126e43a07d0b1221bb65e497163 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 7 Jun 2022 14:14:27 +0100 Subject: KVM: arm64: Warn if accessing timer pending state outside of vcpu context A recurrent bug in the KVM/arm64 code base consists in trying to access the timer pending state outside of the vcpu context, which makes zero sense (the pending state only exists when the vcpu is loaded). In order to avoid more embarassing crashes and catch the offenders red-handed, add a warning to kvm_arch_timer_get_input_level() and return the state as non-pending. This avoids taking the system down, and still helps tracking down silly bugs. Reviewed-by: Eric Auger Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220607131427.1164881-4-maz@kernel.org --- arch/arm64/kvm/arch_timer.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c index 4e39ace073af..3b8d062e30ea 100644 --- a/arch/arm64/kvm/arch_timer.c +++ b/arch/arm64/kvm/arch_timer.c @@ -1230,6 +1230,9 @@ bool kvm_arch_timer_get_input_level(int vintid) struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); struct arch_timer_context *timer; + if (WARN(!vcpu, "No vcpu context!\n")) + return false; + if (vintid == vcpu_vtimer(vcpu)->irq.irq) timer = vcpu_vtimer(vcpu); else if (vintid == vcpu_ptimer(vcpu)->irq.irq) -- cgit From 49c3ca34f7dbe5227c0163cba4deb5d29e145fae Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Wed, 8 Jun 2022 00:18:40 +0900 Subject: scripts/nsdeps: adjust to the format change of *.mod files Commit 22f26f21774f ("kbuild: get rid of duplication in *.mod files") changed the format of *.mod files to put one object per line, but missed to adjust scripts/nsdeps. Fixes: 22f26f21774f ("kbuild: get rid of duplication in *.mod files") Signed-off-by: Masahiro Yamada --- scripts/nsdeps | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/scripts/nsdeps b/scripts/nsdeps index 04c4b96e95ec..f1718cc0d700 100644 --- a/scripts/nsdeps +++ b/scripts/nsdeps @@ -34,9 +34,8 @@ generate_deps() { local mod=${1%.ko:} shift local namespaces="$*" - local mod_source_files="`cat $mod.mod | sed -n 1p \ - | sed -e 's/\.o/\.c/g' \ - | sed "s|[^ ]* *|${src_prefix}&|g"`" + local mod_source_files=$(sed "s|^\(.*\)\.o$|${src_prefix}\1.c|" $mod.mod) + for ns in $namespaces; do echo "Adding namespace $ns to module $mod.ko." generate_deps_for_ns $ns "$mod_source_files" -- cgit From 228432551bd8783211e494ab35f42a4344580502 Mon Sep 17 00:00:00 2001 From: Jason Wang Date: Wed, 8 Jun 2022 14:14:22 +0800 Subject: virtio-rng: make device ready before making request Current virtio-rng does a entropy request before DRIVER_OK, this violates the spec: virtio spec requires that all drivers set DRIVER_OK before using devices. Further, kernel will ignore the interrupt after commit 8b4ec69d7e09 ("virtio: harden vring IRQ"). Fixing this by making device ready before the request. Cc: stable@vger.kernel.org Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ") Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.") Reported-and-tested-by: syzbot+5b59d6d459306a556f54@syzkaller.appspotmail.com Signed-off-by: Jason Wang Message-Id: <20220608061422.38437-1-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin Reviewed-by: Laurent Vivier --- drivers/char/hw_random/virtio-rng.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c index e856df7e285c..a6f3a8a2aca6 100644 --- a/drivers/char/hw_random/virtio-rng.c +++ b/drivers/char/hw_random/virtio-rng.c @@ -159,6 +159,8 @@ static int probe_common(struct virtio_device *vdev) goto err_find; } + virtio_device_ready(vdev); + /* we always have a pending entropy request */ request_entropy(vi); -- cgit From 2f72b2262d317093596c72bd5b27b9880be7611e Mon Sep 17 00:00:00 2001 From: Xiang wangx Date: Sat, 4 Jun 2022 22:38:58 +0800 Subject: vdpa/mlx5: Fix syntax errors in comments Delete the redundant word 'is'. Signed-off-by: Xiang wangx Message-Id: <20220604143858.16073-1-wangxiang@cdjrlc.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b7a955479156..b878c1095530 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -107,7 +107,7 @@ struct mlx5_vdpa_virtqueue { /* Resources for implementing the notification channel from the device * to the driver. fwqp is the firmware end of an RC connection; the - * other end is vqqp used by the driver. cq is is where completions are + * other end is vqqp used by the driver. cq is where completions are * reported. */ struct mlx5_vdpa_cq cq; -- cgit From a58a7f97ba11391d2d0d408e0b24f38d86ae748e Mon Sep 17 00:00:00 2001 From: chengkaitao Date: Thu, 2 Jun 2022 08:55:42 +0800 Subject: virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed The reference must be released when device_register(&vm_cmdline_parent) failed. Add the corresponding 'put_device()' in the error handling path. Signed-off-by: chengkaitao Message-Id: <20220602005542.16489-1-chengkaitao@didiglobal.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang --- drivers/virtio/virtio_mmio.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c index f9a36bc7ac27..5ce79bf9f92b 100644 --- a/drivers/virtio/virtio_mmio.c +++ b/drivers/virtio/virtio_mmio.c @@ -701,6 +701,7 @@ static int vm_cmdline_set(const char *device, if (!vm_cmdline_parent_registered) { err = device_register(&vm_cmdline_parent); if (err) { + put_device(&vm_cmdline_parent); pr_err("Failed to register parent device!\n"); return err; } -- cgit From f766c409fcb33cfd0f511e8251831520e089eb89 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Tue, 7 Jun 2022 09:49:25 +0300 Subject: vdpa/mlx5: fix error code for deleting vlan Return success if we were able to delete a vlan. The current code always returns failure. Fixes: baf2ad3f6a98 ("vdpa/mlx5: Add RX MAC VLAN filter support") Signed-off-by: Dan Carpenter Message-Id: Signed-off-by: Michael S. Tsirkin Acked-by: Eli Cohen Acked-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b878c1095530..51fb15c35e42 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1814,6 +1814,7 @@ static virtio_net_ctrl_ack handle_ctrl_vlan(struct mlx5_vdpa_dev *mvdev, u8 cmd) id = mlx5vdpa16_to_cpu(mvdev, vlan); mac_vlan_del(ndev, ndev->config.mac, id, true); + status = VIRTIO_NET_OK; break; default: break; -- cgit From f38b3c6a788f75da151b46c7da61ff26649e1843 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Tue, 7 Jun 2022 09:50:09 +0300 Subject: vdpa/mlx5: clean up indenting in handle_ctrl_vlan() These lines were supposed to be indented. Signed-off-by: Dan Carpenter Message-Id: Signed-off-by: Michael S. Tsirkin Acked-by: Eli Cohen Acked-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 51fb15c35e42..1b6d46b86f81 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1817,10 +1817,10 @@ static virtio_net_ctrl_ack handle_ctrl_vlan(struct mlx5_vdpa_dev *mvdev, u8 cmd) status = VIRTIO_NET_OK; break; default: - break; -} + break; + } -return status; + return status; } static void mlx5_cvq_kick_handler(struct work_struct *work) -- cgit From dbd29e0752286af74243cf891accf472b2f3edd8 Mon Sep 17 00:00:00 2001 From: Xie Yongji Date: Thu, 5 May 2022 18:09:10 +0800 Subject: vringh: Fix loop descriptors check in the indirect cases We should use size of descriptor chain to test loop condition in the indirect case. And another statistical count is also introduced for indirect descriptors to avoid conflict with the statistical count of direct descriptors. Fixes: f87d0fbb5798 ("vringh: host-side implementation of virtio rings.") Signed-off-by: Xie Yongji Signed-off-by: Fam Zheng Message-Id: <20220505100910.137-1-xieyongji@bytedance.com> Signed-off-by: Michael S. Tsirkin Acked-by: Jason Wang --- drivers/vhost/vringh.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c index 14e2043d7685..eab55accf381 100644 --- a/drivers/vhost/vringh.c +++ b/drivers/vhost/vringh.c @@ -292,7 +292,7 @@ __vringh_iov(struct vringh *vrh, u16 i, int (*copy)(const struct vringh *vrh, void *dst, const void *src, size_t len)) { - int err, count = 0, up_next, desc_max; + int err, count = 0, indirect_count = 0, up_next, desc_max; struct vring_desc desc, *descs; struct vringh_range range = { -1ULL, 0 }, slowrange; bool slow = false; @@ -349,7 +349,12 @@ __vringh_iov(struct vringh *vrh, u16 i, continue; } - if (count++ == vrh->vring.num) { + if (up_next == -1) + count++; + else + indirect_count++; + + if (count > vrh->vring.num || indirect_count > desc_max) { vringh_bad("Descriptor loop in %p", descs); err = -ELOOP; goto fail; @@ -411,6 +416,7 @@ __vringh_iov(struct vringh *vrh, u16 i, i = return_from_indirect(vrh, &up_next, &descs, &desc_max); slow = false; + indirect_count = 0; } else break; } -- cgit From b27ee76c74dc831d6e092eaebc2dfc9c0beed1c9 Mon Sep 17 00:00:00 2001 From: Xie Yongji Date: Tue, 26 Apr 2022 15:36:56 +0800 Subject: vduse: Fix NULL pointer dereference on sysfs access The control device has no drvdata. So we will get a NULL pointer dereference when accessing control device's msg_timeout attribute via sysfs: [ 132.841881][ T3644] BUG: kernel NULL pointer dereference, address: 00000000000000f8 [ 132.850619][ T3644] RIP: 0010:msg_timeout_show (drivers/vdpa/vdpa_user/vduse_dev.c:1271) [ 132.869447][ T3644] dev_attr_show (drivers/base/core.c:2094) [ 132.870215][ T3644] sysfs_kf_seq_show (fs/sysfs/file.c:59) [ 132.871164][ T3644] ? device_remove_bin_file (drivers/base/core.c:2088) [ 132.872082][ T3644] kernfs_seq_show (fs/kernfs/file.c:164) [ 132.872838][ T3644] seq_read_iter (fs/seq_file.c:230) [ 132.873578][ T3644] ? __vmalloc_area_node (mm/vmalloc.c:3041) [ 132.874532][ T3644] kernfs_fop_read_iter (fs/kernfs/file.c:238) [ 132.875513][ T3644] __kernel_read (fs/read_write.c:440 (discriminator 1)) [ 132.876319][ T3644] kernel_read (fs/read_write.c:459) [ 132.877129][ T3644] kernel_read_file (fs/kernel_read_file.c:94) [ 132.877978][ T3644] kernel_read_file_from_fd (include/linux/file.h:45 fs/kernel_read_file.c:186) [ 132.879019][ T3644] __do_sys_finit_module (kernel/module.c:4207) [ 132.879930][ T3644] __ia32_sys_finit_module (kernel/module.c:4189) [ 132.880930][ T3644] do_int80_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:132) [ 132.881847][ T3644] entry_INT80_compat (arch/x86/entry/entry_64_compat.S:419) To fix it, don't create the unneeded attribute for control device anymore. Fixes: c8a6153b6c59 ("vduse: Introduce VDUSE - vDPA Device in Userspace") Reported-by: kernel test robot Cc: stable@vger.kernel.org Signed-off-by: Xie Yongji Message-Id: <20220426073656.229-1-xieyongji@bytedance.com> Signed-off-by: Michael S. Tsirkin --- drivers/vdpa/vdpa_user/vduse_dev.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index d503848b3b6e..776ad7496f53 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -1345,9 +1345,9 @@ static int vduse_create_dev(struct vduse_dev_config *config, dev->minor = ret; dev->msg_timeout = VDUSE_MSG_DEFAULT_TIMEOUT; - dev->dev = device_create(vduse_class, NULL, - MKDEV(MAJOR(vduse_major), dev->minor), - dev, "%s", config->name); + dev->dev = device_create_with_groups(vduse_class, NULL, + MKDEV(MAJOR(vduse_major), dev->minor), + dev, vduse_dev_groups, "%s", config->name); if (IS_ERR(dev->dev)) { ret = PTR_ERR(dev->dev); goto err_dev; @@ -1596,7 +1596,6 @@ static int vduse_init(void) return PTR_ERR(vduse_class); vduse_class->devnode = vduse_devnode; - vduse_class->dev_groups = vduse_dev_groups; ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse"); if (ret) -- cgit From 6c254bf3b637dd4ef4f78eb78c7447419c0161d7 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 7 Jun 2022 16:47:52 -0400 Subject: SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer() I found that NFSD's new NFSv3 READDIRPLUS XDR encoder was screwing up right at the end of the page array. xdr_get_next_encode_buffer() does not compute the value of xdr->end correctly: * The check to see if we're on the final available page in xdr->buf needs to account for the space consumed by @nbytes. * The new xdr->end value needs to account for the portion of @nbytes that is to be encoded into the previous buffer. Fixes: 2825a7f90753 ("nfsd4: allow encoding across page boundaries") Signed-off-by: Chuck Lever Reviewed-by: NeilBrown Reviewed-by: J. Bruce Fields --- net/sunrpc/xdr.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index df194cc07035..b57cf9df4de8 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -979,7 +979,11 @@ static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, */ xdr->p = (void *)p + frag2bytes; space_left = xdr->buf->buflen - xdr->buf->len; - xdr->end = (void *)p + min_t(int, space_left, PAGE_SIZE); + if (space_left - nbytes >= PAGE_SIZE) + xdr->end = (void *)p + PAGE_SIZE; + else + xdr->end = (void *)p + space_left - frag1bytes; + xdr->buf->page_len += frag2bytes; xdr->buf->len += nbytes; return p; -- cgit From 62ed448cc53b654036f7d7f3c99f299d79ad14c3 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 7 Jun 2022 16:47:58 -0400 Subject: SUNRPC: Optimize xdr_reserve_space() Transitioning between encode buffers is quite infrequent. It happens about 1 time in 400 calls to xdr_reserve_space(), measured on NFSD with a typical build/test workload. Force the compiler to remove that code from xdr_reserve_space(), which is a hot path on both the server and the client. This change reduces the size of xdr_reserve_space() from 10 cache lines to 2 when compiled with -Os. Signed-off-by: Chuck Lever Reviewed-by: J. Bruce Fields --- include/linux/sunrpc/xdr.h | 16 +++++++++++++++- net/sunrpc/xdr.c | 17 ++++++++++------- 2 files changed, 25 insertions(+), 8 deletions(-) diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h index 4417f667c757..5860f32e3958 100644 --- a/include/linux/sunrpc/xdr.h +++ b/include/linux/sunrpc/xdr.h @@ -243,7 +243,7 @@ extern void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, extern __be32 *xdr_reserve_space(struct xdr_stream *xdr, size_t nbytes); extern int xdr_reserve_space_vec(struct xdr_stream *xdr, struct kvec *vec, size_t nbytes); -extern void xdr_commit_encode(struct xdr_stream *xdr); +extern void __xdr_commit_encode(struct xdr_stream *xdr); extern void xdr_truncate_encode(struct xdr_stream *xdr, size_t len); extern int xdr_restrict_buflen(struct xdr_stream *xdr, int newbuflen); extern void xdr_write_pages(struct xdr_stream *xdr, struct page **pages, @@ -306,6 +306,20 @@ xdr_reset_scratch_buffer(struct xdr_stream *xdr) xdr_set_scratch_buffer(xdr, NULL, 0); } +/** + * xdr_commit_encode - Ensure all data is written to xdr->buf + * @xdr: pointer to xdr_stream + * + * Handle encoding across page boundaries by giving the caller a + * temporary location to write to, then later copying the data into + * place. __xdr_commit_encode() does that copying. + */ +static inline void xdr_commit_encode(struct xdr_stream *xdr) +{ + if (unlikely(xdr->scratch.iov_len)) + __xdr_commit_encode(xdr); +} + /** * xdr_stream_remaining - Return the number of bytes remaining in the stream * @xdr: pointer to struct xdr_stream diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index b57cf9df4de8..1ad8b4ef14de 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -919,7 +919,7 @@ void xdr_init_encode(struct xdr_stream *xdr, struct xdr_buf *buf, __be32 *p, EXPORT_SYMBOL_GPL(xdr_init_encode); /** - * xdr_commit_encode - Ensure all data is written to buffer + * __xdr_commit_encode - Ensure all data is written to buffer * @xdr: pointer to xdr_stream * * We handle encoding across page boundaries by giving the caller a @@ -931,22 +931,25 @@ EXPORT_SYMBOL_GPL(xdr_init_encode); * required at the end of encoding, or any other time when the xdr_buf * data might be read. */ -inline void xdr_commit_encode(struct xdr_stream *xdr) +void __xdr_commit_encode(struct xdr_stream *xdr) { int shift = xdr->scratch.iov_len; void *page; - if (shift == 0) - return; page = page_address(*xdr->page_ptr); memcpy(xdr->scratch.iov_base, page, shift); memmove(page, page + shift, (void *)xdr->p - page); xdr_reset_scratch_buffer(xdr); } -EXPORT_SYMBOL_GPL(xdr_commit_encode); +EXPORT_SYMBOL_GPL(__xdr_commit_encode); -static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, - size_t nbytes) +/* + * The buffer space to be reserved crosses the boundary between + * xdr->buf->head and xdr->buf->pages, or between two pages + * in xdr->buf->pages. + */ +static noinline __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, + size_t nbytes) { __be32 *p; int space_left; -- cgit From 90d871b3b9bb7ef8f835d6b53095f01b9c74b7b3 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 7 Jun 2022 16:48:05 -0400 Subject: SUNRPC: Clean up xdr_commit_encode() Both the kvec::iov_len field and the third parameter of memcpy() and memmove() are size_t. There's no reason for the implicit conversion from size_t to int and back. Change the type of @shift to make the code easier to read and understand. Signed-off-by: Chuck Lever Reviewed-by: NeilBrown Reviewed-by: J. Bruce Fields --- net/sunrpc/xdr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index 1ad8b4ef14de..3c182041e790 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -933,7 +933,7 @@ EXPORT_SYMBOL_GPL(xdr_init_encode); */ void __xdr_commit_encode(struct xdr_stream *xdr) { - int shift = xdr->scratch.iov_len; + size_t shift = xdr->scratch.iov_len; void *page; page = page_address(*xdr->page_ptr); -- cgit From bd07a64176a2be03f5195c64943063fd119f9f21 Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 7 Jun 2022 16:48:11 -0400 Subject: SUNRPC: Clean up xdr_get_next_encode_buffer() The value of @p is not used until the "location of the next item" is computed. Help human readers by moving its initial assignment to the paragraph where that value is used and by clarifying the antecedents in the documenting comment. Signed-off-by: Chuck Lever Reviewed-by: NeilBrown Reviewed-by: J. Bruce Fields --- net/sunrpc/xdr.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index 3c182041e790..eca02d122476 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -967,6 +967,7 @@ static noinline __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, xdr->buf->page_len += frag1bytes; xdr->page_ptr++; xdr->iov = NULL; + /* * If the last encode didn't end exactly on a page boundary, the * next one will straddle boundaries. Encode into the next @@ -975,11 +976,12 @@ static noinline __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, * space at the end of the previous buffer: */ xdr_set_scratch_buffer(xdr, xdr->p, frag1bytes); - p = page_address(*xdr->page_ptr); + /* - * Note this is where the next encode will start after we've - * shifted this one back: + * xdr->p is where the next encode will start after + * xdr_commit_encode() has shifted this one back: */ + p = page_address(*xdr->page_ptr); xdr->p = (void *)p + frag2bytes; space_left = xdr->buf->buflen - xdr->buf->len; if (space_left - nbytes >= PAGE_SIZE) -- cgit From da9e94fe000e11f21d3d6f66012fe5c6379bd93c Mon Sep 17 00:00:00 2001 From: Chuck Lever Date: Tue, 7 Jun 2022 16:48:18 -0400 Subject: SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer() To make the code easier to read, remove visual clutter by changing the declared type of @p. Signed-off-by: Chuck Lever Reviewed-by: NeilBrown Reviewed-by: J. Bruce Fields --- net/sunrpc/xdr.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c index eca02d122476..f87a2d8f23a7 100644 --- a/net/sunrpc/xdr.c +++ b/net/sunrpc/xdr.c @@ -951,9 +951,9 @@ EXPORT_SYMBOL_GPL(__xdr_commit_encode); static noinline __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, size_t nbytes) { - __be32 *p; int space_left; int frag1bytes, frag2bytes; + void *p; if (nbytes > PAGE_SIZE) goto out_overflow; /* Bigger buffers require special handling */ @@ -982,12 +982,12 @@ static noinline __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr, * xdr_commit_encode() has shifted this one back: */ p = page_address(*xdr->page_ptr); - xdr->p = (void *)p + frag2bytes; + xdr->p = p + frag2bytes; space_left = xdr->buf->buflen - xdr->buf->len; if (space_left - nbytes >= PAGE_SIZE) - xdr->end = (void *)p + PAGE_SIZE; + xdr->end = p + PAGE_SIZE; else - xdr->end = (void *)p + space_left - frag1bytes; + xdr->end = p + space_left - frag1bytes; xdr->buf->page_len += frag2bytes; xdr->buf->len += nbytes; -- cgit From 29dec90a0f1d961b93f34f910e9319d8cb23edbd Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 8 Jun 2022 08:34:06 +0200 Subject: dm: fix bio_set allocation The use of bioset_init_from_src mean that the pre-allocated pools weren't used for anything except parameter passing, and the integrity pool creation got completely lost for the actual live mapped_device. Fix that by assigning the actual preallocated dm_md_mempools to the mapped_device and using that for I/O instead of creating new mempools. Fixes: 2a2a4c510b76 ("dm: use bioset_init_from_src() to copy bio_set") Signed-off-by: Christoph Hellwig Signed-off-by: Mike Snitzer --- drivers/md/dm-core.h | 11 +++++-- drivers/md/dm-rq.c | 2 +- drivers/md/dm-table.c | 11 ------- drivers/md/dm.c | 84 +++++++++++++++------------------------------------ drivers/md/dm.h | 2 -- 5 files changed, 35 insertions(+), 75 deletions(-) diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h index d21648a923ea..54c0473a51dd 100644 --- a/drivers/md/dm-core.h +++ b/drivers/md/dm-core.h @@ -33,6 +33,14 @@ struct dm_kobject_holder { * access their members! */ +/* + * For mempools pre-allocation at the table loading time. + */ +struct dm_md_mempools { + struct bio_set bs; + struct bio_set io_bs; +}; + struct mapped_device { struct mutex suspend_lock; @@ -110,8 +118,7 @@ struct mapped_device { /* * io objects are allocated from here. */ - struct bio_set io_bs; - struct bio_set bs; + struct dm_md_mempools *mempools; /* kobject and completion */ struct dm_kobject_holder kobj_holder; diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 6087cdcaad46..a83b98a8d2a9 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -319,7 +319,7 @@ static int setup_clone(struct request *clone, struct request *rq, { int r; - r = blk_rq_prep_clone(clone, rq, &tio->md->bs, gfp_mask, + r = blk_rq_prep_clone(clone, rq, &tio->md->mempools->bs, gfp_mask, dm_rq_bio_constructor, tio); if (r) return r; diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 0e833a154b31..bd539afbfe88 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -1038,17 +1038,6 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device * return 0; } -void dm_table_free_md_mempools(struct dm_table *t) -{ - dm_free_md_mempools(t->mempools); - t->mempools = NULL; -} - -struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t) -{ - return t->mempools; -} - static int setup_indexes(struct dm_table *t) { int i; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index dfb0a551bd88..8b21155d3c4f 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -136,14 +136,6 @@ static int get_swap_bios(void) return latch; } -/* - * For mempools pre-allocation at the table loading time. - */ -struct dm_md_mempools { - struct bio_set bs; - struct bio_set io_bs; -}; - struct table_device { struct list_head list; refcount_t count; @@ -581,7 +573,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio) struct dm_target_io *tio; struct bio *clone; - clone = bio_alloc_clone(NULL, bio, GFP_NOIO, &md->io_bs); + clone = bio_alloc_clone(NULL, bio, GFP_NOIO, &md->mempools->io_bs); /* Set default bdev, but target must bio_set_dev() before issuing IO */ clone->bi_bdev = md->disk->part0; @@ -628,7 +620,8 @@ static struct bio *alloc_tio(struct clone_info *ci, struct dm_target *ti, } else { struct mapped_device *md = ci->io->md; - clone = bio_alloc_clone(NULL, ci->bio, gfp_mask, &md->bs); + clone = bio_alloc_clone(NULL, ci->bio, gfp_mask, + &md->mempools->bs); if (!clone) return NULL; /* Set default bdev, but target must bio_set_dev() before issuing IO */ @@ -1876,8 +1869,7 @@ static void cleanup_mapped_device(struct mapped_device *md) { if (md->wq) destroy_workqueue(md->wq); - bioset_exit(&md->bs); - bioset_exit(&md->io_bs); + dm_free_md_mempools(md->mempools); if (md->dax_dev) { dax_remove_host(md->disk); @@ -2049,48 +2041,6 @@ static void free_dev(struct mapped_device *md) kvfree(md); } -static int __bind_mempools(struct mapped_device *md, struct dm_table *t) -{ - struct dm_md_mempools *p = dm_table_get_md_mempools(t); - int ret = 0; - - if (dm_table_bio_based(t)) { - /* - * The md may already have mempools that need changing. - * If so, reload bioset because front_pad may have changed - * because a different table was loaded. - */ - bioset_exit(&md->bs); - bioset_exit(&md->io_bs); - - } else if (bioset_initialized(&md->bs)) { - /* - * There's no need to reload with request-based dm - * because the size of front_pad doesn't change. - * Note for future: If you are to reload bioset, - * prep-ed requests in the queue may refer - * to bio from the old bioset, so you must walk - * through the queue to unprep. - */ - goto out; - } - - BUG_ON(!p || - bioset_initialized(&md->bs) || - bioset_initialized(&md->io_bs)); - - ret = bioset_init_from_src(&md->bs, &p->bs); - if (ret) - goto out; - ret = bioset_init_from_src(&md->io_bs, &p->io_bs); - if (ret) - bioset_exit(&md->bs); -out: - /* mempool bind completed, no longer need any mempools in the table */ - dm_table_free_md_mempools(t); - return ret; -} - /* * Bind a table to the device. */ @@ -2144,12 +2094,28 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t, * immutable singletons - used to optimize dm_mq_queue_rq. */ md->immutable_target = dm_table_get_immutable_target(t); - } - ret = __bind_mempools(md, t); - if (ret) { - old_map = ERR_PTR(ret); - goto out; + /* + * There is no need to reload with request-based dm because the + * size of front_pad doesn't change. + * + * Note for future: If you are to reload bioset, prep-ed + * requests in the queue may refer to bio from the old bioset, + * so you must walk through the queue to unprep. + */ + if (!md->mempools) { + md->mempools = t->mempools; + t->mempools = NULL; + } + } else { + /* + * The md may already have mempools that need changing. + * If so, reload bioset because front_pad may have changed + * because a different table was loaded. + */ + dm_free_md_mempools(md->mempools); + md->mempools = t->mempools; + t->mempools = NULL; } ret = dm_table_set_restrictions(t, md->queue, limits); diff --git a/drivers/md/dm.h b/drivers/md/dm.h index 3f89664fea01..a8405ce305a9 100644 --- a/drivers/md/dm.h +++ b/drivers/md/dm.h @@ -71,8 +71,6 @@ struct dm_target *dm_table_get_immutable_target(struct dm_table *t); struct dm_target *dm_table_get_wildcard_target(struct dm_table *t); bool dm_table_bio_based(struct dm_table *t); bool dm_table_request_based(struct dm_table *t); -void dm_table_free_md_mempools(struct dm_table *t); -struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t); void dm_lock_md_type(struct mapped_device *md); void dm_unlock_md_type(struct mapped_device *md); -- cgit From d5a37b19983725d2045588cfa3a4699f5b39ae26 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 8 Jun 2022 08:34:07 +0200 Subject: block: remove bioset_init_from_src Unused now, and the interface never really made a whole lot of sense to start with. Signed-off-by: Christoph Hellwig Signed-off-by: Mike Snitzer --- block/bio.c | 20 -------------------- include/linux/bio.h | 1 - 2 files changed, 21 deletions(-) diff --git a/block/bio.c b/block/bio.c index f92d0223247b..51c99f2c5c90 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1747,26 +1747,6 @@ bad: } EXPORT_SYMBOL(bioset_init); -/* - * Initialize and setup a new bio_set, based on the settings from - * another bio_set. - */ -int bioset_init_from_src(struct bio_set *bs, struct bio_set *src) -{ - int flags; - - flags = 0; - if (src->bvec_pool.min_nr) - flags |= BIOSET_NEED_BVECS; - if (src->rescue_workqueue) - flags |= BIOSET_NEED_RESCUER; - if (src->cache) - flags |= BIOSET_PERCPU_CACHE; - - return bioset_init(bs, src->bio_pool.min_nr, src->front_pad, flags); -} -EXPORT_SYMBOL(bioset_init_from_src); - static int __init init_bio(void) { int i; diff --git a/include/linux/bio.h b/include/linux/bio.h index 1cf3738ef1ea..992ee987f273 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -403,7 +403,6 @@ enum { extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags); extern void bioset_exit(struct bio_set *); extern int biovec_init_pool(mempool_t *pool, int pool_entries); -extern int bioset_init_from_src(struct bio_set *bs, struct bio_set *src); struct bio *bio_alloc_bioset(struct block_device *bdev, unsigned short nr_vecs, unsigned int opf, gfp_t gfp_mask, -- cgit From ea6c1213217dec65a8f9f396752b4d8bbcf226ea Mon Sep 17 00:00:00 2001 From: Julia Lawall Date: Thu, 9 Jun 2022 09:18:15 +0530 Subject: RISC-V: KVM: fix typos in comments Various spelling mistakes in comments. Detected with the help of Coccinelle. Signed-off-by: Julia Lawall Signed-off-by: Anup Patel --- arch/riscv/kvm/vmid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/riscv/kvm/vmid.c b/arch/riscv/kvm/vmid.c index 9f764df125db..6cd93995fb65 100644 --- a/arch/riscv/kvm/vmid.c +++ b/arch/riscv/kvm/vmid.c @@ -97,7 +97,7 @@ void kvm_riscv_gstage_vmid_update(struct kvm_vcpu *vcpu) * We ran out of VMIDs so we increment vmid_version and * start assigning VMIDs from 1. * - * This also means existing VMIDs assignement to all Guest + * This also means existing VMIDs assignment to all Guest * instances is invalid and we have force VMID re-assignement * for all Guest instances. The Guest instances that were not * running will automatically pick-up new VMIDs because will -- cgit From 1a12b25274b9e54b0d2d59e21620f8cf13b268cb Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Thu, 9 Jun 2022 09:18:22 +0530 Subject: MAINTAINERS: Limit KVM RISC-V entry to existing selftests Commit fed9b26b2501 ("MAINTAINERS: Update KVM RISC-V entry to cover selftests support") optimistically adds a file entry for tools/testing/selftests/kvm/riscv/, but this directory does not exist. Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a broken reference. The script is very useful to keep MAINTAINERS up to date and MAINTAINERS can be kept in a state where the script emits no warning. So, just drop the non-matching file entry rather than starting to collect exceptions of entries that may match in some close or distant future. Fixes: fed9b26b2501 ("MAINTAINERS: Update KVM RISC-V entry to cover selftests support") Signed-off-by: Lukas Bulwahn Signed-off-by: Anup Patel --- MAINTAINERS | 1 - 1 file changed, 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index a6d3bd9d2a8d..e549a84e21c8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10863,7 +10863,6 @@ F: arch/riscv/include/asm/kvm* F: arch/riscv/include/uapi/asm/kvm* F: arch/riscv/kvm/ F: tools/testing/selftests/kvm/*/riscv/ -F: tools/testing/selftests/kvm/riscv/ KERNEL VIRTUAL MACHINE for s390 (KVM/s390) M: Christian Borntraeger -- cgit From acb0055e187334554398a381a16de72f1a3d47bb Mon Sep 17 00:00:00 2001 From: Bo Liu Date: Wed, 8 Jun 2022 23:11:06 -0400 Subject: virtio: Fix all occurences of the "the the" typo There are double "the" in message in file virtio_mmio.c and virtio_pci_modern_dev.c, fix it. Signed-off-by: Bo Liu Message-Id: <20220609031106.2161-1-liubo03@inspur.com> Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_mmio.c | 2 +- drivers/virtio/virtio_pci_modern_dev.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c index 5ce79bf9f92b..c9bec3813e94 100644 --- a/drivers/virtio/virtio_mmio.c +++ b/drivers/virtio/virtio_mmio.c @@ -255,7 +255,7 @@ static void vm_set_status(struct virtio_device *vdev, u8 status) /* * Per memory-barriers.txt, wmb() is not needed to guarantee - * that the the cache coherent memory writes have completed + * that the cache coherent memory writes have completed * before writing to the MMIO region. */ writel(status, vm_dev->base + VIRTIO_MMIO_STATUS); diff --git a/drivers/virtio/virtio_pci_modern_dev.c b/drivers/virtio/virtio_pci_modern_dev.c index a0fa14f28a7f..b790f30b2b56 100644 --- a/drivers/virtio/virtio_pci_modern_dev.c +++ b/drivers/virtio/virtio_pci_modern_dev.c @@ -469,7 +469,7 @@ void vp_modern_set_status(struct virtio_pci_modern_device *mdev, /* * Per memory-barriers.txt, wmb() is not needed to guarantee - * that the the cache coherent memory writes have completed + * that the cache coherent memory writes have completed * before writing to the MMIO region. */ vp_iowrite8(status, &cfg->device_status); -- cgit From 00d1f546470d89e072dd3cda12b5c794341e7268 Mon Sep 17 00:00:00 2001 From: Jason Wang Date: Thu, 9 Jun 2022 12:19:01 +0800 Subject: vdpa: make get_vq_group and set_group_asid optional This patch makes get_vq_group and set_group_asid optional. This is needed to unbreak the vDPA parent that doesn't support multiple address spaces. Cc: Gautam Dawar Fixes: aaca8373c4b1 ("vhost-vdpa: support ASID based IOTLB API") Signed-off-by: Jason Wang Message-Id: <20220609041901.2029-1-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin --- drivers/vhost/vdpa.c | 2 ++ include/linux/vdpa.h | 5 +++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 935a1d0ddb97..5ad2596c6e8a 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -499,6 +499,8 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd, ops->set_vq_ready(vdpa, idx, s.num); return 0; case VHOST_VDPA_GET_VRING_GROUP: + if (!ops->get_vq_group) + return -EOPNOTSUPP; s.index = idx; s.num = ops->get_vq_group(vdpa, idx); if (s.num >= vdpa->ngroups) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index 4700a88a28f6..7b4a13d3bd91 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -178,7 +178,8 @@ struct vdpa_map_file { * for the device * @vdev: vdpa device * Returns virtqueue algin requirement - * @get_vq_group: Get the group id for a specific virtqueue + * @get_vq_group: Get the group id for a specific + * virtqueue (optional) * @vdev: vdpa device * @idx: virtqueue index * Returns u32: group id for this virtqueue @@ -243,7 +244,7 @@ struct vdpa_map_file { * Returns the iova range supported by * the device. * @set_group_asid: Set address space identifier for a - * virtqueue group + * virtqueue group (optional) * @vdev: vdpa device * @group: virtqueue group * @asid: address space id for this group -- cgit From ae187fec75aa670a551d9662f83e3947d3f02a69 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:18 +0100 Subject: KVM: arm64: Return error from kvm_arch_init_vm() on allocation failure If we fail to allocate the 'supported_cpus' cpumask in kvm_arch_init_vm() then be sure to return -ENOMEM instead of success (0) on the failure path. Reviewed-by: Alexandru Elisei Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-2-will@kernel.org --- arch/arm64/kvm/arm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 400bb0fe2745..0da0f06037db 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -150,8 +150,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) if (ret) goto out_free_stage2_pgd; - if (!zalloc_cpumask_var(&kvm->arch.supported_cpus, GFP_KERNEL)) + if (!zalloc_cpumask_var(&kvm->arch.supported_cpus, GFP_KERNEL)) { + ret = -ENOMEM; goto out_free_stage2_pgd; + } cpumask_copy(kvm->arch.supported_cpus, cpu_possible_mask); kvm_vgic_early_init(kvm); -- cgit From fa7a17214488ef7df347dcd1a5594f69ea17f4dc Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Thu, 9 Jun 2022 13:12:19 +0100 Subject: KVM: arm64: Handle all ID registers trapped for a protected VM A protected VM accessing ID_AA64ISAR2_EL1 gets punished with an UNDEF, while it really should only get a zero back if the register is not handled by the hypervisor emulation (as mandated by the architecture). Introduce all the missing ID registers (including the unallocated ones), and have them to return 0. Reported-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-3-will@kernel.org --- arch/arm64/kvm/hyp/nvhe/sys_regs.c | 42 ++++++++++++++++++++++++++++++-------- 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/sys_regs.c b/arch/arm64/kvm/hyp/nvhe/sys_regs.c index b6d86e423319..35a4331ba5f3 100644 --- a/arch/arm64/kvm/hyp/nvhe/sys_regs.c +++ b/arch/arm64/kvm/hyp/nvhe/sys_regs.c @@ -243,15 +243,9 @@ u64 pvm_read_id_reg(const struct kvm_vcpu *vcpu, u32 id) case SYS_ID_AA64MMFR2_EL1: return get_pvm_id_aa64mmfr2(vcpu); default: - /* - * Should never happen because all cases are covered in - * pvm_sys_reg_descs[]. - */ - WARN_ON(1); - break; + /* Unhandled ID register, RAZ */ + return 0; } - - return 0; } static u64 read_id_reg(const struct kvm_vcpu *vcpu, @@ -332,6 +326,16 @@ static bool pvm_gic_read_sre(struct kvm_vcpu *vcpu, /* Mark the specified system register as an AArch64 feature id register. */ #define AARCH64(REG) { SYS_DESC(REG), .access = pvm_access_id_aarch64 } +/* + * sys_reg_desc initialiser for architecturally unallocated cpufeature ID + * register with encoding Op0=3, Op1=0, CRn=0, CRm=crm, Op2=op2 + * (1 <= crm < 8, 0 <= Op2 < 8). + */ +#define ID_UNALLOCATED(crm, op2) { \ + Op0(3), Op1(0), CRn(0), CRm(crm), Op2(op2), \ + .access = pvm_access_id_aarch64, \ +} + /* Mark the specified system register as Read-As-Zero/Write-Ignored */ #define RAZ_WI(REG) { SYS_DESC(REG), .access = pvm_access_raz_wi } @@ -375,24 +379,46 @@ static const struct sys_reg_desc pvm_sys_reg_descs[] = { AARCH32(SYS_MVFR0_EL1), AARCH32(SYS_MVFR1_EL1), AARCH32(SYS_MVFR2_EL1), + ID_UNALLOCATED(3,3), AARCH32(SYS_ID_PFR2_EL1), AARCH32(SYS_ID_DFR1_EL1), AARCH32(SYS_ID_MMFR5_EL1), + ID_UNALLOCATED(3,7), /* AArch64 ID registers */ /* CRm=4 */ AARCH64(SYS_ID_AA64PFR0_EL1), AARCH64(SYS_ID_AA64PFR1_EL1), + ID_UNALLOCATED(4,2), + ID_UNALLOCATED(4,3), AARCH64(SYS_ID_AA64ZFR0_EL1), + ID_UNALLOCATED(4,5), + ID_UNALLOCATED(4,6), + ID_UNALLOCATED(4,7), AARCH64(SYS_ID_AA64DFR0_EL1), AARCH64(SYS_ID_AA64DFR1_EL1), + ID_UNALLOCATED(5,2), + ID_UNALLOCATED(5,3), AARCH64(SYS_ID_AA64AFR0_EL1), AARCH64(SYS_ID_AA64AFR1_EL1), + ID_UNALLOCATED(5,6), + ID_UNALLOCATED(5,7), AARCH64(SYS_ID_AA64ISAR0_EL1), AARCH64(SYS_ID_AA64ISAR1_EL1), + AARCH64(SYS_ID_AA64ISAR2_EL1), + ID_UNALLOCATED(6,3), + ID_UNALLOCATED(6,4), + ID_UNALLOCATED(6,5), + ID_UNALLOCATED(6,6), + ID_UNALLOCATED(6,7), AARCH64(SYS_ID_AA64MMFR0_EL1), AARCH64(SYS_ID_AA64MMFR1_EL1), AARCH64(SYS_ID_AA64MMFR2_EL1), + ID_UNALLOCATED(7,3), + ID_UNALLOCATED(7,4), + ID_UNALLOCATED(7,5), + ID_UNALLOCATED(7,6), + ID_UNALLOCATED(7,7), /* Scalable Vector Registers are restricted. */ -- cgit From cde5042adf11b0a30a6ce0ec3d071afcf8d2efaf Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:20 +0100 Subject: KVM: arm64: Ignore 'kvm-arm.mode=protected' when using VHE Ignore 'kvm-arm.mode=protected' when using VHE so that kvm_get_mode() only returns KVM_MODE_PROTECTED on systems where the feature is available. Cc: David Brazdil Acked-by: Mark Rutland Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-4-will@kernel.org --- Documentation/admin-guide/kernel-parameters.txt | 1 - arch/arm64/kernel/cpufeature.c | 10 +--------- arch/arm64/kvm/arm.c | 6 +++++- 3 files changed, 6 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 8090130b544b..97c16aa2f53f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2469,7 +2469,6 @@ protected: nVHE-based mode with support for guests whose state is kept private from the host. - Not valid if the kernel is running in EL2. Defaults to VHE/nVHE based on hardware support. Setting mode to "protected" will disable kexec and hibernation diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 42ea2bd856c6..79fac13ab2ef 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1974,15 +1974,7 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap) #ifdef CONFIG_KVM static bool is_kvm_protected_mode(const struct arm64_cpu_capabilities *entry, int __unused) { - if (kvm_get_mode() != KVM_MODE_PROTECTED) - return false; - - if (is_kernel_in_hyp_mode()) { - pr_warn("Protected KVM not available with VHE\n"); - return false; - } - - return true; + return kvm_get_mode() == KVM_MODE_PROTECTED; } #endif /* CONFIG_KVM */ diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 0da0f06037db..a0188144a122 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -2273,7 +2273,11 @@ static int __init early_kvm_mode_cfg(char *arg) return -EINVAL; if (strcmp(arg, "protected") == 0) { - kvm_mode = KVM_MODE_PROTECTED; + if (!is_kernel_in_hyp_mode()) + kvm_mode = KVM_MODE_PROTECTED; + else + pr_warn_once("Protected KVM not available with VHE\n"); + return 0; } -- cgit From 112f3bab41113dc53b4f35e9034b2208245bc002 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:21 +0100 Subject: KVM: arm64: Extend comment in has_vhe() has_vhe() expands to a compile-time constant when evaluated from the VHE or nVHE code, alternatively checking a static key when called from elsewhere in the kernel. On face value, this looks like a case of premature optimization, but in fact this allows symbol references on VHE-specific code paths to be dropped from the nVHE object. Expand the comment in has_vhe() to make this clearer, hopefully discouraging anybody from simplifying the code. Cc: David Brazdil Acked-by: Mark Rutland Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-5-will@kernel.org --- arch/arm64/include/asm/virt.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h index 3c8af033a997..0e80db4327b6 100644 --- a/arch/arm64/include/asm/virt.h +++ b/arch/arm64/include/asm/virt.h @@ -113,6 +113,9 @@ static __always_inline bool has_vhe(void) /* * Code only run in VHE/NVHE hyp context can assume VHE is present or * absent. Otherwise fall back to caps. + * This allows the compiler to discard VHE-specific code from the + * nVHE object, reducing the number of external symbol references + * needed to link. */ if (is_vhe_hyp_code()) return true; -- cgit From 5879c97f37022ff22a3f13174c24fcf2807fdbc0 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Thu, 9 Jun 2022 13:12:22 +0100 Subject: KVM: arm64: Remove redundant hyp_assert_lock_held() assertions host_stage2_try() asserts that the KVM host lock is held, so there's no need to duplicate the assertion in its wrappers. Signed-off-by: Will Deacon Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-6-will@kernel.org --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index 78edf077fa3b..1e78acf9662e 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -314,15 +314,11 @@ static int host_stage2_adjust_range(u64 addr, struct kvm_mem_range *range) int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot) { - hyp_assert_lock_held(&host_kvm.lock); - return host_stage2_try(__host_stage2_idmap, addr, addr + size, prot); } int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id) { - hyp_assert_lock_held(&host_kvm.lock); - return host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt, addr, size, &host_s2_pool, owner_id); } -- cgit From bcbfb588cf323929ac46767dd14e392016bbce04 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Thu, 9 Jun 2022 13:12:23 +0100 Subject: KVM: arm64: Drop stale comment The layout of 'struct kvm_vcpu_arch' has evolved significantly since the initial port of KVM/arm64, so remove the stale comment suggesting that a prefix of the structure is used exclusively from assembly code. Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20220609121223.2551-7-will@kernel.org --- arch/arm64/include/asm/kvm_host.h | 5 ----- 1 file changed, 5 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 47a1e25e25bb..de32152cea04 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -362,11 +362,6 @@ struct kvm_vcpu_arch { struct arch_timer_cpu timer_cpu; struct kvm_pmu pmu; - /* - * Anything that is not used directly from assembly code goes - * here. - */ - /* * Guest registers we preserve during guest debugging. * -- cgit From d2263de1372a452cb64666990043b8be5c40b2a1 Mon Sep 17 00:00:00 2001 From: Yuan Yao Date: Wed, 8 Jun 2022 09:20:15 +0800 Subject: KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs Assign shadow_me_value, not shadow_me_mask, to PAE root entries, a.k.a. shadow PDPTRs, when host memory encryption is supported. The "mask" is the set of all possible memory encryption bits, e.g. MKTME KeyIDs, whereas "value" holds the actual value that needs to be stuffed into host page tables. Using shadow_me_mask results in a failed VM-Entry due to setting reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to physical addresses with non-zero MKTME bits sending to_shadow_page() into the weeds: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state. BUG: unable to handle page fault for address: ffd43f00063049e8 PGD 86dfd8067 P4D 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm] kvm_mmu_free_roots+0xd1/0x200 [kvm] __kvm_mmu_unload+0x29/0x70 [kvm] kvm_mmu_unload+0x13/0x20 [kvm] kvm_arch_destroy_vm+0x8a/0x190 [kvm] kvm_put_kvm+0x197/0x2d0 [kvm] kvm_vm_release+0x21/0x30 [kvm] __fput+0x8e/0x260 ____fput+0xe/0x10 task_work_run+0x6f/0xb0 do_exit+0x327/0xa90 do_group_exit+0x35/0xa0 get_signal+0x911/0x930 arch_do_signal_or_restart+0x37/0x720 exit_to_user_mode_prepare+0xb2/0x140 syscall_exit_to_user_mode+0x16/0x30 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e54f1ff244ac ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask") Signed-off-by: Yuan Yao Reviewed-by: Kai Huang Message-Id: <20220608012015.19566-1-yuan.yao@intel.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index e826ee9138fa..17252f39bd7c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3411,7 +3411,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu) root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), i << 30, PT32_ROOT_LEVEL, true); mmu->pae_root[i] = root | PT_PRESENT_MASK | - shadow_me_mask; + shadow_me_value; } mmu->root.hpa = __pa(mmu->pae_root); } else { -- cgit From a9603ae0e4ee6e7de0184801d4abe5925f43b49c Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:23 +0300 Subject: KVM: x86: document AVIC/APICv inhibit reasons These days there are too many AVIC/APICv inhibit reasons, and it doesn't hurt to have some documentation for them. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 59 +++++++++++++++++++++++++++++++++++++++-- 1 file changed, 57 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3a240a64ac68..1f9e47b895cf 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1047,14 +1047,69 @@ struct kvm_x86_msr_filter { }; enum kvm_apicv_inhibit { + + /********************************************************************/ + /* INHIBITs that are relevant to both Intel's APICv and AMD's AVIC. */ + /********************************************************************/ + + /* + * APIC acceleration is disabled by a module parameter + * and/or not supported in hardware. + */ APICV_INHIBIT_REASON_DISABLE, + + /* + * APIC acceleration is inhibited because AutoEOI feature is + * being used by a HyperV guest. + */ APICV_INHIBIT_REASON_HYPERV, + + /* + * APIC acceleration is inhibited because the userspace didn't yet + * enable the kernel/split irqchip. + */ + APICV_INHIBIT_REASON_ABSENT, + + /* APIC acceleration is inhibited because KVM_GUESTDBG_BLOCKIRQ + * (out of band, debug measure of blocking all interrupts on this vCPU) + * was enabled, to avoid AVIC/APICv bypassing it. + */ + APICV_INHIBIT_REASON_BLOCKIRQ, + + /******************************************************/ + /* INHIBITs that are relevant only to the AMD's AVIC. */ + /******************************************************/ + + /* + * AVIC is inhibited on a vCPU because it runs a nested guest. + * + * This is needed because unlike APICv, the peers of this vCPU + * cannot use the doorbell mechanism to signal interrupts via AVIC when + * a vCPU runs nested. + */ APICV_INHIBIT_REASON_NESTED, + + /* + * On SVM, the wait for the IRQ window is implemented with pending vIRQ, + * which cannot be injected when the AVIC is enabled, thus AVIC + * is inhibited while KVM waits for IRQ window. + */ APICV_INHIBIT_REASON_IRQWIN, + + /* + * PIT (i8254) 're-inject' mode, relies on EOI intercept, + * which AVIC doesn't support for edge triggered interrupts. + */ APICV_INHIBIT_REASON_PIT_REINJ, + + /* + * AVIC is inhibited because the guest has x2apic in its CPUID. + */ APICV_INHIBIT_REASON_X2APIC, - APICV_INHIBIT_REASON_BLOCKIRQ, - APICV_INHIBIT_REASON_ABSENT, + + /* + * AVIC is disabled because SEV doesn't support it. + */ APICV_INHIBIT_REASON_SEV, }; -- cgit From 3743c2f0251743b8ae968329708bbbeefff244cf Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:24 +0300 Subject: KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base Neither of these settings should be changed by the guest and it is a burden to support it in the acceleration code, so just inhibit this code instead. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 8 ++++++++ arch/x86/kvm/lapic.c | 27 +++++++++++++++++++++++---- arch/x86/kvm/svm/avic.c | 4 +++- arch/x86/kvm/vmx/vmx.c | 4 +++- 4 files changed, 37 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1f9e47b895cf..9217bd6cf0d1 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1076,6 +1076,14 @@ enum kvm_apicv_inhibit { */ APICV_INHIBIT_REASON_BLOCKIRQ, + /* + * For simplicity, the APIC acceleration is inhibited + * first time either APIC ID or APIC base are changed by the guest + * from their reset values. + */ + APICV_INHIBIT_REASON_APIC_ID_MODIFIED, + APICV_INHIBIT_REASON_APIC_BASE_MODIFIED, + /******************************************************/ /* INHIBITs that are relevant only to the AMD's AVIC. */ /******************************************************/ diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index f1bdac3f5aa8..0e68b4c937fc 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2039,6 +2039,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val) } } +static void kvm_lapic_xapic_id_updated(struct kvm_lapic *apic) +{ + struct kvm *kvm = apic->vcpu->kvm; + + if (KVM_BUG_ON(apic_x2apic_mode(apic), kvm)) + return; + + if (kvm_xapic_id(apic) == apic->vcpu->vcpu_id) + return; + + kvm_set_apicv_inhibit(apic->vcpu->kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED); +} + static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) { int ret = 0; @@ -2047,10 +2060,12 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) switch (reg) { case APIC_ID: /* Local APIC ID */ - if (!apic_x2apic_mode(apic)) + if (!apic_x2apic_mode(apic)) { kvm_apic_set_xapic_id(apic, val >> 24); - else + kvm_lapic_xapic_id_updated(apic); + } else { ret = 1; + } break; case APIC_TASKPRI: @@ -2336,8 +2351,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) MSR_IA32_APICBASE_BASE; if ((value & MSR_IA32_APICBASE_ENABLE) && - apic->base_address != APIC_DEFAULT_PHYS_BASE) - pr_warn_once("APIC base relocation is unsupported by KVM"); + apic->base_address != APIC_DEFAULT_PHYS_BASE) { + kvm_set_apicv_inhibit(apic->vcpu->kvm, + APICV_INHIBIT_REASON_APIC_BASE_MODIFIED); + } } void kvm_apic_update_apicv(struct kvm_vcpu *vcpu) @@ -2648,6 +2665,8 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu, icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR); __kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32); } + } else { + kvm_lapic_xapic_id_updated(vcpu->arch.apic); } return 0; diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 54fe03714f8a..8dffd67f6086 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -910,7 +910,9 @@ bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason) BIT(APICV_INHIBIT_REASON_PIT_REINJ) | BIT(APICV_INHIBIT_REASON_X2APIC) | BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | - BIT(APICV_INHIBIT_REASON_SEV); + BIT(APICV_INHIBIT_REASON_SEV | + BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED)); return supported & BIT(reason); } diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9bd86ecccdab..553dd2317b9c 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7709,7 +7709,9 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason) ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | BIT(APICV_INHIBIT_REASON_ABSENT) | BIT(APICV_INHIBIT_REASON_HYPERV) | - BIT(APICV_INHIBIT_REASON_BLOCKIRQ); + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | + BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED); return supported & BIT(reason); } -- cgit From f5f9089f76ddc882b915c5d78e4beeb48dcabd1b Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:25 +0300 Subject: KVM: x86: SVM: remove avic's broken code that updated APIC ID AVIC is now inhibited if the guest changes the apic id, and therefore this code is no longer needed. There are several ways this code was broken, including: 1. a vCPU was only allowed to change its apic id to an apic id of an existing vCPU. 2. After such change, the vCPU whose apic id entry was overwritten, could not correctly change its own apic id, because its own entry is already overwritten. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 35 ----------------------------------- 1 file changed, 35 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 8dffd67f6086..072e2c8cc66a 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -508,35 +508,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu) return ret; } -static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu) -{ - u64 *old, *new; - struct vcpu_svm *svm = to_svm(vcpu); - u32 id = kvm_xapic_id(vcpu->arch.apic); - - if (vcpu->vcpu_id == id) - return 0; - - old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id); - new = avic_get_physical_id_entry(vcpu, id); - if (!new || !old) - return 1; - - /* We need to move physical_id_entry to new offset */ - *new = *old; - *old = 0ULL; - to_svm(vcpu)->avic_physical_id_cache = new; - - /* - * Also update the guest physical APIC ID in the logical - * APIC ID table entry if already setup the LDR. - */ - if (svm->ldr_reg) - avic_handle_ldr_update(vcpu); - - return 0; -} - static void avic_handle_dfr_update(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -555,10 +526,6 @@ static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu) AVIC_UNACCEL_ACCESS_OFFSET_MASK; switch (offset) { - case APIC_ID: - if (avic_handle_apic_id_update(vcpu)) - return 0; - break; case APIC_LDR: if (avic_handle_ldr_update(vcpu)) return 0; @@ -650,8 +617,6 @@ int avic_init_vcpu(struct vcpu_svm *svm) void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu) { - if (avic_handle_apic_id_update(vcpu) != 0) - return; avic_handle_dfr_update(vcpu); avic_handle_ldr_update(vcpu); } -- cgit From 603ccef42ce9f07840cf4c0448f3261413460b07 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:26 +0300 Subject: KVM: x86: SVM: fix avic_kick_target_vcpus_fast MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are two issues in avic_kick_target_vcpus_fast 1. It is legal to issue an IPI request with APIC_DEST_NOSHORT and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic), which must be treated as a broadcast destination. Fix this by explicitly checking for it. Also don’t use ‘index’ in this case as it gives no new information. 2. It is legal to issue a logical IPI request to more than one target. Index field only provides index in physical id table of first such target and therefore can't be used before we are sure that only a single target was addressed. Instead, parse the ICRL/ICRH, double check that a unicast interrupt was requested, and use that info to figure out the physical id of the target vCPU. At that point there is no need to use the index field as well. In addition to fixing the above issues, also skip the call to kvm_apic_match_dest. It is possible to do this now, because now as long as AVIC is not inhibited, it is guaranteed that none of the vCPUs changed their apic id from its default value. This fixes boot of windows guest with AVIC enabled because it uses IPI with 0xFF destination and no destination shorthand. Fixes: 7223fd2d5338 ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 105 +++++++++++++++++++++++++++++++----------------- 1 file changed, 69 insertions(+), 36 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 072e2c8cc66a..5d98ac575ded 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -291,58 +291,91 @@ void avic_ring_doorbell(struct kvm_vcpu *vcpu) static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source, u32 icrl, u32 icrh, u32 index) { - u32 dest, apic_id; - struct kvm_vcpu *vcpu; + u32 l1_physical_id, dest; + struct kvm_vcpu *target_vcpu; int dest_mode = icrl & APIC_DEST_MASK; int shorthand = icrl & APIC_SHORT_MASK; struct kvm_svm *kvm_svm = to_kvm_svm(kvm); - u32 *avic_logical_id_table = page_address(kvm_svm->avic_logical_id_table_page); if (shorthand != APIC_DEST_NOSHORT) return -EINVAL; - /* - * The AVIC incomplete IPI #vmexit info provides index into - * the physical APIC ID table, which can be used to derive - * guest physical APIC ID. - */ + if (apic_x2apic_mode(source)) + dest = icrh; + else + dest = GET_APIC_DEST_FIELD(icrh); + if (dest_mode == APIC_DEST_PHYSICAL) { - apic_id = index; + /* broadcast destination, use slow path */ + if (apic_x2apic_mode(source) && dest == X2APIC_BROADCAST) + return -EINVAL; + if (!apic_x2apic_mode(source) && dest == APIC_BROADCAST) + return -EINVAL; + + l1_physical_id = dest; + + if (WARN_ON_ONCE(l1_physical_id != index)) + return -EINVAL; + } else { - if (!apic_x2apic_mode(source)) { - /* For xAPIC logical mode, the index is for logical APIC table. */ - apic_id = avic_logical_id_table[index] & 0x1ff; + u32 bitmap, cluster; + int logid_index; + + if (apic_x2apic_mode(source)) { + /* 16 bit dest mask, 16 bit cluster id */ + bitmap = dest & 0xFFFF0000; + cluster = (dest >> 16) << 4; + } else if (kvm_lapic_get_reg(source, APIC_DFR) == APIC_DFR_FLAT) { + /* 8 bit dest mask*/ + bitmap = dest; + cluster = 0; } else { - return -EINVAL; + /* 4 bit desk mask, 4 bit cluster id */ + bitmap = dest & 0xF; + cluster = (dest >> 4) << 2; } - } - /* - * Assuming vcpu ID is the same as physical apic ID, - * and use it to retrieve the target vCPU. - */ - vcpu = kvm_get_vcpu_by_id(kvm, apic_id); - if (!vcpu) - return -EINVAL; + if (unlikely(!bitmap)) + /* guest bug: nobody to send the logical interrupt to */ + return 0; - if (apic_x2apic_mode(vcpu->arch.apic)) - dest = icrh; - else - dest = GET_APIC_DEST_FIELD(icrh); + if (!is_power_of_2(bitmap)) + /* multiple logical destinations, use slow path */ + return -EINVAL; - /* - * Try matching the destination APIC ID with the vCPU. - */ - if (kvm_apic_match_dest(vcpu, source, shorthand, dest, dest_mode)) { - vcpu->arch.apic->irr_pending = true; - svm_complete_interrupt_delivery(vcpu, - icrl & APIC_MODE_MASK, - icrl & APIC_INT_LEVELTRIG, - icrl & APIC_VECTOR_MASK); - return 0; + logid_index = cluster + __ffs(bitmap); + + if (apic_x2apic_mode(source)) { + l1_physical_id = logid_index; + } else { + u32 *avic_logical_id_table = + page_address(kvm_svm->avic_logical_id_table_page); + + u32 logid_entry = avic_logical_id_table[logid_index]; + + if (WARN_ON_ONCE(index != logid_index)) + return -EINVAL; + + /* guest bug: non existing/reserved logical destination */ + if (unlikely(!(logid_entry & AVIC_LOGICAL_ID_ENTRY_VALID_MASK))) + return 0; + + l1_physical_id = logid_entry & + AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK; + } } - return -EINVAL; + target_vcpu = kvm_get_vcpu_by_id(kvm, l1_physical_id); + if (unlikely(!target_vcpu)) + /* guest bug: non existing vCPU is a target of this IPI*/ + return 0; + + target_vcpu->arch.apic->irr_pending = true; + svm_complete_interrupt_delivery(target_vcpu, + icrl & APIC_MODE_MASK, + icrl & APIC_INT_LEVELTRIG, + icrl & APIC_VECTOR_MASK); + return 0; } static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source, -- cgit From 66c768d30e64e1280520f34dbef83419f55f3459 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:27 +0300 Subject: KVM: x86: disable preemption while updating apicv inhibition Currently nothing prevents preemption in kvm_vcpu_update_apicv. On SVM, If the preemption happens after we update the vcpu->arch.apicv_active, the preemption itself will 'update' the inhibition since the AVIC will be first disabled on vCPU unload and then enabled, when the current task is loaded again. Then we will try to update it again, which will lead to a warning in __avic_vcpu_load, that the AVIC is already enabled. Fix this by disabling preemption in this code. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 03fbfbbec460..158b2e135efc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9850,6 +9850,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu) return; down_read(&vcpu->kvm->arch.apicv_update_lock); + preempt_disable(); activate = kvm_vcpu_apicv_activated(vcpu); @@ -9870,6 +9871,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu) kvm_make_request(KVM_REQ_EVENT, vcpu); out: + preempt_enable(); up_read(&vcpu->kvm->arch.apicv_update_lock); } EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv); -- cgit From 18869f26df1a11ed11031dfb7392bc7d774062e8 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:28 +0300 Subject: KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking On SVM, if preemption happens right after the call to finish_rcuwait but before call to kvm_arch_vcpu_unblocking on SVM/AVIC, it itself will re-enable AVIC, and then we will try to re-enable it again in kvm_arch_vcpu_unblocking which will lead to a warning in __avic_vcpu_load. The same problem can happen if the vCPU is preempted right after the call to kvm_arch_vcpu_blocking but before the call to prepare_to_rcuwait and in this case, we will end up with AVIC enabled during sleep - Ooops. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- virt/kvm/kvm_main.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 44c47670447a..a49df8988cd6 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3328,9 +3328,11 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu) vcpu->stat.generic.blocking = 1; + preempt_disable(); kvm_arch_vcpu_blocking(vcpu); - prepare_to_rcuwait(wait); + preempt_enable(); + for (;;) { set_current_state(TASK_INTERRUPTIBLE); @@ -3340,9 +3342,11 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu) waited = true; schedule(); } - finish_rcuwait(wait); + preempt_disable(); + finish_rcuwait(wait); kvm_arch_vcpu_unblocking(vcpu); + preempt_enable(); vcpu->stat.generic.blocking = 0; -- cgit From ba8ec273240a7a67819b5957c8d06a267ec54db7 Mon Sep 17 00:00:00 2001 From: Maxim Levitsky Date: Mon, 6 Jun 2022 21:08:29 +0300 Subject: KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put Now that these functions are always called with preemption disabled, remove the preempt_disable()/preempt_enable() pair inside them. No functional change intended. Signed-off-by: Maxim Levitsky Message-Id: <20220606180829.102503-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 27 ++++----------------------- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/svm/svm.h | 4 ++-- 3 files changed, 8 insertions(+), 27 deletions(-) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 5d98ac575ded..5542d8959e11 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -946,7 +946,7 @@ out: return ret; } -void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { u64 entry; int h_physical_id = kvm_cpu_get_apicid(cpu); @@ -978,7 +978,7 @@ void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu) avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true); } -void __avic_vcpu_put(struct kvm_vcpu *vcpu) +void avic_vcpu_put(struct kvm_vcpu *vcpu) { u64 entry; struct vcpu_svm *svm = to_svm(vcpu); @@ -997,25 +997,6 @@ void __avic_vcpu_put(struct kvm_vcpu *vcpu) WRITE_ONCE(*(svm->avic_physical_id_cache), entry); } -static void avic_vcpu_load(struct kvm_vcpu *vcpu) -{ - int cpu = get_cpu(); - - WARN_ON(cpu != vcpu->cpu); - - __avic_vcpu_load(vcpu, cpu); - - put_cpu(); -} - -static void avic_vcpu_put(struct kvm_vcpu *vcpu) -{ - preempt_disable(); - - __avic_vcpu_put(vcpu); - - preempt_enable(); -} void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) { @@ -1042,7 +1023,7 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu) vmcb_mark_dirty(vmcb, VMCB_AVIC); if (activated) - avic_vcpu_load(vcpu); + avic_vcpu_load(vcpu, vcpu->cpu); else avic_vcpu_put(vcpu); @@ -1075,5 +1056,5 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu) if (!kvm_vcpu_apicv_active(vcpu)) return; - avic_vcpu_load(vcpu); + avic_vcpu_load(vcpu, vcpu->cpu); } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1dc02cdf6960..1ac66fbceaa1 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1400,13 +1400,13 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) indirect_branch_prediction_barrier(); } if (kvm_vcpu_apicv_active(vcpu)) - __avic_vcpu_load(vcpu, cpu); + avic_vcpu_load(vcpu, cpu); } static void svm_vcpu_put(struct kvm_vcpu *vcpu) { if (kvm_vcpu_apicv_active(vcpu)) - __avic_vcpu_put(vcpu); + avic_vcpu_put(vcpu); svm_prepare_host_switch(vcpu); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 500348c1cb35..1bddd336a27e 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -610,8 +610,8 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb); int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu); int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu); int avic_init_vcpu(struct vcpu_svm *svm); -void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu); -void __avic_vcpu_put(struct kvm_vcpu *vcpu); +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu); +void avic_vcpu_put(struct kvm_vcpu *vcpu); void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu); void avic_set_virtual_apic_mode(struct kvm_vcpu *vcpu); void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu); -- cgit From e3cdaab5ff022874e65df80ae8b8382ccc0a4fe0 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 31 May 2022 13:57:32 -0400 Subject: KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE Commit 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") introduced passthrough support for nested pause filtering, (when the host doesn't intercept PAUSE) (either disabled with kvm module param, or disabled with '-overcommit cpu-pm=on') Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards, the feature was exposed as supported by KVM cpuid unconditionally, thus if L1 could try to use it even when the L0 KVM can't really support it. In this case the fallback caused KVM to intercept each PAUSE instruction; in some cases, such intercept can slow down the nested guest so much that it can fail to boot. Instead, before the problematic commit KVM was already setting both thresholds to 0 in vmcb02, but after the first userspace VM exit shrink_ple_window was called and would reset the pause_filter_count to the default value. To fix this, change the fallback strategy - ignore the guest threshold values, but use/update the host threshold values unless the guest specifically requests disabling PAUSE filtering (either simple or advanced). Also fix a minor bug: on nested VM exit, when PAUSE filter counter were copied back to vmcb01, a dirty bit was not set. Thanks a lot to Suravee Suthikulpanit for debugging this! Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Reported-by: Suravee Suthikulpanit Tested-by: Suravee Suthikulpanit Co-developed-by: Maxim Levitsky Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/nested.c | 39 +++++++++++++++++++++------------------ arch/x86/kvm/svm/svm.c | 4 ++-- 2 files changed, 23 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 3361258640a2..ba7cd26f438f 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -616,6 +616,8 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) struct kvm_vcpu *vcpu = &svm->vcpu; struct vmcb *vmcb01 = svm->vmcb01.ptr; struct vmcb *vmcb02 = svm->nested.vmcb02.ptr; + u32 pause_count12; + u32 pause_thresh12; /* * Filled at exit: exit_code, exit_code_hi, exit_info_1, exit_info_2, @@ -671,27 +673,25 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (!nested_vmcb_needs_vls_intercept(svm)) vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + pause_count12 = svm->pause_filter_enabled ? svm->nested.ctl.pause_filter_count : 0; + pause_thresh12 = svm->pause_threshold_enabled ? svm->nested.ctl.pause_filter_thresh : 0; if (kvm_pause_in_guest(svm->vcpu.kvm)) { - /* use guest values since host doesn't use them */ - vmcb02->control.pause_filter_count = - svm->pause_filter_enabled ? - svm->nested.ctl.pause_filter_count : 0; + /* use guest values since host doesn't intercept PAUSE */ + vmcb02->control.pause_filter_count = pause_count12; + vmcb02->control.pause_filter_thresh = pause_thresh12; - vmcb02->control.pause_filter_thresh = - svm->pause_threshold_enabled ? - svm->nested.ctl.pause_filter_thresh : 0; - - } else if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) { - /* use host values when guest doesn't use them */ + } else { + /* start from host values otherwise */ vmcb02->control.pause_filter_count = vmcb01->control.pause_filter_count; vmcb02->control.pause_filter_thresh = vmcb01->control.pause_filter_thresh; - } else { - /* - * Intercept every PAUSE otherwise and - * ignore both host and guest values - */ - vmcb02->control.pause_filter_count = 0; - vmcb02->control.pause_filter_thresh = 0; + + /* ... but ensure filtering is disabled if so requested. */ + if (vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) { + if (!pause_count12) + vmcb02->control.pause_filter_count = 0; + if (!pause_thresh12) + vmcb02->control.pause_filter_thresh = 0; + } } nested_svm_transition_tlb_flush(vcpu); @@ -951,8 +951,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb12->control.event_inj = svm->nested.ctl.event_inj; vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err; - if (!kvm_pause_in_guest(vcpu->kvm) && vmcb02->control.pause_filter_count) + if (!kvm_pause_in_guest(vcpu->kvm)) { vmcb01->control.pause_filter_count = vmcb02->control.pause_filter_count; + vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS); + + } nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 1ac66fbceaa1..87da90360bc7 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -921,7 +921,7 @@ static void grow_ple_window(struct kvm_vcpu *vcpu) struct vmcb_control_area *control = &svm->vmcb->control; int old = control->pause_filter_count; - if (kvm_pause_in_guest(vcpu->kvm) || !old) + if (kvm_pause_in_guest(vcpu->kvm)) return; control->pause_filter_count = __grow_ple_window(old, @@ -942,7 +942,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) struct vmcb_control_area *control = &svm->vmcb->control; int old = control->pause_filter_count; - if (kvm_pause_in_guest(vcpu->kvm) || !old) + if (kvm_pause_in_guest(vcpu->kvm)) return; control->pause_filter_count = -- cgit From 4ee602e78d706e740a48be9b6ddb239df4a113b5 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:39 +0000 Subject: KVM: selftests: Replace x86_page_size with PG_LEVEL_XX x86_page_size is an enum used to communicate the desired page size with which to map a range of memory. Under the hood they just encode the desired level at which to map the page. This ends up being clunky in a few ways: - The name suggests it encodes the size of the page rather than the level. - In other places in x86_64/processor.c we just use a raw int to encode the level. Simplify this by adopting the kernel style of PG_LEVEL_XX enums and pass around raw ints when referring to the level. This makes the code easier to understand since these macros are very common in KVM MMU code. Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- .../selftests/kvm/include/x86_64/processor.h | 18 ++++++++----- tools/testing/selftests/kvm/lib/x86_64/processor.c | 31 +++++++++++----------- .../testing/selftests/kvm/max_guest_memory_test.c | 2 +- tools/testing/selftests/kvm/x86_64/mmu_role_test.c | 2 +- 4 files changed, 29 insertions(+), 24 deletions(-) diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index d0d51adec76e..273c70e91647 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -482,13 +482,19 @@ void vcpu_set_hv_cpuid(struct kvm_vm *vm, uint32_t vcpuid); struct kvm_cpuid2 *vcpu_get_supported_hv_cpuid(struct kvm_vm *vm, uint32_t vcpuid); void vm_xsave_req_perm(int bit); -enum x86_page_size { - X86_PAGE_SIZE_4K = 0, - X86_PAGE_SIZE_2M, - X86_PAGE_SIZE_1G, +enum pg_level { + PG_LEVEL_NONE, + PG_LEVEL_4K, + PG_LEVEL_2M, + PG_LEVEL_1G, + PG_LEVEL_512G, + PG_LEVEL_NUM }; -void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, - enum x86_page_size page_size); + +#define PG_LEVEL_SHIFT(_level) ((_level - 1) * 9 + 12) +#define PG_LEVEL_SIZE(_level) (1ull << PG_LEVEL_SHIFT(_level)) + +void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level); /* * Basic CPU control in CR0 diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c index 33ea5e9955d9..ead7011ee8f6 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/processor.c +++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c @@ -158,7 +158,7 @@ static void *virt_get_pte(struct kvm_vm *vm, uint64_t pt_pfn, uint64_t vaddr, int level) { uint64_t *page_table = addr_gpa2hva(vm, pt_pfn << vm->page_shift); - int index = vaddr >> (vm->page_shift + level * 9) & 0x1ffu; + int index = (vaddr >> PG_LEVEL_SHIFT(level)) & 0x1ffu; return &page_table[index]; } @@ -167,14 +167,14 @@ static uint64_t *virt_create_upper_pte(struct kvm_vm *vm, uint64_t pt_pfn, uint64_t vaddr, uint64_t paddr, - int level, - enum x86_page_size page_size) + int current_level, + int target_level) { - uint64_t *pte = virt_get_pte(vm, pt_pfn, vaddr, level); + uint64_t *pte = virt_get_pte(vm, pt_pfn, vaddr, current_level); if (!(*pte & PTE_PRESENT_MASK)) { *pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK; - if (level == page_size) + if (current_level == target_level) *pte |= PTE_LARGE_MASK | (paddr & PHYSICAL_PAGE_MASK); else *pte |= vm_alloc_page_table(vm) & PHYSICAL_PAGE_MASK; @@ -184,20 +184,19 @@ static uint64_t *virt_create_upper_pte(struct kvm_vm *vm, * a hugepage at this level, and that there isn't a hugepage at * this level. */ - TEST_ASSERT(level != page_size, + TEST_ASSERT(current_level != target_level, "Cannot create hugepage at level: %u, vaddr: 0x%lx\n", - page_size, vaddr); + current_level, vaddr); TEST_ASSERT(!(*pte & PTE_LARGE_MASK), "Cannot create page table at level: %u, vaddr: 0x%lx\n", - level, vaddr); + current_level, vaddr); } return pte; } -void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, - enum x86_page_size page_size) +void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level) { - const uint64_t pg_size = 1ull << ((page_size * 9) + 12); + const uint64_t pg_size = PG_LEVEL_SIZE(level); uint64_t *pml4e, *pdpe, *pde; uint64_t *pte; @@ -222,20 +221,20 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, * early if a hugepage was created. */ pml4e = virt_create_upper_pte(vm, vm->pgd >> vm->page_shift, - vaddr, paddr, 3, page_size); + vaddr, paddr, PG_LEVEL_512G, level); if (*pml4e & PTE_LARGE_MASK) return; - pdpe = virt_create_upper_pte(vm, PTE_GET_PFN(*pml4e), vaddr, paddr, 2, page_size); + pdpe = virt_create_upper_pte(vm, PTE_GET_PFN(*pml4e), vaddr, paddr, PG_LEVEL_1G, level); if (*pdpe & PTE_LARGE_MASK) return; - pde = virt_create_upper_pte(vm, PTE_GET_PFN(*pdpe), vaddr, paddr, 1, page_size); + pde = virt_create_upper_pte(vm, PTE_GET_PFN(*pdpe), vaddr, paddr, PG_LEVEL_2M, level); if (*pde & PTE_LARGE_MASK) return; /* Fill in page table entry. */ - pte = virt_get_pte(vm, PTE_GET_PFN(*pde), vaddr, 0); + pte = virt_get_pte(vm, PTE_GET_PFN(*pde), vaddr, PG_LEVEL_4K); TEST_ASSERT(!(*pte & PTE_PRESENT_MASK), "PTE already present for 4k page at vaddr: 0x%lx\n", vaddr); *pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK | (paddr & PHYSICAL_PAGE_MASK); @@ -243,7 +242,7 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, void virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr) { - __virt_pg_map(vm, vaddr, paddr, X86_PAGE_SIZE_4K); + __virt_pg_map(vm, vaddr, paddr, PG_LEVEL_4K); } static uint64_t *_vm_get_page_table_entry(struct kvm_vm *vm, int vcpuid, diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/max_guest_memory_test.c index 3875c4b23a04..15f046e19cb2 100644 --- a/tools/testing/selftests/kvm/max_guest_memory_test.c +++ b/tools/testing/selftests/kvm/max_guest_memory_test.c @@ -244,7 +244,7 @@ int main(int argc, char *argv[]) #ifdef __x86_64__ /* Identity map memory in the guest using 1gb pages. */ for (i = 0; i < slot_size; i += size_1gb) - __virt_pg_map(vm, gpa + i, gpa + i, X86_PAGE_SIZE_1G); + __virt_pg_map(vm, gpa + i, gpa + i, PG_LEVEL_1G); #else for (i = 0; i < slot_size; i += vm_get_page_size(vm)) virt_pg_map(vm, gpa + i, gpa + i); diff --git a/tools/testing/selftests/kvm/x86_64/mmu_role_test.c b/tools/testing/selftests/kvm/x86_64/mmu_role_test.c index da2325fcad87..bdecd532f935 100644 --- a/tools/testing/selftests/kvm/x86_64/mmu_role_test.c +++ b/tools/testing/selftests/kvm/x86_64/mmu_role_test.c @@ -35,7 +35,7 @@ static void mmu_role_test(u32 *cpuid_reg, u32 evil_cpuid_val) run = vcpu_state(vm, VCPU_ID); /* Map 1gb page without a backing memlot. */ - __virt_pg_map(vm, MMIO_GPA, MMIO_GPA, X86_PAGE_SIZE_1G); + __virt_pg_map(vm, MMIO_GPA, MMIO_GPA, PG_LEVEL_1G); r = _vcpu_run(vm, VCPU_ID); -- cgit From c5a0ccec4cb4edde8e5b7e369dbe4d169b111e42 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:40 +0000 Subject: KVM: selftests: Add option to create 2M and 1G EPT mappings The current EPT mapping code in the selftests only supports mapping 4K pages. This commit extends that support with an option to map at 2M or 1G. This will be used in a future commit to create large page mappings to test eager page splitting. No functional change intended. Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 110 +++++++++++++++------------ 1 file changed, 60 insertions(+), 50 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index d089d8b850b5..fdc1e6deb922 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -392,80 +392,90 @@ void nested_vmx_check_supported(void) } } -void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, - uint64_t nested_paddr, uint64_t paddr) +static void nested_create_pte(struct kvm_vm *vm, + struct eptPageTableEntry *pte, + uint64_t nested_paddr, + uint64_t paddr, + int current_level, + int target_level) +{ + if (!pte->readable) { + pte->writable = true; + pte->readable = true; + pte->executable = true; + pte->page_size = (current_level == target_level); + if (pte->page_size) + pte->address = paddr >> vm->page_shift; + else + pte->address = vm_alloc_page_table(vm) >> vm->page_shift; + } else { + /* + * Entry already present. Assert that the caller doesn't want + * a hugepage at this level, and that there isn't a hugepage at + * this level. + */ + TEST_ASSERT(current_level != target_level, + "Cannot create hugepage at level: %u, nested_paddr: 0x%lx\n", + current_level, nested_paddr); + TEST_ASSERT(!pte->page_size, + "Cannot create page table at level: %u, nested_paddr: 0x%lx\n", + current_level, nested_paddr); + } +} + + +void __nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr, int target_level) { - uint16_t index[4]; - struct eptPageTableEntry *pml4e; + const uint64_t page_size = PG_LEVEL_SIZE(target_level); + struct eptPageTableEntry *pt = vmx->eptp_hva, *pte; + uint16_t index; TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " "unknown or unsupported guest mode, mode: 0x%x", vm->mode); - TEST_ASSERT((nested_paddr % vm->page_size) == 0, + TEST_ASSERT((nested_paddr % page_size) == 0, "Nested physical address not on page boundary,\n" - " nested_paddr: 0x%lx vm->page_size: 0x%x", - nested_paddr, vm->page_size); + " nested_paddr: 0x%lx page_size: 0x%lx", + nested_paddr, page_size); TEST_ASSERT((nested_paddr >> vm->page_shift) <= vm->max_gfn, "Physical address beyond beyond maximum supported,\n" " nested_paddr: 0x%lx vm->max_gfn: 0x%lx vm->page_size: 0x%x", paddr, vm->max_gfn, vm->page_size); - TEST_ASSERT((paddr % vm->page_size) == 0, + TEST_ASSERT((paddr % page_size) == 0, "Physical address not on page boundary,\n" - " paddr: 0x%lx vm->page_size: 0x%x", - paddr, vm->page_size); + " paddr: 0x%lx page_size: 0x%lx", + paddr, page_size); TEST_ASSERT((paddr >> vm->page_shift) <= vm->max_gfn, "Physical address beyond beyond maximum supported,\n" " paddr: 0x%lx vm->max_gfn: 0x%lx vm->page_size: 0x%x", paddr, vm->max_gfn, vm->page_size); - index[0] = (nested_paddr >> 12) & 0x1ffu; - index[1] = (nested_paddr >> 21) & 0x1ffu; - index[2] = (nested_paddr >> 30) & 0x1ffu; - index[3] = (nested_paddr >> 39) & 0x1ffu; - - /* Allocate page directory pointer table if not present. */ - pml4e = vmx->eptp_hva; - if (!pml4e[index[3]].readable) { - pml4e[index[3]].address = vm_alloc_page_table(vm) >> vm->page_shift; - pml4e[index[3]].writable = true; - pml4e[index[3]].readable = true; - pml4e[index[3]].executable = true; - } + for (int level = PG_LEVEL_512G; level >= PG_LEVEL_4K; level--) { + index = (nested_paddr >> PG_LEVEL_SHIFT(level)) & 0x1ffu; + pte = &pt[index]; - /* Allocate page directory table if not present. */ - struct eptPageTableEntry *pdpe; - pdpe = addr_gpa2hva(vm, pml4e[index[3]].address * vm->page_size); - if (!pdpe[index[2]].readable) { - pdpe[index[2]].address = vm_alloc_page_table(vm) >> vm->page_shift; - pdpe[index[2]].writable = true; - pdpe[index[2]].readable = true; - pdpe[index[2]].executable = true; - } + nested_create_pte(vm, pte, nested_paddr, paddr, level, target_level); - /* Allocate page table if not present. */ - struct eptPageTableEntry *pde; - pde = addr_gpa2hva(vm, pdpe[index[2]].address * vm->page_size); - if (!pde[index[1]].readable) { - pde[index[1]].address = vm_alloc_page_table(vm) >> vm->page_shift; - pde[index[1]].writable = true; - pde[index[1]].readable = true; - pde[index[1]].executable = true; - } + if (pte->page_size) + break; - /* Fill in page table entry. */ - struct eptPageTableEntry *pte; - pte = addr_gpa2hva(vm, pde[index[1]].address * vm->page_size); - pte[index[0]].address = paddr >> vm->page_shift; - pte[index[0]].writable = true; - pte[index[0]].readable = true; - pte[index[0]].executable = true; + pt = addr_gpa2hva(vm, pte->address * vm->page_size); + } /* * For now mark these as accessed and dirty because the only * testcase we have needs that. Can be reconsidered later. */ - pte[index[0]].accessed = true; - pte[index[0]].dirty = true; + pte->accessed = true; + pte->dirty = true; + +} + +void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr) +{ + __nested_pg_map(vmx, vm, nested_paddr, paddr, PG_LEVEL_4K); } /* -- cgit From b8ca01ea19068b54938ebb4ebc06814a89dee8ea Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:41 +0000 Subject: KVM: selftests: Drop stale function parameter comment for nested_map() nested_map() does not take a parameter named eptp_memslot. Drop the comment referring to it. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index fdc1e6deb922..baeaa35de113 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -486,7 +486,6 @@ void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, * nested_paddr - Nested guest physical address to map * paddr - VM Physical Address * size - The size of the range to map - * eptp_memslot - Memory region slot for new virtual translation tables * * Output Args: None * -- cgit From ce690e9c17d27486af879defc506679cbbb14777 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:42 +0000 Subject: KVM: selftests: Refactor nested_map() to specify target level Refactor nested_map() to specify that it explicityl wants 4K mappings (the existing behavior) and push the implementation down into __nested_map(), which can be used in subsequent commits to create huge page mappings. No function change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index baeaa35de113..b8cfe4914a3a 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -486,6 +486,7 @@ void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, * nested_paddr - Nested guest physical address to map * paddr - VM Physical Address * size - The size of the range to map + * level - The level at which to map the range * * Output Args: None * @@ -494,22 +495,29 @@ void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, * Within the VM given by vm, creates a nested guest translation for the * page range starting at nested_paddr to the page range starting at paddr. */ -void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, - uint64_t nested_paddr, uint64_t paddr, uint64_t size) +void __nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr, uint64_t size, + int level) { - size_t page_size = vm->page_size; + size_t page_size = PG_LEVEL_SIZE(level); size_t npages = size / page_size; TEST_ASSERT(nested_paddr + size > nested_paddr, "Vaddr overflow"); TEST_ASSERT(paddr + size > paddr, "Paddr overflow"); while (npages--) { - nested_pg_map(vmx, vm, nested_paddr, paddr); + __nested_pg_map(vmx, vm, nested_paddr, paddr, level); nested_paddr += page_size; paddr += page_size; } } +void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t nested_paddr, uint64_t paddr, uint64_t size) +{ + __nested_map(vmx, vm, nested_paddr, paddr, size, PG_LEVEL_4K); +} + /* Prepare an identity extended page table that maps all the * physical pages in VM. */ -- cgit From b6c086d04c0a1ba356145cdba5b46bd6cea2b9bd Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:43 +0000 Subject: KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h This is a VMX-related macro so move it to vmx.h. While here, open code the mask like the rest of the VMX bitmask macros. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/include/x86_64/processor.h | 3 --- tools/testing/selftests/kvm/include/x86_64/vmx.h | 2 ++ 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index 273c70e91647..d78f97f502b5 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -511,9 +511,6 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level) #define X86_CR0_CD (1UL<<30) /* Cache Disable */ #define X86_CR0_PG (1UL<<31) /* Paging */ -/* VMX_EPT_VPID_CAP bits */ -#define VMX_EPT_VPID_CAP_AD_BITS (1ULL << 21) - #define XSTATE_XTILE_CFG_BIT 17 #define XSTATE_XTILE_DATA_BIT 18 diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h index 583ceb0d1457..3b1794baa97c 100644 --- a/tools/testing/selftests/kvm/include/x86_64/vmx.h +++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h @@ -96,6 +96,8 @@ #define VMX_MISC_PREEMPTION_TIMER_RATE_MASK 0x0000001f #define VMX_MISC_SAVE_EFER_LMA 0x00000020 +#define VMX_EPT_VPID_CAP_AD_BITS 0x00200000 + #define EXIT_REASON_FAILED_VMENTRY 0x80000000 #define EXIT_REASON_EXCEPTION_NMI 0 #define EXIT_REASON_EXTERNAL_INTERRUPT 1 -- cgit From c363d95986b1b930947305e2372665141721d15f Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:44 +0000 Subject: KVM: selftests: Add a helper to check EPT/VPID capabilities Create a small helper function to check if a given EPT/VPID capability is supported. This will be re-used in a follow-up commit to check for 1G page support. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/x86_64/vmx.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index b8cfe4914a3a..5bf169179455 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -198,6 +198,11 @@ bool load_vmcs(struct vmx_pages *vmx) return true; } +static bool ept_vpid_cap_supported(uint64_t mask) +{ + return rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & mask; +} + /* * Initialize the control fields to the most basic settings possible. */ @@ -215,7 +220,7 @@ static inline void init_vmcs_control_fields(struct vmx_pages *vmx) struct eptPageTablePointer eptp = { .memory_type = VMX_BASIC_MEM_TYPE_WB, .page_walk_length = 3, /* + 1 */ - .ad_enabled = !!(rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & VMX_EPT_VPID_CAP_AD_BITS), + .ad_enabled = ept_vpid_cap_supported(VMX_EPT_VPID_CAP_AD_BITS), .address = vmx->eptp_gpa >> PAGE_SHIFT_4K, }; -- cgit From acf57736e755ba5c467fc6fa85e4a0750cc36150 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:45 +0000 Subject: KVM: selftests: Drop unnecessary rule for STATIC_LIBS Drop the "all: $(STATIC_LIBS)" rule. The KVM selftests already depend on $(STATIC_LIBS), so there is no reason to have an extra "all" rule. Suggested-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 81470a99ed1c..e7d65e04b16a 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -192,7 +192,6 @@ $(OUTPUT)/libkvm.a: $(LIBKVM_OBJS) $(AR) crs $@ $^ x := $(shell mkdir -p $(sort $(dir $(TEST_GEN_PROGS)))) -all: $(STATIC_LIBS) $(TEST_GEN_PROGS): $(STATIC_LIBS) cscope: include_paths = $(LINUX_TOOL_INCLUDE) $(LINUX_HDR_PATH) include lib .. -- cgit From cdc979dae265cc77a035b736f78f58e4c7309bb2 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:46 +0000 Subject: KVM: selftests: Link selftests directly with lib object files The linker does obey strong/weak symbols when linking static libraries, it simply resolves an undefined symbol to the first-encountered symbol. This means that defining __weak arch-generic functions and then defining arch-specific strong functions to override them in libkvm will not always work. More specifically, if we have: lib/generic.c: void __weak foo(void) { pr_info("weak\n"); } void bar(void) { foo(); } lib/x86_64/arch.c: void foo(void) { pr_info("strong\n"); } And a selftest that calls bar(), it will print "weak". Now if you make generic.o explicitly depend on arch.o (e.g. add function to arch.c that is called directly from generic.c) it will print "strong". In other words, it seems that the linker is free to throw out arch.o when linking because generic.o does not explicitly depend on it, which causes the linker to lose the strong symbol. One solution is to link libkvm.a with --whole-archive so that the linker doesn't throw away object files it thinks are unnecessary. However that is a bit difficult to plumb since we are using the common selftests makefile rules. An easier solution is to drop libkvm.a just link selftests with all the .o files that were originally in libkvm.a. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index e7d65e04b16a..804bf927618a 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -173,12 +173,13 @@ LDFLAGS += -pthread $(no-pie-option) $(pgste-option) # $(TEST_GEN_PROGS) starts with $(OUTPUT)/ include ../lib.mk -STATIC_LIBS := $(OUTPUT)/libkvm.a LIBKVM_C := $(filter %.c,$(LIBKVM)) LIBKVM_S := $(filter %.S,$(LIBKVM)) LIBKVM_C_OBJ := $(patsubst %.c, $(OUTPUT)/%.o, $(LIBKVM_C)) LIBKVM_S_OBJ := $(patsubst %.S, $(OUTPUT)/%.o, $(LIBKVM_S)) -EXTRA_CLEAN += $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) $(STATIC_LIBS) cscope.* +LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) + +EXTRA_CLEAN += $(LIBKVM_OBJS) cscope.* x := $(shell mkdir -p $(sort $(dir $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ)))) $(LIBKVM_C_OBJ): $(OUTPUT)/%.o: %.c @@ -187,12 +188,8 @@ $(LIBKVM_C_OBJ): $(OUTPUT)/%.o: %.c $(LIBKVM_S_OBJ): $(OUTPUT)/%.o: %.S $(CC) $(CFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c $< -o $@ -LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) -$(OUTPUT)/libkvm.a: $(LIBKVM_OBJS) - $(AR) crs $@ $^ - x := $(shell mkdir -p $(sort $(dir $(TEST_GEN_PROGS)))) -$(TEST_GEN_PROGS): $(STATIC_LIBS) +$(TEST_GEN_PROGS): $(LIBKVM_OBJS) cscope: include_paths = $(LINUX_TOOL_INCLUDE) $(LINUX_HDR_PATH) include lib .. cscope: -- cgit From cf97d5e99f69f876dc310ea21b5f97c3a493a18a Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:47 +0000 Subject: KVM: selftests: Clean up LIBKVM files in Makefile Break up the long lines for LIBKVM and alphabetize each architecture. This makes reading the Makefile easier, and will make reading diffs to LIBKVM easier. No functional change intended. Reviewed-by: Peter Xu Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 36 +++++++++++++++++++++++++++++++----- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 804bf927618a..14566c0a330d 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -37,11 +37,37 @@ ifeq ($(ARCH),riscv) UNAME_M := riscv endif -LIBKVM = lib/assert.c lib/elf.c lib/io.c lib/kvm_util.c lib/rbtree.c lib/sparsebit.c lib/test_util.c lib/guest_modes.c lib/perf_test_util.c -LIBKVM_x86_64 = lib/x86_64/apic.c lib/x86_64/processor.c lib/x86_64/vmx.c lib/x86_64/svm.c lib/x86_64/ucall.c lib/x86_64/handlers.S -LIBKVM_aarch64 = lib/aarch64/processor.c lib/aarch64/ucall.c lib/aarch64/handlers.S lib/aarch64/spinlock.c lib/aarch64/gic.c lib/aarch64/gic_v3.c lib/aarch64/vgic.c -LIBKVM_s390x = lib/s390x/processor.c lib/s390x/ucall.c lib/s390x/diag318_test_handler.c -LIBKVM_riscv = lib/riscv/processor.c lib/riscv/ucall.c +LIBKVM += lib/assert.c +LIBKVM += lib/elf.c +LIBKVM += lib/guest_modes.c +LIBKVM += lib/io.c +LIBKVM += lib/kvm_util.c +LIBKVM += lib/perf_test_util.c +LIBKVM += lib/rbtree.c +LIBKVM += lib/sparsebit.c +LIBKVM += lib/test_util.c + +LIBKVM_x86_64 += lib/x86_64/apic.c +LIBKVM_x86_64 += lib/x86_64/handlers.S +LIBKVM_x86_64 += lib/x86_64/processor.c +LIBKVM_x86_64 += lib/x86_64/svm.c +LIBKVM_x86_64 += lib/x86_64/ucall.c +LIBKVM_x86_64 += lib/x86_64/vmx.c + +LIBKVM_aarch64 += lib/aarch64/gic.c +LIBKVM_aarch64 += lib/aarch64/gic_v3.c +LIBKVM_aarch64 += lib/aarch64/handlers.S +LIBKVM_aarch64 += lib/aarch64/processor.c +LIBKVM_aarch64 += lib/aarch64/spinlock.c +LIBKVM_aarch64 += lib/aarch64/ucall.c +LIBKVM_aarch64 += lib/aarch64/vgic.c + +LIBKVM_s390x += lib/s390x/diag318_test_handler.c +LIBKVM_s390x += lib/s390x/processor.c +LIBKVM_s390x += lib/s390x/ucall.c + +LIBKVM_riscv += lib/riscv/processor.c +LIBKVM_riscv += lib/riscv/ucall.c TEST_GEN_PROGS_x86_64 = x86_64/cpuid_test TEST_GEN_PROGS_x86_64 += x86_64/cr4_cpuid_sync_test -- cgit From 71d489661904fcc3ec31b343acd5c0dac84b5410 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:48 +0000 Subject: KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2 Add an option to dirty_log_perf_test that configures the vCPUs to run in L2 instead of L1. This makes it possible to benchmark the dirty logging performance of nested virtualization, which is particularly interesting because KVM must shadow L1's EPT/NPT tables. For now this support only works on x86_64 CPUs with VMX. Otherwise passing -n results in the test being skipped. Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/Makefile | 1 + tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 +- .../testing/selftests/kvm/include/perf_test_util.h | 9 ++ .../selftests/kvm/include/x86_64/processor.h | 4 + tools/testing/selftests/kvm/include/x86_64/vmx.h | 4 + tools/testing/selftests/kvm/lib/perf_test_util.c | 35 ++++++- .../selftests/kvm/lib/x86_64/perf_test_util.c | 112 +++++++++++++++++++++ tools/testing/selftests/kvm/lib/x86_64/vmx.c | 15 +++ 8 files changed, 182 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 14566c0a330d..22423c871ed6 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -49,6 +49,7 @@ LIBKVM += lib/test_util.c LIBKVM_x86_64 += lib/x86_64/apic.c LIBKVM_x86_64 += lib/x86_64/handlers.S +LIBKVM_x86_64 += lib/x86_64/perf_test_util.c LIBKVM_x86_64 += lib/x86_64/processor.c LIBKVM_x86_64 += lib/x86_64/svm.c LIBKVM_x86_64 += lib/x86_64/ucall.c diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 7b47ae4f952e..d60a34cdfaee 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -336,8 +336,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) static void help(char *name) { puts(""); - printf("usage: %s [-h] [-i iterations] [-p offset] [-g]" - "[-m mode] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]" + printf("usage: %s [-h] [-i iterations] [-p offset] [-g] " + "[-m mode] [-n] [-b vcpu bytes] [-v vcpus] [-o] [-s mem type]" "[-x memslots]\n", name); puts(""); printf(" -i: specify iteration counts (default: %"PRIu64")\n", @@ -351,6 +351,7 @@ static void help(char *name) printf(" -p: specify guest physical test memory offset\n" " Warning: a low offset can conflict with the loaded test code.\n"); guest_modes_help(); + printf(" -n: Run the vCPUs in nested mode (L2)\n"); printf(" -b: specify the size of the memory region which should be\n" " dirtied by each vCPU. e.g. 10M or 3G.\n" " (default: 1G)\n"); @@ -387,7 +388,7 @@ int main(int argc, char *argv[]) guest_modes_append_default(); - while ((opt = getopt(argc, argv, "ghi:p:m:b:f:v:os:x:")) != -1) { + while ((opt = getopt(argc, argv, "ghi:p:m:nb:f:v:os:x:")) != -1) { switch (opt) { case 'g': dirty_log_manual_caps = 0; @@ -401,6 +402,9 @@ int main(int argc, char *argv[]) case 'm': guest_modes_cmdline(optarg); break; + case 'n': + perf_test_args.nested = true; + break; case 'b': guest_percpu_mem_size = parse_size(optarg); break; diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h index a86f953d8d36..d822cb670f1c 100644 --- a/tools/testing/selftests/kvm/include/perf_test_util.h +++ b/tools/testing/selftests/kvm/include/perf_test_util.h @@ -30,10 +30,15 @@ struct perf_test_vcpu_args { struct perf_test_args { struct kvm_vm *vm; + /* The starting address and size of the guest test region. */ uint64_t gpa; + uint64_t size; uint64_t guest_page_size; int wr_fract; + /* Run vCPUs in L2 instead of L1, if the architecture supports it. */ + bool nested; + struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS]; }; @@ -49,5 +54,9 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract); void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *)); void perf_test_join_vcpu_threads(int vcpus); +void perf_test_guest_code(uint32_t vcpu_id); + +uint64_t perf_test_nested_pages(int nr_vcpus); +void perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus); #endif /* SELFTEST_KVM_PERF_TEST_UTIL_H */ diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index d78f97f502b5..6ce185449259 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -494,6 +494,10 @@ enum pg_level { #define PG_LEVEL_SHIFT(_level) ((_level - 1) * 9 + 12) #define PG_LEVEL_SIZE(_level) (1ull << PG_LEVEL_SHIFT(_level)) +#define PG_SIZE_4K PG_LEVEL_SIZE(PG_LEVEL_4K) +#define PG_SIZE_2M PG_LEVEL_SIZE(PG_LEVEL_2M) +#define PG_SIZE_1G PG_LEVEL_SIZE(PG_LEVEL_1G) + void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level); /* diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h index 3b1794baa97c..cc3604f8f1d3 100644 --- a/tools/testing/selftests/kvm/include/x86_64/vmx.h +++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h @@ -96,6 +96,7 @@ #define VMX_MISC_PREEMPTION_TIMER_RATE_MASK 0x0000001f #define VMX_MISC_SAVE_EFER_LMA 0x00000020 +#define VMX_EPT_VPID_CAP_1G_PAGES 0x00020000 #define VMX_EPT_VPID_CAP_AD_BITS 0x00200000 #define EXIT_REASON_FAILED_VMENTRY 0x80000000 @@ -608,6 +609,7 @@ bool load_vmcs(struct vmx_pages *vmx); bool nested_vmx_supported(void); void nested_vmx_check_supported(void); +bool ept_1g_pages_supported(void); void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, uint64_t nested_paddr, uint64_t paddr); @@ -615,6 +617,8 @@ void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm, uint64_t nested_paddr, uint64_t paddr, uint64_t size); void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm, uint32_t memslot); +void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t addr, uint64_t size); void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm, uint32_t eptp_memslot); void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm); diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index 722df3a28791..b2ff2cee2e51 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -40,7 +40,7 @@ static bool all_vcpu_threads_running; * Continuously write to the first 8 bytes of each page in the * specified region. */ -static void guest_code(uint32_t vcpu_id) +void perf_test_guest_code(uint32_t vcpu_id) { struct perf_test_args *pta = &perf_test_args; struct perf_test_vcpu_args *vcpu_args = &pta->vcpu_args[vcpu_id]; @@ -108,7 +108,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, { struct perf_test_args *pta = &perf_test_args; struct kvm_vm *vm; - uint64_t guest_num_pages; + uint64_t guest_num_pages, slot0_pages = DEFAULT_GUEST_PHY_PAGES; uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src); int i; @@ -134,13 +134,20 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, "Guest memory cannot be evenly divided into %d slots.", slots); + /* + * If using nested, allocate extra pages for the nested page tables and + * in-memory data structures. + */ + if (pta->nested) + slot0_pages += perf_test_nested_pages(vcpus); + /* * Pass guest_num_pages to populate the page tables for test memory. * The memory is also added to memslot 0, but that's a benign side * effect as KVM allows aliasing HVAs in meslots. */ - vm = vm_create_with_vcpus(mode, vcpus, DEFAULT_GUEST_PHY_PAGES, - guest_num_pages, 0, guest_code, NULL); + vm = vm_create_with_vcpus(mode, vcpus, slot0_pages, guest_num_pages, 0, + perf_test_guest_code, NULL); pta->vm = vm; @@ -161,7 +168,9 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, /* Align to 1M (segment size) */ pta->gpa = align_down(pta->gpa, 1 << 20); #endif - pr_info("guest physical test memory offset: 0x%lx\n", pta->gpa); + pta->size = guest_num_pages * pta->guest_page_size; + pr_info("guest physical test memory: [0x%lx, 0x%lx)\n", + pta->gpa, pta->gpa + pta->size); /* Add extra memory slots for testing */ for (i = 0; i < slots; i++) { @@ -178,6 +187,11 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, perf_test_setup_vcpus(vm, vcpus, vcpu_memory_bytes, partition_vcpu_memory_access); + if (pta->nested) { + pr_info("Configuring vCPUs to run in L2 (nested).\n"); + perf_test_setup_nested(vm, vcpus); + } + ucall_init(vm, NULL); /* Export the shared variables to the guest. */ @@ -198,6 +212,17 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract) sync_global_to_guest(vm, perf_test_args); } +uint64_t __weak perf_test_nested_pages(int nr_vcpus) +{ + return 0; +} + +void __weak perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus) +{ + pr_info("%s() not support on this architecture, skipping.\n", __func__); + exit(KSFT_SKIP); +} + static void *vcpu_thread_main(void *data) { struct vcpu_thread *vcpu = data; diff --git a/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c b/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c new file mode 100644 index 000000000000..e258524435a0 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/x86_64/perf_test_util.c @@ -0,0 +1,112 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * x86_64-specific extensions to perf_test_util.c. + * + * Copyright (C) 2022, Google, Inc. + */ +#include +#include +#include +#include + +#include "test_util.h" +#include "kvm_util.h" +#include "perf_test_util.h" +#include "../kvm_util_internal.h" +#include "processor.h" +#include "vmx.h" + +void perf_test_l2_guest_code(uint64_t vcpu_id) +{ + perf_test_guest_code(vcpu_id); + vmcall(); +} + +extern char perf_test_l2_guest_entry[]; +__asm__( +"perf_test_l2_guest_entry:" +" mov (%rsp), %rdi;" +" call perf_test_l2_guest_code;" +" ud2;" +); + +static void perf_test_l1_guest_code(struct vmx_pages *vmx, uint64_t vcpu_id) +{ +#define L2_GUEST_STACK_SIZE 64 + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + unsigned long *rsp; + + GUEST_ASSERT(vmx->vmcs_gpa); + GUEST_ASSERT(prepare_for_vmx_operation(vmx)); + GUEST_ASSERT(load_vmcs(vmx)); + GUEST_ASSERT(ept_1g_pages_supported()); + + rsp = &l2_guest_stack[L2_GUEST_STACK_SIZE - 1]; + *rsp = vcpu_id; + prepare_vmcs(vmx, perf_test_l2_guest_entry, rsp); + + GUEST_ASSERT(!vmlaunch()); + GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL); + GUEST_DONE(); +} + +uint64_t perf_test_nested_pages(int nr_vcpus) +{ + /* + * 513 page tables is enough to identity-map 256 TiB of L2 with 1G + * pages and 4-level paging, plus a few pages per-vCPU for data + * structures such as the VMCS. + */ + return 513 + 10 * nr_vcpus; +} + +void perf_test_setup_ept(struct vmx_pages *vmx, struct kvm_vm *vm) +{ + uint64_t start, end; + + prepare_eptp(vmx, vm, 0); + + /* + * Identity map the first 4G and the test region with 1G pages so that + * KVM can shadow the EPT12 with the maximum huge page size supported + * by the backing source. + */ + nested_identity_map_1g(vmx, vm, 0, 0x100000000ULL); + + start = align_down(perf_test_args.gpa, PG_SIZE_1G); + end = align_up(perf_test_args.gpa + perf_test_args.size, PG_SIZE_1G); + nested_identity_map_1g(vmx, vm, start, end - start); +} + +void perf_test_setup_nested(struct kvm_vm *vm, int nr_vcpus) +{ + struct vmx_pages *vmx, *vmx0 = NULL; + struct kvm_regs regs; + vm_vaddr_t vmx_gva; + int vcpu_id; + + nested_vmx_check_supported(); + + for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) { + vmx = vcpu_alloc_vmx(vm, &vmx_gva); + + if (vcpu_id == 0) { + perf_test_setup_ept(vmx, vm); + vmx0 = vmx; + } else { + /* Share the same EPT table across all vCPUs. */ + vmx->eptp = vmx0->eptp; + vmx->eptp_hva = vmx0->eptp_hva; + vmx->eptp_gpa = vmx0->eptp_gpa; + } + + /* + * Override the vCPU to run perf_test_l1_guest_code() which will + * bounce it into L2 before calling perf_test_guest_code(). + */ + vcpu_regs_get(vm, vcpu_id, ®s); + regs.rip = (unsigned long) perf_test_l1_guest_code; + vcpu_regs_set(vm, vcpu_id, ®s); + vcpu_args_set(vm, vcpu_id, 2, vmx_gva, vcpu_id); + } +} diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c index 5bf169179455..b77a01d0a271 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c @@ -203,6 +203,11 @@ static bool ept_vpid_cap_supported(uint64_t mask) return rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & mask; } +bool ept_1g_pages_supported(void) +{ + return ept_vpid_cap_supported(VMX_EPT_VPID_CAP_1G_PAGES); +} + /* * Initialize the control fields to the most basic settings possible. */ @@ -439,6 +444,9 @@ void __nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm, TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use " "unknown or unsupported guest mode, mode: 0x%x", vm->mode); + TEST_ASSERT((nested_paddr >> 48) == 0, + "Nested physical address 0x%lx requires 5-level paging", + nested_paddr); TEST_ASSERT((nested_paddr % page_size) == 0, "Nested physical address not on page boundary,\n" " nested_paddr: 0x%lx page_size: 0x%lx", @@ -547,6 +555,13 @@ void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm, } } +/* Identity map a region with 1GiB Pages. */ +void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm, + uint64_t addr, uint64_t size) +{ + __nested_map(vmx, vm, addr, addr, size, PG_LEVEL_1G); +} + void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm, uint32_t eptp_memslot) { -- cgit From e0f3f46e42064a51573914766897b4ab95d943e3 Mon Sep 17 00:00:00 2001 From: David Matlack Date: Fri, 20 May 2022 23:32:49 +0000 Subject: KVM: selftests: Restrict test region to 48-bit physical addresses when using nested The selftests nested code only supports 4-level paging at the moment. This means it cannot map nested guest physical addresses with more than 48 bits. Allow perf_test_util nested mode to work on hosts with more than 48 physical addresses by restricting the guest test region to 48-bits. While here, opportunistically fix an off-by-one error when dealing with vm_get_max_gfn(). perf_test_util.c was treating this as the maximum number of GFNs, rather than the maximum allowed GFN. This didn't result in any correctness issues, but it did end up shifting the test region down slightly when using huge pages. Suggested-by: Sean Christopherson Signed-off-by: David Matlack Message-Id: <20220520233249.3776001-12-dmatlack@google.com> Signed-off-by: Paolo Bonzini --- tools/testing/selftests/kvm/lib/perf_test_util.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index b2ff2cee2e51..f989ff91f022 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -110,6 +110,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, struct kvm_vm *vm; uint64_t guest_num_pages, slot0_pages = DEFAULT_GUEST_PHY_PAGES; uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src); + uint64_t region_end_gfn; int i; pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode)); @@ -151,18 +152,29 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus, pta->vm = vm; + /* Put the test region at the top guest physical memory. */ + region_end_gfn = vm_get_max_gfn(vm) + 1; + +#ifdef __x86_64__ + /* + * When running vCPUs in L2, restrict the test region to 48 bits to + * avoid needing 5-level page tables to identity map L2. + */ + if (pta->nested) + region_end_gfn = min(region_end_gfn, (1UL << 48) / pta->guest_page_size); +#endif /* * If there should be more memory in the guest test region than there * can be pages in the guest, it will definitely cause problems. */ - TEST_ASSERT(guest_num_pages < vm_get_max_gfn(vm), + TEST_ASSERT(guest_num_pages < region_end_gfn, "Requested more guest memory than address space allows.\n" " guest pages: %" PRIx64 " max gfn: %" PRIx64 " vcpus: %d wss: %" PRIx64 "]\n", - guest_num_pages, vm_get_max_gfn(vm), vcpus, + guest_num_pages, region_end_gfn - 1, vcpus, vcpu_memory_bytes); - pta->gpa = (vm_get_max_gfn(vm) - guest_num_pages) * pta->guest_page_size; + pta->gpa = (region_end_gfn - guest_num_pages) * pta->guest_page_size; pta->gpa = align_down(pta->gpa, backing_src_pagesz); #ifdef __s390x__ /* Align to 1M (segment size) */ -- cgit From c3238d36c3a2be0a29a9d848d6c51e1b14be6692 Mon Sep 17 00:00:00 2001 From: Grzegorz Szczurek Date: Fri, 29 Apr 2022 14:27:08 +0200 Subject: i40e: Fix adding ADQ filter to TC0 Procedure of configure tc flower filters erroneously allows to create filters on TC0 where unfiltered packets are also directed by default. Issue was caused by insufficient checks of hw_tc parameter specifying the hardware traffic class to pass matching packets to. Fix checking hw_tc parameter which blocks creation of filters on TC0. Fixes: 2f4b411a3d67 ("i40e: Enable cloud filters via tc-flower") Signed-off-by: Grzegorz Szczurek Signed-off-by: Jedrzej Jagielski Tested-by: Bharathi Sreenivas Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 332a608dbaa6..72576bb3e94d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -8542,6 +8542,11 @@ static int i40e_configure_clsflower(struct i40e_vsi *vsi, return -EOPNOTSUPP; } + if (!tc) { + dev_err(&pf->pdev->dev, "Unable to add filter because of invalid destination"); + return -EINVAL; + } + if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) || test_bit(__I40E_RESET_INTR_RECEIVED, pf->state)) return -EBUSY; -- cgit From 0bb050670ac90a167ecfa3f9590f92966c9a3677 Mon Sep 17 00:00:00 2001 From: Grzegorz Szczurek Date: Fri, 29 Apr 2022 14:40:23 +0200 Subject: i40e: Fix calculating the number of queue pairs If ADQ is enabled for a VF, then actual number of queue pair is a number of currently available traffic classes for this VF. Without this change the configuration of the Rx/Tx queues fails with error. Fixes: d29e0d233e0d ("i40e: missing input validation on VF message handling by the PF") Signed-off-by: Grzegorz Szczurek Signed-off-by: Jedrzej Jagielski Tested-by: Bharathi Sreenivas Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index 2606e8f0f19b..033ea71763e3 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2282,7 +2282,7 @@ static int i40e_vc_config_queues_msg(struct i40e_vf *vf, u8 *msg) } if (vf->adq_enabled) { - for (i = 0; i < I40E_MAX_VF_VSI; i++) + for (i = 0; i < vf->num_tc; i++) num_qps_all += vf->ch[i].num_qps; if (num_qps_all != qci->num_queue_pairs) { aq_ret = I40E_ERR_PARAM; -- cgit From fd5855e6b1358e816710afee68a1d2bc685176ca Mon Sep 17 00:00:00 2001 From: Aleksandr Loktionov Date: Thu, 19 May 2022 16:01:45 +0200 Subject: i40e: Fix call trace in setup_tx_descriptors After PF reset and ethtool -t there was call trace in dmesg sometimes leading to panic. When there was some time, around 5 seconds, between reset and test there were no errors. Problem was that pf reset calls i40e_vsi_close in prep_for_reset and ethtool -t calls i40e_vsi_close in diag_test. If there was not enough time between those commands the second i40e_vsi_close starts before previous i40e_vsi_close was done which leads to crash. Add check to diag_test if pf is in reset and don't start offline tests if it is true. Add netif_info("testing failed") into unhappy path of i40e_diag_test() Fixes: e17bc411aea8 ("i40e: Disable offline diagnostics if VFs are enabled") Fixes: 510efb2682b3 ("i40e: Fix ethtool offline diagnostic with netqueues") Signed-off-by: Michal Jaron Signed-off-by: Aleksandr Loktionov Tested-by: Gurucharan (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 610f00cbaff9..19704f5c8291 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -2586,15 +2586,16 @@ static void i40e_diag_test(struct net_device *netdev, set_bit(__I40E_TESTING, pf->state); + if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) || + test_bit(__I40E_RESET_INTR_RECEIVED, pf->state)) { + dev_warn(&pf->pdev->dev, + "Cannot start offline testing when PF is in reset state.\n"); + goto skip_ol_tests; + } + if (i40e_active_vfs(pf) || i40e_active_vmdqs(pf)) { dev_warn(&pf->pdev->dev, "Please take active VFs and Netqueues offline and restart the adapter before running NIC diagnostics\n"); - data[I40E_ETH_TEST_REG] = 1; - data[I40E_ETH_TEST_EEPROM] = 1; - data[I40E_ETH_TEST_INTR] = 1; - data[I40E_ETH_TEST_LINK] = 1; - eth_test->flags |= ETH_TEST_FL_FAILED; - clear_bit(__I40E_TESTING, pf->state); goto skip_ol_tests; } @@ -2641,9 +2642,17 @@ static void i40e_diag_test(struct net_device *netdev, data[I40E_ETH_TEST_INTR] = 0; } -skip_ol_tests: - netif_info(pf, drv, netdev, "testing finished\n"); + return; + +skip_ol_tests: + data[I40E_ETH_TEST_REG] = 1; + data[I40E_ETH_TEST_EEPROM] = 1; + data[I40E_ETH_TEST_INTR] = 1; + data[I40E_ETH_TEST_LINK] = 1; + eth_test->flags |= ETH_TEST_FL_FAILED; + clear_bit(__I40E_TESTING, pf->state); + netif_info(pf, drv, netdev, "testing failed\n"); } static void i40e_get_wol(struct net_device *netdev, -- cgit From 645603844270b69175899268be68b871295764fe Mon Sep 17 00:00:00 2001 From: Michal Wilczynski Date: Fri, 20 May 2022 13:19:27 +0200 Subject: iavf: Fix issue with MAC address of VF shown as zero After reinitialization of iavf, ice driver gets VIRTCHNL_OP_ADD_ETH_ADDR message with incorrectly set type of MAC address. Hardware address should have is_primary flag set as true. This way ice driver knows what it has to set as a MAC address. Check if the address is primary in iavf_add_filter function and set flag accordingly. To test set all-zero MAC on a VF. This triggers iavf re-initialization and VIRTCHNL_OP_ADD_ETH_ADDR message gets sent to PF. For example: ip link set dev ens785 vf 0 mac 00:00:00:00:00:00 This triggers re-initialization of iavf. New MAC should be assigned. Now check if MAC is non-zero: ip link show dev ens785 Fixes: a3e839d539e0 ("iavf: Add usage of new virtchnl format to set default MAC") Signed-off-by: Michal Wilczynski Reviewed-by: Maciej Fijalkowski Tested-by: Konrad Jankowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/iavf/iavf_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c index 7dfcf78b57fb..f3ecb3bca33d 100644 --- a/drivers/net/ethernet/intel/iavf/iavf_main.c +++ b/drivers/net/ethernet/intel/iavf/iavf_main.c @@ -984,7 +984,7 @@ struct iavf_mac_filter *iavf_add_filter(struct iavf_adapter *adapter, list_add_tail(&f->list, &adapter->mac_filter_list); f->add = true; f->is_new_mac = true; - f->is_primary = false; + f->is_primary = ether_addr_equal(macaddr, adapter->hw.mac.addr); adapter->aq_required |= IAVF_FLAG_AQ_ADD_MAC_FILTER; } else { f->remove = false; -- cgit From da4288b95baa1c7c9aa8a476f58b37eb238745b0 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Wed, 8 Jun 2022 10:11:00 +0900 Subject: scripts/check-local-export: avoid 'wait $!' for process substitution Bash 4.4, released in 2016, supports 'wait $!' to check the exit status of a process substitution, but it seems too new. Some people using older bash versions (on CentOS 7, Ubuntu 16.04, etc.) reported an error like this: ./scripts/check-local-export: line 54: wait: pid 17328 is not a child of this shell I used the process substitution to avoid a pipeline, which executes each command in a subshell. If the while-loop is executed in the subshell context, variable changes within are lost after the subshell terminates. Fortunately, Bash 4.2, released in 2011, supports the 'lastpipe' option, which makes the last element of a pipeline run in the current shell process. Switch to the pipeline with 'lastpipe' solution, and also set 'pipefail' to catch errors from ${NM}. Add the bash requirement to Documentation/process/changes.rst. Fixes: 31cb50b5590f ("kbuild: check static EXPORT_SYMBOL* by script instead of modpost") Reported-by: Tetsuo Handa Reported-by: Michael Ellerman Reported-by: Wang Yugui Tested-by: Tetsuo Handa Tested-by: Jon Hunter Acked-by: Nick Desaulniers Tested-by: Sedat Dilek # LLVM-14 (x86-64) Signed-off-by: Masahiro Yamada --- Documentation/process/changes.rst | 12 ++++++++++++ scripts/check-local-export | 36 +++++++++++++++++++++--------------- 2 files changed, 33 insertions(+), 15 deletions(-) diff --git a/Documentation/process/changes.rst b/Documentation/process/changes.rst index 34415ae1af1b..19c286c23786 100644 --- a/Documentation/process/changes.rst +++ b/Documentation/process/changes.rst @@ -32,6 +32,7 @@ you probably needn't concern yourself with pcmciautils. GNU C 5.1 gcc --version Clang/LLVM (optional) 11.0.0 clang --version GNU make 3.81 make --version +bash 4.2 bash --version binutils 2.23 ld -v flex 2.5.35 flex --version bison 2.0 bison --version @@ -84,6 +85,12 @@ Make You will need GNU make 3.81 or later to build the kernel. +Bash +---- + +Some bash scripts are used for the kernel build. +Bash 4.2 or newer is needed. + Binutils -------- @@ -362,6 +369,11 @@ Make - +Bash +---- + +- + Binutils -------- diff --git a/scripts/check-local-export b/scripts/check-local-export index da745e2743b7..6ccc2f467416 100755 --- a/scripts/check-local-export +++ b/scripts/check-local-export @@ -8,11 +8,31 @@ set -e +# catch errors from ${NM} +set -o pipefail + +# Run the last element of a pipeline in the current shell. +# Without this, the while-loop would be executed in a subshell, and +# the changes made to 'symbol_types' and 'export_symbols' would be lost. +shopt -s lastpipe + declare -A symbol_types declare -a export_symbols exit_code=0 +# If there is no symbol in the object, ${NM} (both GNU nm and llvm-nm) shows +# 'no symbols' diagnostic (but exits with 0). It is harmless and hidden by +# '2>/dev/null'. However, it suppresses real error messages as well. Add a +# hand-crafted error message here. +# +# TODO: +# Use --quiet instead of 2>/dev/null when we upgrade the minimum version of +# binutils to 2.37, llvm to 13.0.0. +# Then, the following line will be really simple: +# ${NM} --quiet ${1} | + +{ ${NM} ${1} 2>/dev/null || { echo "${0}: ${NM} failed" >&2; false; } } | while read value type name do # Skip the line if the number of fields is less than 3. @@ -37,21 +57,7 @@ do if [[ ${name} == __ksymtab_* ]]; then export_symbols+=(${name#__ksymtab_}) fi - - # If there is no symbol in the object, ${NM} (both GNU nm and llvm-nm) - # shows 'no symbols' diagnostic (but exits with 0). It is harmless and - # hidden by '2>/dev/null'. However, it suppresses real error messages - # as well. Add a hand-crafted error message here. - # - # Use --quiet instead of 2>/dev/null when we upgrade the minimum version - # of binutils to 2.37, llvm to 13.0.0. - # - # Then, the following line will be really simple: - # done < <(${NM} --quiet ${1}) -done < <(${NM} ${1} 2>/dev/null || { echo "${0}: ${NM} failed" >&2; false; } ) - -# Catch error in the process substitution -wait $! +done for name in "${export_symbols[@]}" do -- cgit From 6bf74cddcffac0bc5ee0fad724aac778d2e53f75 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 7 Jun 2022 15:45:53 -0400 Subject: filemap: Don't release a locked folio We must hold a reference over the call to filemap_release_folio(), otherwise the page cache will put the last reference to the folio before we unlock it, leading to splats like this: BUG: Bad page state in process u8:5 pfn:1ab1f4 page:ffffea0006ac7d00 refcount:0 mapcount:0 mapping:0000000000000000 index:0x28b1de pfn:0x1ab1f4 flags: 0x17ff80000040001(locked|reclaim|node=0|zone=2|lastcpupid=0xfff) raw: 017ff80000040001 dead000000000100 dead000000000122 0000000000000000 raw: 000000000028b1de 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set It's an error path, so it doesn't see much testing. Reported-by: Darrick J. Wong Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()") Signed-off-by: Matthew Wilcox (Oracle) --- mm/readahead.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/readahead.c b/mm/readahead.c index 415c39d764ea..57a015108254 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -164,12 +164,14 @@ static void read_pages(struct readahead_control *rac) while ((folio = readahead_folio(rac)) != NULL) { unsigned long nr = folio_nr_pages(folio); + folio_get(folio); rac->ra->size -= nr; if (rac->ra->async_size >= nr) { rac->ra->async_size -= nr; filemap_remove_folio(folio); } folio_unlock(folio); + folio_put(folio); } } else { while ((folio = readahead_folio(rac)) != NULL) -- cgit From dcfa24ba68991ab69a48254a18377b45180ae664 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 25 May 2022 14:23:45 -0400 Subject: filemap: Cache the value of vm_flags After we have unlocked the mmap_lock for I/O, the file is pinned, but the VMA is not. Checking this flag after that can be a use-after-free. It's not a terribly interesting use-after-free as it can only read one bit, and it's used to decide whether to read 2MB or 4MB. But it upsets the automated tools and it's generally bad practice anyway, so let's fix it. Reported-by: syzbot+5b96d55e5b54924c77ad@syzkaller.appspotmail.com Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) --- mm/filemap.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 9daeaab36081..ac3775c1ce4c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2991,11 +2991,12 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) struct address_space *mapping = file->f_mapping; DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff); struct file *fpin = NULL; + unsigned long vm_flags = vmf->vma->vm_flags; unsigned int mmap_miss; #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* Use the readahead code, even if readahead is disabled */ - if (vmf->vma->vm_flags & VM_HUGEPAGE) { + if (vm_flags & VM_HUGEPAGE) { fpin = maybe_unlock_mmap_for_io(vmf, fpin); ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1); ra->size = HPAGE_PMD_NR; @@ -3003,7 +3004,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) * Fetch two PMD folios, so we get the chance to actually * readahead, unless we've been told not to. */ - if (!(vmf->vma->vm_flags & VM_RAND_READ)) + if (!(vm_flags & VM_RAND_READ)) ra->size *= 2; ra->async_size = HPAGE_PMD_NR; page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER); @@ -3012,12 +3013,12 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf) #endif /* If we don't want any read-ahead, don't bother */ - if (vmf->vma->vm_flags & VM_RAND_READ) + if (vm_flags & VM_RAND_READ) return fpin; if (!ra->ra_pages) return fpin; - if (vmf->vma->vm_flags & VM_SEQ_READ) { + if (vm_flags & VM_SEQ_READ) { fpin = maybe_unlock_mmap_for_io(vmf, fpin); page_cache_sync_ra(&ractl, ra->ra_pages); return fpin; -- cgit From 69a37a8ba1b408a1c7616494aa7018e4b3844cbe Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 8 Jun 2022 15:18:34 -0400 Subject: mm/huge_memory: Fix xarray node memory leak If xas_split_alloc() fails to allocate the necessary nodes to complete the xarray entry split, it sets the xa_state to -ENOMEM, which xas_nomem() then interprets as "Please allocate more memory", not as "Please free any unnecessary memory" (which was the intended outcome). It's confusing to use xas_nomem() to free memory in this context, so call xas_destroy() instead. Reported-by: syzbot+9e27a75a8c24f3fe75c1@syzkaller.appspotmail.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/xarray.h | 1 + lib/xarray.c | 5 +++-- mm/huge_memory.c | 3 +-- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 72feab5ea8d4..c29e11b2c073 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -1508,6 +1508,7 @@ void *xas_find_marked(struct xa_state *, unsigned long max, xa_mark_t); void xas_init_marks(const struct xa_state *); bool xas_nomem(struct xa_state *, gfp_t); +void xas_destroy(struct xa_state *); void xas_pause(struct xa_state *); void xas_create_range(struct xa_state *); diff --git a/lib/xarray.c b/lib/xarray.c index 54e646e8e6ee..ea9ce1f0b386 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -264,9 +264,10 @@ static void xa_node_free(struct xa_node *node) * xas_destroy() - Free any resources allocated during the XArray operation. * @xas: XArray operation state. * - * This function is now internal-only. + * Most users will not need to call this function; it is called for you + * by xas_nomem(). */ -static void xas_destroy(struct xa_state *xas) +void xas_destroy(struct xa_state *xas) { struct xa_node *next, *node = xas->xa_alloc; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a77c78a2b6b5..f7248002dad9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2672,8 +2672,7 @@ out_unlock: if (mapping) i_mmap_unlock_read(mapping); out: - /* Free any memory we didn't use */ - xas_nomem(&xas, 0); + xas_destroy(&xas); count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED); return ret; } -- cgit From 334f6f53abcf57782bd2fe81da1cbd893e4ef05c Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 9 Jun 2022 09:13:57 -0400 Subject: mm: Add kernel-doc for folio->mlock_count Fix "./include/linux/mm_types.h:279: warning: Function parameter or member 'mlock_count' not described in 'folio'". Also neaten the html by hiding the anon struct. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/mm_types.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index b34ff2cdbc4f..c29ab4c0cd5c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -227,6 +227,7 @@ struct page { * struct folio - Represents a contiguous set of bytes. * @flags: Identical to the page flags. * @lru: Least Recently Used list; tracks how recently this folio was used. + * @mlock_count: Number of times this folio has been pinned by mlock(). * @mapping: The file this page belongs to, or refers to the anon_vma for * anonymous memory. * @index: Offset within the file, in units of pages. For anonymous memory, @@ -255,10 +256,14 @@ struct folio { unsigned long flags; union { struct list_head lru; + /* private: avoid cluttering the output */ struct { void *__filler; + /* public: */ unsigned int mlock_count; + /* private: */ }; + /* public: */ }; struct address_space *mapping; pgoff_t index; -- cgit From 9b29b6b20376ab64e1b043df6301d8a92378e631 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Tue, 7 Jun 2022 09:44:07 +0200 Subject: random: avoid checking crng_ready() twice in random_init() The current flow expands to: if (crng_ready()) ... else if (...) if (!crng_ready()) ... The second crng_ready() call is redundant, but can't so easily be optimized out by the compiler. This commit simplifies that to: if (crng_ready() ... else if (...) ... Fixes: 560181c27b58 ("random: move initialization functions out of hot pages") Cc: stable@vger.kernel.org Cc: Dominik Brodowski Signed-off-by: Jason A. Donenfeld --- drivers/char/random.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index b691b9d59503..4862d4d3ec49 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -801,7 +801,7 @@ int __init random_init(const char *command_line) if (crng_ready()) crng_reseed(); else if (trust_cpu) - credit_init_bits(arch_bytes * 8); + _credit_init_bits(arch_bytes * 8); used_arch_random = arch_bytes * 8 >= POOL_READY_BITS; WARN_ON(register_pm_notifier(&pm_notifier)); -- cgit From 39e0f991a62ed5efabd20711a7b6e7da92603170 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Tue, 7 Jun 2022 17:00:16 +0200 Subject: random: mark bootloader randomness code as __init add_bootloader_randomness() and the variables it touches are only used during __init and not after, so mark these as __init. At the same time, unexport this, since it's only called by other __init code that's built-in. Cc: stable@vger.kernel.org Fixes: 428826f5358c ("fdt: add support for rng-seed") Signed-off-by: Jason A. Donenfeld --- drivers/char/random.c | 7 +++---- include/linux/random.h | 2 +- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 4862d4d3ec49..0d6fb3eaf609 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -725,8 +725,8 @@ static void __cold _credit_init_bits(size_t bits) **********************************************************************/ static bool used_arch_random; -static bool trust_cpu __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU); -static bool trust_bootloader __ro_after_init = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER); +static bool trust_cpu __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU); +static bool trust_bootloader __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER); static int __init parse_trust_cpu(char *arg) { return kstrtobool(arg, &trust_cpu); @@ -865,13 +865,12 @@ EXPORT_SYMBOL_GPL(add_hwgenerator_randomness); * Handle random seed passed by bootloader, and credit it if * CONFIG_RANDOM_TRUST_BOOTLOADER is set. */ -void __cold add_bootloader_randomness(const void *buf, size_t len) +void __init add_bootloader_randomness(const void *buf, size_t len) { mix_pool_bytes(buf, len); if (trust_bootloader) credit_init_bits(len * 8); } -EXPORT_SYMBOL_GPL(add_bootloader_randomness); #if IS_ENABLED(CONFIG_VMGENID) static BLOCKING_NOTIFIER_HEAD(vmfork_chain); diff --git a/include/linux/random.h b/include/linux/random.h index fae0c84027fd..223b4bd584e7 100644 --- a/include/linux/random.h +++ b/include/linux/random.h @@ -13,7 +13,7 @@ struct notifier_block; void add_device_randomness(const void *buf, size_t len); -void add_bootloader_randomness(const void *buf, size_t len); +void __init add_bootloader_randomness(const void *buf, size_t len); void add_input_randomness(unsigned int type, unsigned int code, unsigned int value) __latent_entropy; void add_interrupt_randomness(int irq) __latent_entropy; -- cgit From 77fc95f8c0dc9e1f8e620ec14d2fb65028fb7adc Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Tue, 7 Jun 2022 17:04:38 +0200 Subject: random: account for arch randomness in bits Rather than accounting in bytes and multiplying (shifting), we can just account in bits and avoid the shift. The main motivation for this is there are other patches in flux that expand this code a bit, and avoiding the duplication of "* 8" everywhere makes things a bit clearer. Cc: stable@vger.kernel.org Fixes: 12e45a2a6308 ("random: credit architectural init the exact amount") Signed-off-by: Jason A. Donenfeld --- drivers/char/random.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 0d6fb3eaf609..60d701a2516b 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -776,7 +776,7 @@ static struct notifier_block pm_notifier = { .notifier_call = random_pm_notifica int __init random_init(const char *command_line) { ktime_t now = ktime_get_real(); - unsigned int i, arch_bytes; + unsigned int i, arch_bits; unsigned long entropy; #if defined(LATENT_ENTROPY_PLUGIN) @@ -784,12 +784,12 @@ int __init random_init(const char *command_line) _mix_pool_bytes(compiletime_seed, sizeof(compiletime_seed)); #endif - for (i = 0, arch_bytes = BLAKE2S_BLOCK_SIZE; + for (i = 0, arch_bits = BLAKE2S_BLOCK_SIZE * 8; i < BLAKE2S_BLOCK_SIZE; i += sizeof(entropy)) { if (!arch_get_random_seed_long_early(&entropy) && !arch_get_random_long_early(&entropy)) { entropy = random_get_entropy(); - arch_bytes -= sizeof(entropy); + arch_bits -= sizeof(entropy) * 8; } _mix_pool_bytes(&entropy, sizeof(entropy)); } @@ -801,8 +801,8 @@ int __init random_init(const char *command_line) if (crng_ready()) crng_reseed(); else if (trust_cpu) - _credit_init_bits(arch_bytes * 8); - used_arch_random = arch_bytes * 8 >= POOL_READY_BITS; + _credit_init_bits(arch_bits); + used_arch_random = arch_bits >= POOL_READY_BITS; WARN_ON(register_pm_notifier(&pm_notifier)); -- cgit From 60e5b2886b92afa9e7af56bba7f5fa5f057e1e97 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Tue, 7 Jun 2022 17:28:06 +0200 Subject: random: do not use jump labels before they are initialized Stephen reported that a static key warning splat appears during early boot on systems that credit randomness from device trees that contain an "rng-seed" property, because because setup_machine_fdt() is called before jump_label_init() during setup_arch(): static_key_enable_cpuslocked(): static key '0xffffffe51c6fcfc0' used before call to jump_label_init() WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:166 static_key_enable_cpuslocked+0xb0/0xb8 Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0+ #224 44b43e377bfc84bc99bb5ab885ff694984ee09ff pstate: 600001c9 (nZCv dAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : static_key_enable_cpuslocked+0xb0/0xb8 lr : static_key_enable_cpuslocked+0xb0/0xb8 sp : ffffffe51c393cf0 x29: ffffffe51c393cf0 x28: 000000008185054c x27: 00000000f1042f10 x26: 0000000000000000 x25: 00000000f10302b2 x24: 0000002513200000 x23: 0000002513200000 x22: ffffffe51c1c9000 x21: fffffffdfdc00000 x20: ffffffe51c2f0831 x19: ffffffe51c6fcfc0 x18: 00000000ffff1020 x17: 00000000e1e2ac90 x16: 00000000000000e0 x15: ffffffe51b710708 x14: 0000000000000066 x13: 0000000000000018 x12: 0000000000000000 x11: 0000000000000000 x10: 00000000ffffffff x9 : 0000000000000000 x8 : 0000000000000000 x7 : 61632065726f6665 x6 : 6220646573752027 x5 : ffffffe51c641d25 x4 : ffffffe51c13142c x3 : ffff0a00ffffff05 x2 : 40000000ffffe003 x1 : 00000000000001c0 x0 : 0000000000000065 Call trace: static_key_enable_cpuslocked+0xb0/0xb8 static_key_enable+0x2c/0x40 crng_set_ready+0x24/0x30 execute_in_process_context+0x80/0x90 _credit_init_bits+0x100/0x154 add_bootloader_randomness+0x64/0x78 early_init_dt_scan_chosen+0x140/0x184 early_init_dt_scan_nodes+0x28/0x4c early_init_dt_scan+0x40/0x44 setup_machine_fdt+0x7c/0x120 setup_arch+0x74/0x1d8 start_kernel+0x84/0x44c __primary_switched+0xc0/0xc8 ---[ end trace 0000000000000000 ]--- random: crng init done Machine model: Google Lazor (rev1 - 2) with LTE A trivial fix went in to address this on arm64, 73e2d827a501 ("arm64: Initialize jump labels before setup_machine_fdt()"). I wrote patches as well for arm32 and risc-v. But still patches are needed on xtensa, powerpc, arc, and mips. So that's 7 platforms where things aren't quite right. This sort of points to larger issues that might need a larger solution. Instead, this commit just defers setting the static branch until later in the boot process. random_init() is called after jump_label_init() has been called, and so is always a safe place from which to adjust the static branch. Fixes: f5bda35fba61 ("random: use static branch for crng_ready()") Reported-by: Stephen Boyd Reported-by: Phil Elwell Tested-by: Phil Elwell Reviewed-by: Ard Biesheuvel Cc: Catalin Marinas Cc: Russell King Cc: Arnd Bergmann Signed-off-by: Jason A. Donenfeld --- drivers/char/random.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 60d701a2516b..0b78b9c4acf5 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -650,7 +650,8 @@ static void __cold _credit_init_bits(size_t bits) if (orig < POOL_READY_BITS && new >= POOL_READY_BITS) { crng_reseed(); /* Sets crng_init to CRNG_READY under base_crng.lock. */ - execute_in_process_context(crng_set_ready, &set_ready); + if (static_key_initialized) + execute_in_process_context(crng_set_ready, &set_ready); wake_up_interruptible(&crng_init_wait); kill_fasync(&fasync, SIGIO, POLL_IN); pr_notice("crng init done\n"); @@ -798,6 +799,14 @@ int __init random_init(const char *command_line) _mix_pool_bytes(command_line, strlen(command_line)); add_latent_entropy(); + /* + * If we were initialized by the bootloader before jump labels are + * initialized, then we should enable the static branch here, where + * it's guaranteed that jump labels have been initialized. + */ + if (!static_branch_likely(&crng_is_ready) && crng_init >= CRNG_READY) + crng_set_ready(NULL); + if (crng_ready()) crng_reseed(); else if (trust_cpu) -- cgit From 846bb97e131d7938847963cca00657c995b1fce1 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Sun, 5 Jun 2022 18:30:46 +0200 Subject: random: credit cpu and bootloader seeds by default This commit changes the default Kconfig values of RANDOM_TRUST_CPU and RANDOM_TRUST_BOOTLOADER to be Y by default. It does not change any existing configs or change any kernel behavior. The reason for this is several fold. As background, I recently had an email thread with the kernel maintainers of Fedora/RHEL, Debian, Ubuntu, Gentoo, Arch, NixOS, Alpine, SUSE, and Void as recipients. I noted that some distros trust RDRAND, some trust EFI, and some trust both, and I asked why or why not. There wasn't really much of a "debate" but rather an interesting discussion of what the historical reasons have been for this, and it came up that some distros just missed the introduction of the bootloader Kconfig knob, while another didn't want to enable it until there was a boot time switch to turn it off for more concerned users (which has since been added). The result of the rather uneventful discussion is that every major Linux distro enables these two options by default. While I didn't have really too strong of an opinion going into this thread -- and I mostly wanted to learn what the distros' thinking was one way or another -- ultimately I think their choice was a decent enough one for a default option (which can be disabled at boot time). I'll try to summarize the pros and cons: Pros: - The RNG machinery gets initialized super quickly, and there's no messing around with subsequent blocking behavior. - The bootloader mechanism is used by kexec in order for the prior kernel to initialize the RNG of the next kernel, which increases the entropy available to early boot daemons of the next kernel. - Previous objections related to backdoors centered around Dual_EC_DRBG-like kleptographic systems, in which observing some amount of the output stream enables an adversary holding the right key to determine the entire output stream. This used to be a partially justified concern, because RDRAND output was mixed into the output stream in varying ways, some of which may have lacked pre-image resistance (e.g. XOR or an LFSR). But this is no longer the case. Now, all usage of RDRAND and bootloader seeds go through a cryptographic hash function. This means that the CPU would have to compute a hash pre-image, which is not considered to be feasible (otherwise the hash function would be terribly broken). - More generally, if the CPU is backdoored, the RNG is probably not the realistic vector of choice for an attacker. - These CPU or bootloader seeds are far from being the only source of entropy. Rather, there is generally a pretty huge amount of entropy, not all of which is credited, especially on CPUs that support instructions like RDRAND. In other words, assuming RDRAND outputs all zeros, an attacker would *still* have to accurately model every single other entropy source also in use. - The RNG now reseeds itself quite rapidly during boot, starting at 2 seconds, then 4, then 8, then 16, and so forth, so that other sources of entropy get used without much delay. - Paranoid users can set random.trust_{cpu,bootloader}=no in the kernel command line, and paranoid system builders can set the Kconfig options to N, so there's no reduction or restriction of optionality. - It's a practical default. - All the distros have it set this way. Microsoft and Apple trust it too. Bandwagon. Cons: - RDRAND *could* still be backdoored with something like a fixed key or limited space serial number seed or another indexable scheme like that. (However, it's hard to imagine threat models where the CPU is backdoored like this, yet people are still okay making *any* computations with it or connecting it to networks, etc.) - RDRAND *could* be defective, rather than backdoored, and produce garbage that is in one way or another insufficient for crypto. - Suggesting a *reduction* in paranoia, as this commit effectively does, may cause some to question my personal integrity as a "security person". - Bootloader seeds and RDRAND are generally very difficult if not all together impossible to audit. Keep in mind that this doesn't actually change any behavior. This is just a change in the default Kconfig value. The distros already are shipping kernels that set things this way. Ard made an additional argument in [1]: We're at the mercy of firmware and micro-architecture anyway, given that we are also relying on it to ensure that every instruction in the kernel's executable image has been faithfully copied to memory, and that the CPU implements those instructions as documented. So I don't think firmware or ISA bugs related to RNGs deserve special treatment - if they are broken, we should quirk around them like we usually do. So enabling these by default is a step in the right direction IMHO. In [2], Phil pointed out that having this disabled masked a bug that CI otherwise would have caught: A clean 5.15.45 boots cleanly, whereas a downstream kernel shows the static key warning (but it does go on to boot). The significant difference is that our defconfigs set CONFIG_RANDOM_TRUST_BOOTLOADER=y defining that on top of multi_v7_defconfig demonstrates the issue on a clean 5.15.45. Conversely, not setting that option in a downstream kernel build avoids the warning [1] https://lore.kernel.org/lkml/CAMj1kXGi+ieviFjXv9zQBSaGyyzeGW_VpMpTLJK8PJb2QHEQ-w@mail.gmail.com/ [2] https://lore.kernel.org/lkml/c47c42e3-1d56-5859-a6ad-976a1a3381c6@raspberrypi.com/ Cc: Theodore Ts'o Reviewed-by: Ard Biesheuvel Signed-off-by: Jason A. Donenfeld --- drivers/char/Kconfig | 50 +++++++++++++++++++++++++++++++------------------- 1 file changed, 31 insertions(+), 19 deletions(-) diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 69fd31ffb847..0b6c03643ddc 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -429,28 +429,40 @@ config ADI driver include crash and makedumpfile. config RANDOM_TRUST_CPU - bool "Trust the CPU manufacturer to initialize Linux's CRNG" + bool "Initialize RNG using CPU RNG instructions" + default y depends on ARCH_RANDOM - default n help - Assume that CPU manufacturer (e.g., Intel or AMD for RDSEED or - RDRAND, IBM for the S390 and Power PC architectures) is trustworthy - for the purposes of initializing Linux's CRNG. Since this is not - something that can be independently audited, this amounts to trusting - that CPU manufacturer (perhaps with the insistence or mandate - of a Nation State's intelligence or law enforcement agencies) - has not installed a hidden back door to compromise the CPU's - random number generation facilities. This can also be configured - at boot with "random.trust_cpu=on/off". + Initialize the RNG using random numbers supplied by the CPU's + RNG instructions (e.g. RDRAND), if supported and available. These + random numbers are never used directly, but are rather hashed into + the main input pool, and this happens regardless of whether or not + this option is enabled. Instead, this option controls whether the + they are credited and hence can initialize the RNG. Additionally, + other sources of randomness are always used, regardless of this + setting. Enabling this implies trusting that the CPU can supply high + quality and non-backdoored random numbers. + + Say Y here unless you have reason to mistrust your CPU or believe + its RNG facilities may be faulty. This may also be configured at + boot time with "random.trust_cpu=on/off". config RANDOM_TRUST_BOOTLOADER - bool "Trust the bootloader to initialize Linux's CRNG" - help - Some bootloaders can provide entropy to increase the kernel's initial - device randomness. Say Y here to assume the entropy provided by the - booloader is trustworthy so it will be added to the kernel's entropy - pool. Otherwise, say N here so it will be regarded as device input that - only mixes the entropy pool. This can also be configured at boot with - "random.trust_bootloader=on/off". + bool "Initialize RNG using bootloader-supplied seed" + default y + help + Initialize the RNG using a seed supplied by the bootloader or boot + environment (e.g. EFI or a bootloader-generated device tree). This + seed is not used directly, but is rather hashed into the main input + pool, and this happens regardless of whether or not this option is + enabled. Instead, this option controls whether the seed is credited + and hence can initialize the RNG. Additionally, other sources of + randomness are always used, regardless of this setting. Enabling + this implies trusting that the bootloader can supply high quality and + non-backdoored seeds. + + Say Y here unless you have reason to mistrust your bootloader or + believe its RNG facilities may be faulty. This may also be configured + at boot time with "random.trust_bootloader=on/off". endmenu -- cgit From e052a478a7daeca67664f7addd308ff51dd40654 Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Wed, 8 Jun 2022 10:31:25 +0200 Subject: random: remove rng_has_arch_random() With arch randomness being used by every distro and enabled in defconfigs, the distinction between rng_has_arch_random() and rng_is_initialized() is now rather small. In fact, the places where they differ are now places where paranoid users and system builders really don't want arch randomness to be used, in which case we should respect that choice, or places where arch randomness is known to be broken, in which case that choice is all the more important. So this commit just removes the function and its one user. Reviewed-by: Petr Mladek # for vsprintf.c Signed-off-by: Jason A. Donenfeld --- drivers/char/random.c | 13 ------------- include/linux/random.h | 1 - lib/vsprintf.c | 3 +-- 3 files changed, 1 insertion(+), 16 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 0b78b9c4acf5..655e327d425e 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -725,7 +725,6 @@ static void __cold _credit_init_bits(size_t bits) * **********************************************************************/ -static bool used_arch_random; static bool trust_cpu __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_CPU); static bool trust_bootloader __initdata = IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER); static int __init parse_trust_cpu(char *arg) @@ -811,7 +810,6 @@ int __init random_init(const char *command_line) crng_reseed(); else if (trust_cpu) _credit_init_bits(arch_bits); - used_arch_random = arch_bits >= POOL_READY_BITS; WARN_ON(register_pm_notifier(&pm_notifier)); @@ -820,17 +818,6 @@ int __init random_init(const char *command_line) return 0; } -/* - * Returns whether arch randomness has been mixed into the initial - * state of the RNG, regardless of whether or not that randomness - * was credited. Knowing this is only good for a very limited set - * of uses, such as early init printk pointer obfuscation. - */ -bool rng_has_arch_random(void) -{ - return used_arch_random; -} - /* * Add device- or boot-specific data to the input pool to help * initialize it. diff --git a/include/linux/random.h b/include/linux/random.h index 223b4bd584e7..20e389a14e5c 100644 --- a/include/linux/random.h +++ b/include/linux/random.h @@ -74,7 +74,6 @@ static inline unsigned long get_random_canary(void) int __init random_init(const char *command_line); bool rng_is_initialized(void); -bool rng_has_arch_random(void); int wait_for_random_bytes(void); /* Calls wait_for_random_bytes() and then calls get_random_bytes(buf, nbytes). diff --git a/lib/vsprintf.c b/lib/vsprintf.c index fb77f7bfd126..3c1853a9d1c0 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -769,8 +769,7 @@ static inline int __ptr_to_hashval(const void *ptr, unsigned long *hashval_out) static DECLARE_WORK(enable_ptr_key_work, enable_ptr_key_workfn); unsigned long flags; - if (!system_unbound_wq || - (!rng_is_initialized() && !rng_has_arch_random()) || + if (!system_unbound_wq || !rng_is_initialized() || !spin_trylock_irqsave(&filling, flags)) return -EAGAIN; -- cgit From 77006f6edc0e0f58617eb25e53731f78641e820d Mon Sep 17 00:00:00 2001 From: Serge Semin Date: Fri, 10 Jun 2022 13:45:00 +0300 Subject: gpio: dwapb: Don't print error on -EPROBE_DEFER Currently if the APB or Debounce clocks aren't yet ready to be requested the DW GPIO driver will correctly handle that by deferring the probe procedure, but the error is still printed to the system log. It needlessly pollutes the log since there was no real error but a request to postpone the clock request procedure since the clocks subsystem hasn't been fully initialized yet. Let's fix that by using the dev_err_probe method to print the APB/clock request error status. It will correctly handle the deferred probe situation and print the error if it actually happens. Signed-off-by: Serge Semin Reviewed-by: Andy Shevchenko Signed-off-by: Bartosz Golaszewski --- drivers/gpio/gpio-dwapb.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/gpio/gpio-dwapb.c b/drivers/gpio/gpio-dwapb.c index 04afe728e187..c22fcaa44a61 100644 --- a/drivers/gpio/gpio-dwapb.c +++ b/drivers/gpio/gpio-dwapb.c @@ -662,10 +662,9 @@ static int dwapb_get_clks(struct dwapb_gpio *gpio) gpio->clks[1].id = "db"; err = devm_clk_bulk_get_optional(gpio->dev, DWAPB_NR_CLOCKS, gpio->clks); - if (err) { - dev_err(gpio->dev, "Cannot get APB/Debounce clocks\n"); - return err; - } + if (err) + return dev_err_probe(gpio->dev, err, + "Cannot get APB/Debounce clocks\n"); err = clk_bulk_prepare_enable(DWAPB_NR_CLOCKS, gpio->clks); if (err) { -- cgit From a4c934d74e408cb5ea5209b5fdf6184e3e250b1b Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Tue, 24 May 2022 09:15:44 +0200 Subject: platform/mellanox: Spelling s/platfom/platform/ Fix a misspelling of the word "platform". Signed-off-by: Geert Uytterhoeven Acked-by: Michael Shych Link: https://lore.kernel.org/r/9c8edde31e271311b7832d7677fe84aba917da8d.1653376503.git.geert@linux-m68k.org Signed-off-by: Hans de Goede --- drivers/platform/mellanox/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/platform/mellanox/Kconfig b/drivers/platform/mellanox/Kconfig index 72df4b8f4dd8..09c7829e95c4 100644 --- a/drivers/platform/mellanox/Kconfig +++ b/drivers/platform/mellanox/Kconfig @@ -85,7 +85,7 @@ config NVSW_SN2201 depends on I2C depends on REGMAP_I2C help - This driver provides support for the Nvidia SN2201 platfom. + This driver provides support for the Nvidia SN2201 platform. The SN2201 is a highly integrated for one rack unit system with L3 management switches. It has 48 x 1Gbps RJ45 + 4 x 100G QSFP28 ports in a compact 1RU form factor. The system also including a -- cgit From dddf30564054796696bcd4c462b232a5beacf72c Mon Sep 17 00:00:00 2001 From: Mike Snitzer Date: Fri, 10 Jun 2022 15:07:48 -0400 Subject: dm: fix zoned locking imbalance due to needless check in clone_endio After the commit ca522482e3ea ("dm: pass NULL bdev to bio_alloc_clone"), clone_endio() only calls dm_zone_endio() when DM targets remap the clone bio's bdev to something other than the md->disk->part0 default. However, if a DM target (e.g. dm-crypt) stacked ontop of a dm-zoned does not remap the clone bio using bio_set_dev() then dm_zone_endio() is not called at completion of the bios and zone locks are not properly unlocked. This triggers a hang, in dm_zone_map_bio(), when blktests block/004 is run for dm-crypt on zoned block devices. To avoid the hang, simply remove the clone_endio() check that verifies the target remapped the clone bio to a device other than the default. Fixes: ca522482e3ea ("dm: pass NULL bdev to bio_alloc_clone") Reported-by: Shin'ichiro Kawasaki Signed-off-by: Mike Snitzer --- drivers/md/dm.c | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 8b21155d3c4f..d8f16183bf27 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1016,23 +1016,19 @@ static void clone_endio(struct bio *bio) struct dm_io *io = tio->io; struct mapped_device *md = io->md; - if (likely(bio->bi_bdev != md->disk->part0)) { - struct request_queue *q = bdev_get_queue(bio->bi_bdev); - - if (unlikely(error == BLK_STS_TARGET)) { - if (bio_op(bio) == REQ_OP_DISCARD && - !bdev_max_discard_sectors(bio->bi_bdev)) - disable_discard(md); - else if (bio_op(bio) == REQ_OP_WRITE_ZEROES && - !q->limits.max_write_zeroes_sectors) - disable_write_zeroes(md); - } - - if (static_branch_unlikely(&zoned_enabled) && - unlikely(blk_queue_is_zoned(q))) - dm_zone_endio(io, bio); + if (unlikely(error == BLK_STS_TARGET)) { + if (bio_op(bio) == REQ_OP_DISCARD && + !bdev_max_discard_sectors(bio->bi_bdev)) + disable_discard(md); + else if (bio_op(bio) == REQ_OP_WRITE_ZEROES && + !bdev_write_zeroes_sectors(bio->bi_bdev)) + disable_write_zeroes(md); } + if (static_branch_unlikely(&zoned_enabled) && + unlikely(blk_queue_is_zoned(bdev_get_queue(bio->bi_bdev)))) + dm_zone_endio(io, bio); + if (endio) { int r = endio(ti, bio, &error); switch (r) { -- cgit From 102d841055be8e6e4e24d58917ffc04958262c4d Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 19 May 2022 08:40:12 +0100 Subject: afs: Fix some checker issues Remove an unused global variable and make another static as reported by make C=1. Signed-off-by: David Howells cc: linux-afs@lists.infradead.org --- fs/afs/volume.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/afs/volume.c b/fs/afs/volume.c index 94a3d247924b..cc665cef0abe 100644 --- a/fs/afs/volume.c +++ b/fs/afs/volume.c @@ -9,8 +9,7 @@ #include #include "internal.h" -unsigned __read_mostly afs_volume_gc_delay = 10; -unsigned __read_mostly afs_volume_record_life = 60 * 60; +static unsigned __read_mostly afs_volume_record_life = 60 * 60; /* * Insert a volume into a cell. If there's an existing volume record, that is -- cgit From e81fb4198e27925b151aad1450e0fd607d6733f8 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 9 Jun 2022 15:04:01 -0700 Subject: netfs: Further cleanups after struct netfs_inode wrapper introduced Change the signature of netfs helper functions to take a struct netfs_inode pointer rather than a struct inode pointer where appropriate, thereby relieving the need for the network filesystem to convert its internal inode format down to the VFS inode only for netfslib to bounce it back up. For type safety, it's better not to do that (and it's less typing too). Give netfs_write_begin() an extra argument to pass in a pointer to the netfs_inode struct rather than deriving it internally from the file pointer. Note that the ->write_begin() and ->write_end() ops are intended to be replaced in the future by netfslib code that manages this without the need to call in twice for each page. netfs_readpage() and similar are intended to be pointed at directly by the address_space_operations table, so must stick to the signature dictated by the function pointers there. Changes ======= - Updated the kerneldoc comments and documentation [DH]. Signed-off-by: David Howells cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/CAHk-=wgkwKyNmNdKpQkqZ6DnmUL-x9hp0YBnUGjaPFEAdxDTbw@mail.gmail.com/ --- Documentation/filesystems/netfs_library.rst | 9 +++++---- fs/9p/v9fs.h | 2 +- fs/9p/vfs_addr.c | 2 +- fs/9p/vfs_inode.c | 3 ++- fs/afs/dynroot.c | 2 +- fs/afs/inode.c | 2 +- fs/afs/internal.h | 2 +- fs/afs/write.c | 2 +- fs/ceph/addr.c | 3 ++- fs/ceph/cache.h | 2 +- fs/ceph/inode.c | 2 +- fs/cifs/fscache.h | 2 +- fs/netfs/buffered_read.c | 5 +++-- include/linux/netfs.h | 22 +++++++++------------- 14 files changed, 30 insertions(+), 30 deletions(-) diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 3276c3d55142..9332f66373ee 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -79,7 +79,7 @@ To help deal with the per-inode context, a number helper functions are provided. Firstly, a function to perform basic initialisation on a context and set the operations table pointer:: - void netfs_inode_init(struct inode *inode, + void netfs_inode_init(struct netfs_inode *ctx, const struct netfs_request_ops *ops); then a function to cast from the VFS inode structure to the netfs context:: @@ -89,7 +89,7 @@ then a function to cast from the VFS inode structure to the netfs context:: and finally, a function to get the cache cookie pointer from the context attached to an inode (or NULL if fscache is disabled):: - struct fscache_cookie *netfs_i_cookie(struct inode *inode); + struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx); Buffered Read Helpers @@ -136,8 +136,9 @@ Three read helpers are provided:: void netfs_readahead(struct readahead_control *ractl); int netfs_read_folio(struct file *file, - struct folio *folio); - int netfs_write_begin(struct file *file, + struct folio *folio); + int netfs_write_begin(struct netfs_inode *ctx, + struct file *file, struct address_space *mapping, loff_t pos, unsigned int len, diff --git a/fs/9p/v9fs.h b/fs/9p/v9fs.h index 1b219c21d15e..6acabc2e7dc9 100644 --- a/fs/9p/v9fs.h +++ b/fs/9p/v9fs.h @@ -124,7 +124,7 @@ static inline struct v9fs_inode *V9FS_I(const struct inode *inode) static inline struct fscache_cookie *v9fs_inode_cookie(struct v9fs_inode *v9inode) { #ifdef CONFIG_9P_FSCACHE - return netfs_i_cookie(&v9inode->netfs.inode); + return netfs_i_cookie(&v9inode->netfs); #else return NULL; #endif diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c index 90c6c1ba03ab..c004b9a73a92 100644 --- a/fs/9p/vfs_addr.c +++ b/fs/9p/vfs_addr.c @@ -274,7 +274,7 @@ static int v9fs_write_begin(struct file *filp, struct address_space *mapping, * file. We need to do this before we get a lock on the page in case * there's more than one writer competing for the same cache block. */ - retval = netfs_write_begin(filp, mapping, pos, len, &folio, fsdata); + retval = netfs_write_begin(&v9inode->netfs, filp, mapping, pos, len, &folio, fsdata); if (retval < 0) return retval; diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index e660c6348b9d..419d2f3cf2c2 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -252,7 +252,8 @@ void v9fs_free_inode(struct inode *inode) */ static void v9fs_set_netfs_context(struct inode *inode) { - netfs_inode_init(inode, &v9fs_req_ops); + struct v9fs_inode *v9inode = V9FS_I(inode); + netfs_inode_init(&v9inode->netfs, &v9fs_req_ops); } int v9fs_init_inode(struct v9fs_session_info *v9ses, diff --git a/fs/afs/dynroot.c b/fs/afs/dynroot.c index 3a5bbffdf053..d7d9402ff718 100644 --- a/fs/afs/dynroot.c +++ b/fs/afs/dynroot.c @@ -76,7 +76,7 @@ struct inode *afs_iget_pseudo_dir(struct super_block *sb, bool root) /* there shouldn't be an existing inode */ BUG_ON(!(inode->i_state & I_NEW)); - netfs_inode_init(inode, NULL); + netfs_inode_init(&vnode->netfs, NULL); inode->i_size = 0; inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO; if (root) { diff --git a/fs/afs/inode.c b/fs/afs/inode.c index 22811e9eacf5..89630acbc2cc 100644 --- a/fs/afs/inode.c +++ b/fs/afs/inode.c @@ -58,7 +58,7 @@ static noinline void dump_vnode(struct afs_vnode *vnode, struct afs_vnode *paren */ static void afs_set_netfs_context(struct afs_vnode *vnode) { - netfs_inode_init(&vnode->netfs.inode, &afs_req_ops); + netfs_inode_init(&vnode->netfs, &afs_req_ops); } /* diff --git a/fs/afs/internal.h b/fs/afs/internal.h index 984b113a9107..a6f25d9e75b5 100644 --- a/fs/afs/internal.h +++ b/fs/afs/internal.h @@ -670,7 +670,7 @@ struct afs_vnode { static inline struct fscache_cookie *afs_vnode_cache(struct afs_vnode *vnode) { #ifdef CONFIG_AFS_FSCACHE - return netfs_i_cookie(&vnode->netfs.inode); + return netfs_i_cookie(&vnode->netfs); #else return NULL; #endif diff --git a/fs/afs/write.c b/fs/afs/write.c index f80a6096d91c..2c885b22de34 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -60,7 +60,7 @@ int afs_write_begin(struct file *file, struct address_space *mapping, * file. We need to do this before we get a lock on the page in case * there's more than one writer competing for the same cache block. */ - ret = netfs_write_begin(file, mapping, pos, len, &folio, fsdata); + ret = netfs_write_begin(&vnode->netfs, file, mapping, pos, len, &folio, fsdata); if (ret < 0) return ret; diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index f5f116ed1b9e..9763e7ea8148 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1322,10 +1322,11 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping, struct page **pagep, void **fsdata) { struct inode *inode = file_inode(file); + struct ceph_inode_info *ci = ceph_inode(inode); struct folio *folio = NULL; int r; - r = netfs_write_begin(file, inode->i_mapping, pos, len, &folio, NULL); + r = netfs_write_begin(&ci->netfs, file, inode->i_mapping, pos, len, &folio, NULL); if (r == 0) folio_wait_fscache(folio); if (r < 0) { diff --git a/fs/ceph/cache.h b/fs/ceph/cache.h index 26c6ae06e2f4..dc502daac49a 100644 --- a/fs/ceph/cache.h +++ b/fs/ceph/cache.h @@ -28,7 +28,7 @@ void ceph_fscache_invalidate(struct inode *inode, bool dio_write); static inline struct fscache_cookie *ceph_fscache_cookie(struct ceph_inode_info *ci) { - return netfs_i_cookie(&ci->netfs.inode); + return netfs_i_cookie(&ci->netfs); } static inline void ceph_fscache_resize(struct inode *inode, loff_t to) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 650746b3ba99..56c53ab3618e 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -460,7 +460,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) dout("alloc_inode %p\n", &ci->netfs.inode); /* Set parameters for the netfs library */ - netfs_inode_init(&ci->netfs.inode, &ceph_netfs_ops); + netfs_inode_init(&ci->netfs, &ceph_netfs_ops); spin_lock_init(&ci->i_ceph_lock); diff --git a/fs/cifs/fscache.h b/fs/cifs/fscache.h index ab9a51d0125c..aa3b941a5555 100644 --- a/fs/cifs/fscache.h +++ b/fs/cifs/fscache.h @@ -61,7 +61,7 @@ void cifs_fscache_fill_coherency(struct inode *inode, static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { - return netfs_i_cookie(inode); + return netfs_i_cookie(&CIFS_I(inode)->netfs); } static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c index d37e012386f3..42f892c5712e 100644 --- a/fs/netfs/buffered_read.c +++ b/fs/netfs/buffered_read.c @@ -297,6 +297,7 @@ zero_out: /** * netfs_write_begin - Helper to prepare for writing + * @ctx: The netfs context * @file: The file to read from * @mapping: The mapping to read from * @pos: File position at which the write will begin @@ -326,12 +327,12 @@ zero_out: * * This is usable whether or not caching is enabled. */ -int netfs_write_begin(struct file *file, struct address_space *mapping, +int netfs_write_begin(struct netfs_inode *ctx, + struct file *file, struct address_space *mapping, loff_t pos, unsigned int len, struct folio **_folio, void **_fsdata) { struct netfs_io_request *rreq; - struct netfs_inode *ctx = netfs_inode(file_inode(file )); struct folio *folio; unsigned int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE; pgoff_t index = pos >> PAGE_SHIFT; diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 6dbb4c9ce50d..a62739f3726b 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -277,7 +277,8 @@ struct netfs_cache_ops { struct readahead_control; extern void netfs_readahead(struct readahead_control *); int netfs_read_folio(struct file *, struct folio *); -extern int netfs_write_begin(struct file *, struct address_space *, +extern int netfs_write_begin(struct netfs_inode *, + struct file *, struct address_space *, loff_t, unsigned int, struct folio **, void **); @@ -302,19 +303,17 @@ static inline struct netfs_inode *netfs_inode(struct inode *inode) /** * netfs_inode_init - Initialise a netfslib inode context - * @inode: The inode with which the context is associated + * @inode: The netfs inode to initialise * @ops: The netfs's operations list * * Initialise the netfs library context struct. This is expected to follow on * directly from the VFS inode struct. */ -static inline void netfs_inode_init(struct inode *inode, +static inline void netfs_inode_init(struct netfs_inode *ctx, const struct netfs_request_ops *ops) { - struct netfs_inode *ctx = netfs_inode(inode); - ctx->ops = ops; - ctx->remote_i_size = i_size_read(inode); + ctx->remote_i_size = i_size_read(&ctx->inode); #if IS_ENABLED(CONFIG_FSCACHE) ctx->cache = NULL; #endif @@ -322,28 +321,25 @@ static inline void netfs_inode_init(struct inode *inode, /** * netfs_resize_file - Note that a file got resized - * @inode: The inode being resized + * @ctx: The netfs inode being resized * @new_i_size: The new file size * * Inform the netfs lib that a file got resized so that it can adjust its state. */ -static inline void netfs_resize_file(struct inode *inode, loff_t new_i_size) +static inline void netfs_resize_file(struct netfs_inode *ctx, loff_t new_i_size) { - struct netfs_inode *ctx = netfs_inode(inode); - ctx->remote_i_size = new_i_size; } /** * netfs_i_cookie - Get the cache cookie from the inode - * @inode: The inode to query + * @ctx: The netfs inode to query * * Get the caching cookie (if enabled) from the network filesystem's inode. */ -static inline struct fscache_cookie *netfs_i_cookie(struct inode *inode) +static inline struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx) { #if IS_ENABLED(CONFIG_FSCACHE) - struct netfs_inode *ctx = netfs_inode(inode); return ctx->cache; #else return NULL; -- cgit From 40a81101202300df7db273f77dda9fbe6271b1d2 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 25 Feb 2022 11:19:14 +0000 Subject: netfs: Rename the netfs_io_request cleanup op and give it an op pointer The netfs_io_request cleanup op is now always in a position to be given a pointer to a netfs_io_request struct, so this can be passed in instead of the mapping and private data arguments (both of which are included in the struct). So rename the ->cleanup op to ->free_request (to match ->init_request) and pass in the I/O pointer. Signed-off-by: David Howells Reviewed-by: Jeff Layton cc: linux-cachefs@redhat.com --- Documentation/filesystems/netfs_library.rst | 24 ++++++++++++------------ fs/9p/vfs_addr.c | 11 +++++------ fs/afs/file.c | 6 +++--- fs/ceph/addr.c | 9 ++++----- fs/netfs/objects.c | 6 +++--- include/linux/netfs.h | 3 ++- 6 files changed, 29 insertions(+), 30 deletions(-) diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 9332f66373ee..4d19b19bcc08 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -158,9 +158,10 @@ The helpers manage the read request, calling back into the network filesystem through the suppplied table of operations. Waits will be performed as necessary before returning for helpers that are meant to be synchronous. -If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to -deal with it. If some parts of the request are in progress when an error -occurs, the request will get partially completed if sufficient data is read. +If an error occurs, the ->free_request() will be called to clean up the +netfs_io_request struct allocated. If some parts of the request are in +progress when an error occurs, the request will get partially completed if +sufficient data is read. Additionally, there is:: @@ -208,8 +209,7 @@ The above fields are the ones the netfs can use. They are: * ``netfs_priv`` The network filesystem's private data. The value for this can be passed in - to the helper functions or set during the request. The ->cleanup() op will - be called if this is non-NULL at the end. + to the helper functions or set during the request. * ``start`` * ``len`` @@ -294,6 +294,7 @@ through which it can issue requests and negotiate:: struct netfs_request_ops { void (*init_request)(struct netfs_io_request *rreq, struct file *file); + void (*free_request)(struct netfs_io_request *rreq); int (*begin_cache_operation)(struct netfs_io_request *rreq); void (*expand_readahead)(struct netfs_io_request *rreq); bool (*clamp_length)(struct netfs_io_subrequest *subreq); @@ -302,7 +303,6 @@ through which it can issue requests and negotiate:: int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, struct folio *folio, void **_fsdata); void (*done)(struct netfs_io_request *rreq); - void (*cleanup)(struct address_space *mapping, void *netfs_priv); }; The operations are as follows: @@ -310,7 +310,12 @@ The operations are as follows: * ``init_request()`` [Optional] This is called to initialise the request structure. It is given - the file for reference and can modify the ->netfs_priv value. + the file for reference. + + * ``free_request()`` + + [Optional] This is called as the request is being deallocated so that the + filesystem can clean up any state it has attached there. * ``begin_cache_operation()`` @@ -384,11 +389,6 @@ The operations are as follows: [Optional] This is called after the folios in the request have all been unlocked (and marked uptodate if applicable). - * ``cleanup`` - - [Optional] This is called as the request is being deallocated so that the - filesystem can clean up ->netfs_priv. - Read Helper Procedure diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c index c004b9a73a92..a8f512b44a85 100644 --- a/fs/9p/vfs_addr.c +++ b/fs/9p/vfs_addr.c @@ -66,13 +66,12 @@ static int v9fs_init_request(struct netfs_io_request *rreq, struct file *file) } /** - * v9fs_req_cleanup - Cleanup request initialized by v9fs_init_request - * @mapping: unused mapping of request to cleanup - * @priv: private data to cleanup, a fid, guaranted non-null. + * v9fs_free_request - Cleanup request initialized by v9fs_init_rreq + * @rreq: The I/O request to clean up */ -static void v9fs_req_cleanup(struct address_space *mapping, void *priv) +static void v9fs_free_request(struct netfs_io_request *rreq) { - struct p9_fid *fid = priv; + struct p9_fid *fid = rreq->netfs_priv; p9_client_clunk(fid); } @@ -94,9 +93,9 @@ static int v9fs_begin_cache_operation(struct netfs_io_request *rreq) const struct netfs_request_ops v9fs_req_ops = { .init_request = v9fs_init_request, + .free_request = v9fs_free_request, .begin_cache_operation = v9fs_begin_cache_operation, .issue_read = v9fs_issue_read, - .cleanup = v9fs_req_cleanup, }; /** diff --git a/fs/afs/file.c b/fs/afs/file.c index 4de7af7c2f09..42118a4f3383 100644 --- a/fs/afs/file.c +++ b/fs/afs/file.c @@ -382,17 +382,17 @@ static int afs_check_write_begin(struct file *file, loff_t pos, unsigned len, return test_bit(AFS_VNODE_DELETED, &vnode->flags) ? -ESTALE : 0; } -static void afs_priv_cleanup(struct address_space *mapping, void *netfs_priv) +static void afs_free_request(struct netfs_io_request *rreq) { - key_put(netfs_priv); + key_put(rreq->netfs_priv); } const struct netfs_request_ops afs_req_ops = { .init_request = afs_init_request, + .free_request = afs_free_request, .begin_cache_operation = afs_begin_cache_operation, .check_write_begin = afs_check_write_begin, .issue_read = afs_issue_read, - .cleanup = afs_priv_cleanup, }; int afs_write_inode(struct inode *inode, struct writeback_control *wbc) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 9763e7ea8148..6dee88815491 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -394,11 +394,10 @@ static int ceph_init_request(struct netfs_io_request *rreq, struct file *file) return 0; } -static void ceph_readahead_cleanup(struct address_space *mapping, void *priv) +static void ceph_netfs_free_request(struct netfs_io_request *rreq) { - struct inode *inode = mapping->host; - struct ceph_inode_info *ci = ceph_inode(inode); - int got = (uintptr_t)priv; + struct ceph_inode_info *ci = ceph_inode(rreq->inode); + int got = (uintptr_t)rreq->netfs_priv; if (got) ceph_put_cap_refs(ci, got); @@ -406,12 +405,12 @@ static void ceph_readahead_cleanup(struct address_space *mapping, void *priv) const struct netfs_request_ops ceph_netfs_ops = { .init_request = ceph_init_request, + .free_request = ceph_netfs_free_request, .begin_cache_operation = ceph_begin_cache_operation, .issue_read = ceph_netfs_issue_read, .expand_readahead = ceph_netfs_expand_readahead, .clamp_length = ceph_netfs_clamp_length, .check_write_begin = ceph_netfs_check_write_begin, - .cleanup = ceph_readahead_cleanup, }; #ifdef CONFIG_CEPH_FSCACHE diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c index c6afa605b63b..e17cdf53f6a7 100644 --- a/fs/netfs/objects.c +++ b/fs/netfs/objects.c @@ -75,10 +75,10 @@ static void netfs_free_request(struct work_struct *work) struct netfs_io_request *rreq = container_of(work, struct netfs_io_request, work); - netfs_clear_subrequests(rreq, false); - if (rreq->netfs_priv) - rreq->netfs_ops->cleanup(rreq->mapping, rreq->netfs_priv); trace_netfs_rreq(rreq, netfs_rreq_trace_free); + netfs_clear_subrequests(rreq, false); + if (rreq->netfs_ops->free_request) + rreq->netfs_ops->free_request(rreq); if (rreq->cache_resources.ops) rreq->cache_resources.ops->end_operation(&rreq->cache_resources); kfree(rreq); diff --git a/include/linux/netfs.h b/include/linux/netfs.h index a62739f3726b..097cdd644665 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -206,7 +206,9 @@ struct netfs_io_request { */ struct netfs_request_ops { int (*init_request)(struct netfs_io_request *rreq, struct file *file); + void (*free_request)(struct netfs_io_request *rreq); int (*begin_cache_operation)(struct netfs_io_request *rreq); + void (*expand_readahead)(struct netfs_io_request *rreq); bool (*clamp_length)(struct netfs_io_subrequest *subreq); void (*issue_read)(struct netfs_io_subrequest *subreq); @@ -214,7 +216,6 @@ struct netfs_request_ops { int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, struct folio *folio, void **_fsdata); void (*done)(struct netfs_io_request *rreq); - void (*cleanup)(struct address_space *mapping, void *netfs_priv); }; /* -- cgit From 6c77676645ad42993e0a8bdb8dafa517851a352a Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 9 Jun 2022 09:07:01 +0100 Subject: iov_iter: Fix iter_xarray_get_pages{,_alloc}() The maths at the end of iter_xarray_get_pages() to calculate the actual size doesn't work under some circumstances, such as when it's been asked to extract a partial single page. Various terms of the equation cancel out and you end up with actual == offset. The same issue exists in iter_xarray_get_pages_alloc(). Fix these to just use min() to select the lesser amount from between the amount of page content transcribed into the buffer, minus the offset, and the size limit specified. This doesn't appear to have caused a problem yet upstream because network filesystems aren't getting the pages from an xarray iterator, but rather passing it directly to the socket, which just iterates over it. Cachefiles *does* do DIO from one to/from ext4/xfs/btrfs/etc. but it always asks for whole pages to be written or read. Fixes: 7ff5062079ef ("iov_iter: Add ITER_XARRAY") Reported-by: Jeff Layton Signed-off-by: David Howells cc: Alexander Viro cc: Dominique Martinet cc: Mike Marshall cc: Gao Xiang cc: linux-afs@lists.infradead.org cc: v9fs-developer@lists.sourceforge.net cc: devel@lists.orangefs.org cc: linux-erofs@lists.ozlabs.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Signed-off-by: Al Viro --- lib/iov_iter.c | 20 ++++---------------- 1 file changed, 4 insertions(+), 16 deletions(-) diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 6dd5330f7a99..dda6d5f481c1 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1434,7 +1434,7 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i, { unsigned nr, offset; pgoff_t index, count; - size_t size = maxsize, actual; + size_t size = maxsize; loff_t pos; if (!size || !maxpages) @@ -1461,13 +1461,7 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i, if (nr == 0) return 0; - actual = PAGE_SIZE * nr; - actual -= offset; - if (nr == count && size > 0) { - unsigned last_offset = (nr > 1) ? 0 : offset; - actual -= PAGE_SIZE - (last_offset + size); - } - return actual; + return min(nr * PAGE_SIZE - offset, maxsize); } /* must be done on non-empty ITER_IOVEC one */ @@ -1602,7 +1596,7 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i, struct page **p; unsigned nr, offset; pgoff_t index, count; - size_t size = maxsize, actual; + size_t size = maxsize; loff_t pos; if (!size) @@ -1631,13 +1625,7 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i, if (nr == 0) return 0; - actual = PAGE_SIZE * nr; - actual -= offset; - if (nr == count && size > 0) { - unsigned last_offset = (nr > 1) ? 0 : offset; - actual -= PAGE_SIZE - (last_offset + size); - } - return actual; + return min(nr * PAGE_SIZE - offset, maxsize); } ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, -- cgit From b9c29f391f4195a778620f3d2efd0d9746a414fe Mon Sep 17 00:00:00 2001 From: Michael Shych Date: Thu, 2 Jun 2022 17:51:03 +0300 Subject: platform/mellanox: Add static in struct declaration. Fix problem of missing static in struct declaration. Fixes: 662f24826f954 ("platform/mellanox: Add support for new SN2201 system") Reported-by: kernel test robot Signed-off-by: Michael Shych Link: https://lore.kernel.org/r/20220602145103.11859-1-michaelsh@nvidia.com Signed-off-by: Hans de Goede --- drivers/platform/mellanox/nvsw-sn2201.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/platform/mellanox/nvsw-sn2201.c b/drivers/platform/mellanox/nvsw-sn2201.c index 0bcdc7c75007..2923daf63b75 100644 --- a/drivers/platform/mellanox/nvsw-sn2201.c +++ b/drivers/platform/mellanox/nvsw-sn2201.c @@ -326,7 +326,7 @@ static struct resource nvsw_sn2201_lpc_res[] = { }; /* SN2201 I2C platform data. */ -struct mlxreg_core_hotplug_platform_data nvsw_sn2201_i2c_data = { +static struct mlxreg_core_hotplug_platform_data nvsw_sn2201_i2c_data = { .irq = NVSW_SN2201_CPLD_SYSIRQ, }; -- cgit From 66cb3a2d7ad0d0e9af4d3430a4f2a32ffb9ac098 Mon Sep 17 00:00:00 2001 From: David Arcari Date: Thu, 26 May 2022 16:31:40 -0400 Subject: platform/x86/intel: Fix pmt_crashlog array reference The probe function pmt_crashlog_probe() may incorrectly reference the 'priv->entry array' as it uses 'i' to reference the array instead of 'priv->num_entries' as it should. This is similar to the problem that was addressed in pmt_telemetry_probe via commit 2cdfa0c20d58 ("platform/x86/intel: Fix 'rmmod pmt_telemetry' panic"). Cc: "David E. Box" Cc: Hans de Goede Cc: Mark Gross Cc: linux-kernel@vger.kernel.org Signed-off-by: David Arcari Reviewed-by: David E. Box Link: https://lore.kernel.org/r/20220526203140.339120-1-darcari@redhat.com Signed-off-by: Hans de Goede --- drivers/platform/x86/intel/pmt/crashlog.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/platform/x86/intel/pmt/crashlog.c b/drivers/platform/x86/intel/pmt/crashlog.c index 34daf9df168b..ace1239bc0a0 100644 --- a/drivers/platform/x86/intel/pmt/crashlog.c +++ b/drivers/platform/x86/intel/pmt/crashlog.c @@ -282,7 +282,7 @@ static int pmt_crashlog_probe(struct auxiliary_device *auxdev, auxiliary_set_drvdata(auxdev, priv); for (i = 0; i < intel_vsec_dev->num_resources; i++) { - struct intel_pmt_entry *entry = &priv->entry[i].entry; + struct intel_pmt_entry *entry = &priv->entry[priv->num_entries].entry; ret = intel_pmt_dev_create(entry, &pmt_crashlog_ns, intel_vsec_dev, i); if (ret < 0) -- cgit From 552f3b801de6eb062b225a76e140995483a0609c Mon Sep 17 00:00:00 2001 From: George D Sworo Date: Wed, 1 Jun 2022 18:26:17 -0700 Subject: platform/x86/intel: pmc: Support Intel Raptorlake P Add Raptorlake P to the list of the platforms that intel_pmc_core driver supports for pmc_core device. Raptorlake P PCH is based on Alderlake P PCH. Signed-off-by: George D Sworo Reviewed-by: David E. Box Link: https://lore.kernel.org/r/20220602012617.20100-1-george.d.sworo@intel.com Signed-off-by: Hans de Goede --- drivers/platform/x86/intel/pmc/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/platform/x86/intel/pmc/core.c b/drivers/platform/x86/intel/pmc/core.c index edaf22e5ae98..40183bda7894 100644 --- a/drivers/platform/x86/intel/pmc/core.c +++ b/drivers/platform/x86/intel/pmc/core.c @@ -1912,6 +1912,7 @@ static const struct x86_cpu_id intel_pmc_core_ids[] = { X86_MATCH_INTEL_FAM6_MODEL(ROCKETLAKE, &tgl_reg_map), X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_L, &tgl_reg_map), X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE, &adl_reg_map), + X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_P, &tgl_reg_map), {} }; -- cgit From 011881b80ebe773914b59905bce0f5e0ef93e7ba Mon Sep 17 00:00:00 2001 From: Jiasheng Jiang Date: Thu, 26 May 2022 17:03:45 +0800 Subject: platform/x86: barco-p50-gpio: Add check for platform_driver_register As platform_driver_register() could fail, it should be better to deal with the return value in order to maintain the code consisitency. Fixes: 86af1d02d458 ("platform/x86: Support for EC-connected GPIOs for identify LED/button on Barco P50 board") Signed-off-by: Jiasheng Jiang Acked-by: Peter Korsgaard Link: https://lore.kernel.org/r/20220526090345.1444172-1-jiasheng@iscas.ac.cn Signed-off-by: Hans de Goede --- drivers/platform/x86/barco-p50-gpio.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/platform/x86/barco-p50-gpio.c b/drivers/platform/x86/barco-p50-gpio.c index 05534287bc26..8dd672339485 100644 --- a/drivers/platform/x86/barco-p50-gpio.c +++ b/drivers/platform/x86/barco-p50-gpio.c @@ -405,11 +405,14 @@ MODULE_DEVICE_TABLE(dmi, dmi_ids); static int __init p50_module_init(void) { struct resource res = DEFINE_RES_IO(P50_GPIO_IO_PORT_BASE, P50_PORT_CMD + 1); + int ret; if (!dmi_first_match(dmi_ids)) return -ENODEV; - platform_driver_register(&p50_gpio_driver); + ret = platform_driver_register(&p50_gpio_driver); + if (ret) + return ret; gpio_pdev = platform_device_register_simple(DRIVER_NAME, PLATFORM_DEVID_NONE, &res, 1); if (IS_ERR(gpio_pdev)) { -- cgit From 8a041afe3e774bedd3e0a9b96f65e48a1299a595 Mon Sep 17 00:00:00 2001 From: Piotr Chmura Date: Mon, 6 Jun 2022 19:15:13 +0200 Subject: platform/x86: gigabyte-wmi: Add Z690M AORUS ELITE AX DDR4 support Add dmi_system_id of Gigabyte Z690M AORUS ELITE AX DDR4 board. Tested on my PC. Signed-off-by: Piotr Chmura Link: https://lore.kernel.org/r/bd83567e-ebf5-0b31-074b-5f6dc7f7c147@gmail.com Signed-off-by: Hans de Goede --- drivers/platform/x86/gigabyte-wmi.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/platform/x86/gigabyte-wmi.c b/drivers/platform/x86/gigabyte-wmi.c index 1ef606e3ef80..f52f9a52557b 100644 --- a/drivers/platform/x86/gigabyte-wmi.c +++ b/drivers/platform/x86/gigabyte-wmi.c @@ -156,6 +156,7 @@ static const struct dmi_system_id gigabyte_wmi_known_working_platforms[] = { DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("X570 GAMING X"), DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("X570 I AORUS PRO WIFI"), DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("X570 UD"), + DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("Z690M AORUS ELITE AX DDR4"), { } }; -- cgit From c6bc7e8ee90845556a90faf8b043cbefd77b8903 Mon Sep 17 00:00:00 2001 From: August Wikerfors Date: Wed, 8 Jun 2022 23:20:28 +0200 Subject: platform/x86: gigabyte-wmi: Add support for B450M DS3H-CF Tested and works on my system. Signed-off-by: August Wikerfors Link: https://lore.kernel.org/r/20220608212028.28307-1-git@augustwikerfors.se Signed-off-by: Hans de Goede --- drivers/platform/x86/gigabyte-wmi.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/platform/x86/gigabyte-wmi.c b/drivers/platform/x86/gigabyte-wmi.c index f52f9a52557b..497ad2f64a51 100644 --- a/drivers/platform/x86/gigabyte-wmi.c +++ b/drivers/platform/x86/gigabyte-wmi.c @@ -140,6 +140,7 @@ static u8 gigabyte_wmi_detect_sensor_usability(struct wmi_device *wdev) }} static const struct dmi_system_id gigabyte_wmi_known_working_platforms[] = { + DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("B450M DS3H-CF"), DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("B450M S2H V2"), DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("B550 AORUS ELITE AX V2"), DMI_EXACT_MATCH_GIGABYTE_BOARD_NAME("B550 AORUS ELITE"), -- cgit From 4c14d7043fede258957d7b01da0cad2d9fe3a205 Mon Sep 17 00:00:00 2001 From: Shyam Prasad N Date: Mon, 6 Jun 2022 09:52:46 +0000 Subject: cifs: populate empty hostnames for extra channels Currently, the secondary channels of a multichannel session also get hostname populated based on the info in primary channel. However, this will end up with a wrong resolution of hostname to IP address during reconnect. This change fixes this by not populating hostname info for all secondary channels. Fixes: 5112d80c162f ("cifs: populate server_hostname for extra channels") Cc: stable@vger.kernel.org Signed-off-by: Shyam Prasad N Signed-off-by: Steve French --- fs/cifs/connect.c | 4 ++++ fs/cifs/sess.c | 5 ++++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index d46702f5a663..1849e3411487 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -97,6 +97,10 @@ static int reconn_set_ipaddr_from_hostname(struct TCP_Server_Info *server) if (!server->hostname) return -EINVAL; + /* if server hostname isn't populated, there's nothing to do here */ + if (server->hostname[0] == '\0') + return 0; + len = strlen(server->hostname) + 3; unc = kmalloc(len, GFP_KERNEL); diff --git a/fs/cifs/sess.c b/fs/cifs/sess.c index 3b7915af1f62..0bece97547d4 100644 --- a/fs/cifs/sess.c +++ b/fs/cifs/sess.c @@ -301,7 +301,10 @@ cifs_ses_add_channel(struct cifs_sb_info *cifs_sb, struct cifs_ses *ses, /* Auth */ ctx.domainauto = ses->domainAuto; ctx.domainname = ses->domainName; - ctx.server_hostname = ses->server->hostname; + + /* no hostname for extra channels */ + ctx.server_hostname = ""; + ctx.username = ses->user_name; ctx.password = ses->password; ctx.sectype = ses->sectype; -- cgit From eacea844594ff338db06437806707313210d4865 Mon Sep 17 00:00:00 2001 From: Vincent Whitchurch Date: Fri, 10 Jun 2022 17:12:03 +0200 Subject: um: virt-pci: set device ready in probe() Call virtio_device_ready() to make this driver work after commit b4ec69d7e09 ("virtio: harden vring IRQ"), since the driver uses the virtqueues in the probe function. (The virtio core sets the device ready when probe returns.) Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ") Fixes: 68f5d3f3b654 ("um: add PCI over virtio emulation driver") Signed-off-by: Vincent Whitchurch Message-Id: <20220610151203.3492541-1-vincent.whitchurch@axis.com> Signed-off-by: Michael S. Tsirkin Tested-by: Johannes Berg --- arch/um/drivers/virt-pci.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/um/drivers/virt-pci.c b/arch/um/drivers/virt-pci.c index 5c092a9153ea..027847023184 100644 --- a/arch/um/drivers/virt-pci.c +++ b/arch/um/drivers/virt-pci.c @@ -544,6 +544,8 @@ static int um_pci_init_vqs(struct um_pci_device *dev) dev->cmd_vq = vqs[0]; dev->irq_vq = vqs[1]; + virtio_device_ready(dev->vdev); + for (i = 0; i < NUM_IRQ_MSGS; i++) { void *msg = kzalloc(MAX_IRQ_MSG_SIZE, GFP_KERNEL); @@ -587,7 +589,7 @@ static int um_pci_virtio_probe(struct virtio_device *vdev) dev->irq = irq_alloc_desc(numa_node_id()); if (dev->irq < 0) { err = dev->irq; - goto error; + goto err_reset; } um_pci_devices[free].dev = dev; vdev->priv = dev; @@ -604,6 +606,9 @@ static int um_pci_virtio_probe(struct virtio_device *vdev) um_pci_rescan(); return 0; +err_reset: + virtio_reset_device(vdev); + vdev->config->del_vqs(vdev); error: mutex_unlock(&um_pci_mtx); kfree(dev); -- cgit From c349ae5f831cb9817a45e4e36705d2f7d47c7bc3 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 9 Jun 2022 11:17:13 -0400 Subject: Documentation: add description for net.sctp.reconf_enable Describe it in networking/ip-sysctl.rst like other SCTP options. Fixes: c0d8bab6ae51 ("sctp: add get and set sockopt for reconf_enable") Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 04216564a03c..aebf87b19fba 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2925,6 +2925,17 @@ plpmtud_probe_interval - INTEGER Default: 0 +reconf_enable - BOOLEAN + Enable or disable extension of Stream Reconfiguration functionality + specified in RFC6525. This extension provides the ability to "reset" + a stream, and it includes the Parameters of "Outgoing/Incoming SSN + Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams". + + - 1: Enable extension. + - 0: Disable extension. + + Default: 0 + ``/proc/sys/net/core/*`` ======================== -- cgit From e65775fdd389e4f47eb1972ef6372e20c6c2cc05 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 9 Jun 2022 11:17:14 -0400 Subject: Documentation: add description for net.sctp.intl_enable Describe it in networking/ip-sysctl.rst like other SCTP options. We need to document this especially as when using the feature of User Message Interleaving, some socket options also needs to be set. Fixes: 463118c34a35 ("sctp: support sysctl to allow users to use stream interleave") Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index aebf87b19fba..5fe837505d43 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2936,6 +2936,20 @@ reconf_enable - BOOLEAN Default: 0 +intl_enable - BOOLEAN + Enable or disable extension of User Message Interleaving functionality + specified in RFC8260. This extension allows the interleaving of user + messages sent on different streams. With this feature enabled, I-DATA + chunk will replace DATA chunk to carry user messages if also supported + by the peer. Note that to use this feature, one needs to set this option + to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2 + and SCTP_INTERLEAVING_SUPPORTED to 1. + + - 1: Enable extension. + - 0: Disable extension. + + Default: 0 + ``/proc/sys/net/core/*`` ======================== -- cgit From 249eddaf651fda7cb32e9ebae4c6d5904b390d81 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Thu, 9 Jun 2022 11:17:15 -0400 Subject: Documentation: add description for net.sctp.ecn_enable Describe it in networking/ip-sysctl.rst like other SCTP options. Fixes: 2f5268a9249b ("sctp: allow users to set netns ecn flag with sysctl") Signed-off-by: Xin Long Acked-by: Marcelo Ricardo Leitner Signed-off-by: Jakub Kicinski --- Documentation/networking/ip-sysctl.rst | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 5fe837505d43..9f41961d11d5 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2950,6 +2950,18 @@ intl_enable - BOOLEAN Default: 0 +ecn_enable - BOOLEAN + Control use of Explicit Congestion Notification (ECN) by SCTP. + Like in TCP, ECN is used only when both ends of the SCTP connection + indicate support for it. This feature is useful in avoiding losses + due to congestion by allowing supporting routers to signal congestion + before having to drop packets. + + 1: Enable ecn. + 0: Disable ecn. + + Default: 1 + ``/proc/sys/net/core/*`` ======================== -- cgit From 1f7a6cf6b07c74a17343c2559cd5f5018a245961 Mon Sep 17 00:00:00 2001 From: Kuan-Ying Lee Date: Fri, 10 Jun 2022 15:14:57 +0800 Subject: scripts/gdb: change kernel config dumping method MAGIC_START("IKCFG_ST") and MAGIC_END("IKCFG_ED") are moved out from the kernel_config_data variable. Thus, we parse kernel_config_data directly instead of considering offset of MAGIC_START and MAGIC_END. Fixes: 13610aa908dc ("kernel/configs: use .incbin directive to embed config_data.gz") Signed-off-by: Kuan-Ying Lee Signed-off-by: Masahiro Yamada --- scripts/gdb/linux/config.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/scripts/gdb/linux/config.py b/scripts/gdb/linux/config.py index 90e1565b1967..8843ab3cbadd 100644 --- a/scripts/gdb/linux/config.py +++ b/scripts/gdb/linux/config.py @@ -24,9 +24,9 @@ class LxConfigDump(gdb.Command): filename = arg try: - py_config_ptr = gdb.parse_and_eval("kernel_config_data + 8") - py_config_size = gdb.parse_and_eval( - "sizeof(kernel_config_data) - 1 - 8 * 2") + py_config_ptr = gdb.parse_and_eval("&kernel_config_data") + py_config_ptr_end = gdb.parse_and_eval("&kernel_config_data_end") + py_config_size = py_config_ptr_end - py_config_ptr except gdb.error as e: raise gdb.GdbError("Can't find config, enable CONFIG_IKCONFIG?") -- cgit From 17b0128a136d43e5f8f268631f48bc267373ebff Mon Sep 17 00:00:00 2001 From: "Jason A. Donenfeld" Date: Fri, 10 Jun 2022 16:32:02 +0200 Subject: wireguard: selftests: use maximum cpu features and allow rng seeding By forcing the maximum CPU that QEMU has available, we expose additional capabilities, such as the RNDR instruction, which increases test coverage. This then allows the CI to skip the fake seeding step in some cases. Also enable STRICT_KERNEL_RWX to catch issues related to early jump labels when the RNG is initialized at boot. Signed-off-by: Jason A. Donenfeld --- tools/testing/selftests/wireguard/qemu/Makefile | 28 ++++++++++------------ tools/testing/selftests/wireguard/qemu/init.c | 3 +++ .../testing/selftests/wireguard/qemu/kernel.config | 3 +++ 3 files changed, 19 insertions(+), 15 deletions(-) diff --git a/tools/testing/selftests/wireguard/qemu/Makefile b/tools/testing/selftests/wireguard/qemu/Makefile index bca07b93eeb0..7d1b80988d8a 100644 --- a/tools/testing/selftests/wireguard/qemu/Makefile +++ b/tools/testing/selftests/wireguard/qemu/Makefile @@ -64,8 +64,8 @@ QEMU_VPORT_RESULT := virtio-serial-device ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host -machine virt,gic_version=host,accel=kvm else -QEMU_MACHINE := -cpu cortex-a53 -machine virt -CFLAGS += -march=armv8-a -mtune=cortex-a53 +QEMU_MACHINE := -cpu max -machine virt +CFLAGS += -march=armv8-a endif else ifeq ($(ARCH),aarch64_be) CHOST := aarch64_be-linux-musl @@ -76,8 +76,8 @@ QEMU_VPORT_RESULT := virtio-serial-device ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host -machine virt,gic_version=host,accel=kvm else -QEMU_MACHINE := -cpu cortex-a53 -machine virt -CFLAGS += -march=armv8-a -mtune=cortex-a53 +QEMU_MACHINE := -cpu max -machine virt +CFLAGS += -march=armv8-a endif else ifeq ($(ARCH),arm) CHOST := arm-linux-musleabi @@ -88,8 +88,8 @@ QEMU_VPORT_RESULT := virtio-serial-device ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host -machine virt,gic_version=host,accel=kvm else -QEMU_MACHINE := -cpu cortex-a15 -machine virt -CFLAGS += -march=armv7-a -mtune=cortex-a15 -mabi=aapcs-linux +QEMU_MACHINE := -cpu max -machine virt +CFLAGS += -march=armv7-a -mabi=aapcs-linux endif else ifeq ($(ARCH),armeb) CHOST := armeb-linux-musleabi @@ -100,8 +100,8 @@ QEMU_VPORT_RESULT := virtio-serial-device ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host -machine virt,gic_version=host,accel=kvm else -QEMU_MACHINE := -cpu cortex-a15 -machine virt -CFLAGS += -march=armv7-a -mabi=aapcs-linux # We don't pass -mtune=cortex-a15 due to a compiler bug on big endian. +QEMU_MACHINE := -cpu max -machine virt +CFLAGS += -march=armv7-a -mabi=aapcs-linux LDFLAGS += -Wl,--be8 endif else ifeq ($(ARCH),x86_64) @@ -112,8 +112,7 @@ KERNEL_BZIMAGE := $(KERNEL_BUILD_PATH)/arch/x86/boot/bzImage ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host -machine q35,accel=kvm else -QEMU_MACHINE := -cpu Skylake-Server -machine q35 -CFLAGS += -march=skylake-avx512 +QEMU_MACHINE := -cpu max -machine q35 endif else ifeq ($(ARCH),i686) CHOST := i686-linux-musl @@ -123,8 +122,7 @@ KERNEL_BZIMAGE := $(KERNEL_BUILD_PATH)/arch/x86/boot/bzImage ifeq ($(subst x86_64,i686,$(HOST_ARCH)),$(ARCH)) QEMU_MACHINE := -cpu host -machine q35,accel=kvm else -QEMU_MACHINE := -cpu coreduo -machine q35 -CFLAGS += -march=prescott +QEMU_MACHINE := -cpu max -machine q35 endif else ifeq ($(ARCH),mips64) CHOST := mips64-linux-musl @@ -182,7 +180,7 @@ KERNEL_BZIMAGE := $(KERNEL_BUILD_PATH)/vmlinux ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host,accel=kvm -machine pseries else -QEMU_MACHINE := -machine pseries +QEMU_MACHINE := -machine pseries -device spapr-rng,rng=rng -object rng-random,id=rng endif else ifeq ($(ARCH),powerpc64le) CHOST := powerpc64le-linux-musl @@ -192,7 +190,7 @@ KERNEL_BZIMAGE := $(KERNEL_BUILD_PATH)/vmlinux ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host,accel=kvm -machine pseries else -QEMU_MACHINE := -machine pseries +QEMU_MACHINE := -machine pseries -device spapr-rng,rng=rng -object rng-random,id=rng endif else ifeq ($(ARCH),powerpc) CHOST := powerpc-linux-musl @@ -247,7 +245,7 @@ QEMU_VPORT_RESULT := virtio-serial-ccw ifeq ($(HOST_ARCH),$(ARCH)) QEMU_MACHINE := -cpu host,accel=kvm -machine s390-ccw-virtio -append $(KERNEL_CMDLINE) else -QEMU_MACHINE := -machine s390-ccw-virtio -append $(KERNEL_CMDLINE) +QEMU_MACHINE := -cpu max -machine s390-ccw-virtio -append $(KERNEL_CMDLINE) endif else $(error I only build: x86_64, i686, arm, armeb, aarch64, aarch64_be, mips, mipsel, mips64, mips64el, powerpc64, powerpc64le, powerpc, m68k, riscv64, riscv32, s390x) diff --git a/tools/testing/selftests/wireguard/qemu/init.c b/tools/testing/selftests/wireguard/qemu/init.c index 2a0f48fac925..c9e128436546 100644 --- a/tools/testing/selftests/wireguard/qemu/init.c +++ b/tools/testing/selftests/wireguard/qemu/init.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -58,6 +59,8 @@ static void seed_rng(void) { int bits = 256, fd; + if (!getrandom(NULL, 0, GRND_NONBLOCK)) + return; pretty_message("[+] Fake seeding RNG..."); fd = open("/dev/random", O_WRONLY); if (fd < 0) diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config index a9b5a520a1d2..bad88f4b0a03 100644 --- a/tools/testing/selftests/wireguard/qemu/kernel.config +++ b/tools/testing/selftests/wireguard/qemu/kernel.config @@ -31,6 +31,7 @@ CONFIG_TTY=y CONFIG_BINFMT_ELF=y CONFIG_BINFMT_SCRIPT=y CONFIG_VDSO=y +CONFIG_STRICT_KERNEL_RWX=y CONFIG_VIRTUALIZATION=y CONFIG_HYPERVISOR_GUEST=y CONFIG_PARAVIRT=y @@ -65,6 +66,8 @@ CONFIG_PROC_FS=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y +CONFIG_RANDOM_TRUST_CPU=y +CONFIG_RANDOM_TRUST_BOOTLOADER=y CONFIG_CONSOLE_LOGLEVEL_DEFAULT=15 CONFIG_LOG_BUF_SHIFT=18 CONFIG_PRINTK_TIME=y -- cgit From 1c27f1fc1549f0e470429f5497a76ad28a37f21a Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 11 Jun 2022 10:30:20 -0700 Subject: iov_iter: fix build issue due to possible type mis-match Commit 6c77676645ad ("iov_iter: Fix iter_xarray_get_pages{,_alloc}()") introduced a problem on some 32-bit architectures (at least arm, xtensa, csky,sparc and mips), that have a 'size_t' that is 'unsigned int'. The reason is that we now do min(nr * PAGE_SIZE - offset, maxsize); where 'nr' and 'offset' and both 'unsigned int', and PAGE_SIZE is 'unsigned long'. As a result, the normal C type rules means that the first argument to 'min()' ends up being 'unsigned long'. In contrast, 'maxsize' is of type 'size_t'. Now, 'size_t' and 'unsigned long' are always the same physical type in the kernel, so you'd think this doesn't matter, and from an actual arithmetic standpoint it doesn't. But on 32-bit architectures 'size_t' is commonly 'unsigned int', even if it could also be 'unsigned long'. In that situation, both are unsigned 32-bit types, but they are not the *same* type. And as a result 'min()' will complain about the distinct types (ignore the "pointer types" part of the error message: that's an artifact of the way we have made 'min()' check types for being the same): lib/iov_iter.c: In function 'iter_xarray_get_pages': include/linux/minmax.h:20:35: error: comparison of distinct pointer types lacks a cast [-Werror] 20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) | ^~ lib/iov_iter.c:1464:16: note: in expansion of macro 'min' 1464 | return min(nr * PAGE_SIZE - offset, maxsize); | ^~~ This was not visible on 64-bit architectures (where we always define 'size_t' to be 'unsigned long'). Force these cases to use 'min_t(size_t, x, y)' to make the type explicit and avoid the issue. [ Nit-picky note: technically 'size_t' doesn't have to match 'unsigned long' arithmetically. We've certainly historically seen environments with 16-bit address spaces and 32-bit 'unsigned long'. Similarly, even in 64-bit modern environments, 'size_t' could be its own type distinct from 'unsigned long', even if it were arithmetically identical. So the above type commentary is only really descriptive of the kernel environment, not some kind of universal truth for the kinds of wild and crazy situations that are allowed by the C standard ] Reported-by: Sudip Mukherjee Link: https://lore.kernel.org/all/YqRyL2sIqQNDfky2@debian/ Cc: Jeff Layton Cc: David Howells Cc: Al Viro Signed-off-by: Linus Torvalds --- lib/iov_iter.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/iov_iter.c b/lib/iov_iter.c index dda6d5f481c1..0b64695ab632 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1461,7 +1461,7 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i, if (nr == 0) return 0; - return min(nr * PAGE_SIZE - offset, maxsize); + return min_t(size_t, nr * PAGE_SIZE - offset, maxsize); } /* must be done on non-empty ITER_IOVEC one */ @@ -1625,7 +1625,7 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i, if (nr == 0) return 0; - return min(nr * PAGE_SIZE - offset, maxsize); + return min_t(size_t, nr * PAGE_SIZE - offset, maxsize); } ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, -- cgit From 8bee9dd953b69c634d1c9a3241a8b357469ad4aa Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 10 Jun 2022 01:41:10 +0200 Subject: workqueue: Switch to new kerneldoc syntax for named variable macro argument MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The syntax without dots is available since commit 43756e347f21 ("scripts/kernel-doc: Add support for named variable macro arguments"). The same HTML output is produced with and without this patch. Signed-off-by: Jonathan Neuschäfer Acked-by: Tejun Heo Signed-off-by: Tejun Heo --- include/linux/workqueue.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index e1f1c8b1121b..62e75dd40d9a 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -406,7 +406,7 @@ alloc_workqueue(const char *fmt, unsigned int flags, int max_active, ...); * alloc_ordered_workqueue - allocate an ordered workqueue * @fmt: printf format for the name of the workqueue * @flags: WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful) - * @args...: args for @fmt + * @args: args for @fmt * * Allocate an ordered workqueue. An ordered workqueue executes at * most one work item at any given time in the queued order. They are -- cgit From dc6a6ab58379f25bf991d8e4a13b001ed806e881 Mon Sep 17 00:00:00 2001 From: Jorge Lopez Date: Wed, 8 Jun 2022 16:29:23 -0500 Subject: platform/x86: hp-wmi: Resolve WMI query failures on some devices WMI queries fail on some devices where the ACPI method HWMC unconditionally attempts to create Fields beyond the buffer if the buffer is too small, this breaks essential features such as power profiles: CreateByteField (Arg1, 0x10, D008) CreateByteField (Arg1, 0x11, D009) CreateByteField (Arg1, 0x12, D010) CreateDWordField (Arg1, 0x10, D032) CreateField (Arg1, 0x80, 0x0400, D128) In cases where args->data had zero length, ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [D008] at bit offset/length 128/8 exceeds size of target Buffer (128 bits) (20211217/dsopcode-198) was obtained. ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [D009] at bit offset/length 136/8 exceeds size of target Buffer (136bits) (20211217/dsopcode-198) The original code created a buffer size of 128 bytes regardless if the WMI call required a smaller buffer or not. This particular behavior occurs in older BIOS and reproduced in OMEN laptops. Newer BIOS handles buffer sizes properly and meets the latest specification requirements. This is the reason why testing with a dynamically allocated buffer did not uncover any failures with the test systems at hand. This patch was tested on several OMEN, Elite, and Zbooks. It was confirmed the patch resolves HPWMI_FAN GET/SET calls in an OMEN Laptop 15-ek0xxx. No problems were reported when testing on several Elite and Zbooks notebooks. Fixes: 4b4967cbd268 ("platform/x86: hp-wmi: Changing bios_args.data to be dynamically allocated") Signed-off-by: Jorge Lopez Reviewed-by: Andy Shevchenko Link: https://lore.kernel.org/r/20220608212923.8585-2-jorge.lopez2@hp.com Reviewed-by: Hans de Goede Signed-off-by: Hans de Goede --- drivers/platform/x86/hp-wmi.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/platform/x86/hp-wmi.c b/drivers/platform/x86/hp-wmi.c index 667f94bba905..e4a53c3d7028 100644 --- a/drivers/platform/x86/hp-wmi.c +++ b/drivers/platform/x86/hp-wmi.c @@ -290,14 +290,16 @@ static int hp_wmi_perform_query(int query, enum hp_wmi_command command, struct bios_return *bios_return; union acpi_object *obj = NULL; struct bios_args *args = NULL; - int mid, actual_outsize, ret; + int mid, actual_insize, actual_outsize; size_t bios_args_size; + int ret; mid = encode_outsize_for_pvsz(outsize); if (WARN_ON(mid < 0)) return mid; - bios_args_size = struct_size(args, data, insize); + actual_insize = max(insize, 128); + bios_args_size = struct_size(args, data, actual_insize); args = kmalloc(bios_args_size, GFP_KERNEL); if (!args) return -ENOMEM; -- cgit From 65f936f3535950d2643eac5bf34a735a0e428cdd Mon Sep 17 00:00:00 2001 From: Bedant Patnaik Date: Thu, 9 Jun 2022 00:58:43 +0530 Subject: platform/x86: hp-wmi: Use zero insize parameter only when supported commit be9d73e64957 ("platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls") and commit 12b19f14a21a ("platform/x86: hp-wmi: Fix hp_wmi_read_int() reporting error (0x05)") cause ACPI BIOS Error (bug): Attempt to CreateField of length zero (20211217/dsopcode-133) because of the ACPI method HWMC, which unconditionally creates a Field of size (insize*8) bits: CreateField (Arg1, 0x80, (Local5 * 0x08), DAIN) In cases where args->insize = 0, the Field size is 0, resulting in an error. Fix this by using zero insize only if 0x5 error code is returned Tested on Omen 15 AMD (2020) board ID: 8786. Fixes: be9d73e64957 ("platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls") Signed-off-by: Bedant Patnaik Tested-by: Jorge Lopez Link: https://lore.kernel.org/r/41be46743d21c78741232a47bbb5f1cdbcc3d21e.camel@gmail.com Reviewed-by: Hans de Goede Signed-off-by: Hans de Goede --- drivers/platform/x86/hp-wmi.c | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/platform/x86/hp-wmi.c b/drivers/platform/x86/hp-wmi.c index e4a53c3d7028..0d8cb22e30df 100644 --- a/drivers/platform/x86/hp-wmi.c +++ b/drivers/platform/x86/hp-wmi.c @@ -38,6 +38,7 @@ MODULE_ALIAS("wmi:5FB7F034-2C63-45e9-BE91-3D44E2C707E4"); #define HPWMI_EVENT_GUID "95F24279-4D7B-4334-9387-ACCDC67EF61C" #define HPWMI_BIOS_GUID "5FB7F034-2C63-45e9-BE91-3D44E2C707E4" #define HP_OMEN_EC_THERMAL_PROFILE_OFFSET 0x95 +#define zero_if_sup(tmp) (zero_insize_support?0:sizeof(tmp)) // use when zero insize is required /* DMI board names of devices that should use the omen specific path for * thermal profiles. @@ -220,6 +221,7 @@ static struct input_dev *hp_wmi_input_dev; static struct platform_device *hp_wmi_platform_dev; static struct platform_profile_handler platform_profile_handler; static bool platform_profile_support; +static bool zero_insize_support; static struct rfkill *wifi_rfkill; static struct rfkill *bluetooth_rfkill; @@ -376,7 +378,7 @@ static int hp_wmi_read_int(int query) int val = 0, ret; ret = hp_wmi_perform_query(query, HPWMI_READ, &val, - 0, sizeof(val)); + zero_if_sup(val), sizeof(val)); if (ret) return ret < 0 ? ret : -EINVAL; @@ -412,7 +414,8 @@ static int hp_wmi_get_tablet_mode(void) return -ENODEV; ret = hp_wmi_perform_query(HPWMI_SYSTEM_DEVICE_MODE, HPWMI_READ, - system_device_mode, 0, sizeof(system_device_mode)); + system_device_mode, zero_if_sup(system_device_mode), + sizeof(system_device_mode)); if (ret < 0) return ret; @@ -499,7 +502,7 @@ static int hp_wmi_fan_speed_max_get(void) int val = 0, ret; ret = hp_wmi_perform_query(HPWMI_FAN_SPEED_MAX_GET_QUERY, HPWMI_GM, - &val, 0, sizeof(val)); + &val, zero_if_sup(val), sizeof(val)); if (ret) return ret < 0 ? ret : -EINVAL; @@ -511,7 +514,7 @@ static int __init hp_wmi_bios_2008_later(void) { int state = 0; int ret = hp_wmi_perform_query(HPWMI_FEATURE_QUERY, HPWMI_READ, &state, - 0, sizeof(state)); + zero_if_sup(state), sizeof(state)); if (!ret) return 1; @@ -522,7 +525,7 @@ static int __init hp_wmi_bios_2009_later(void) { u8 state[128]; int ret = hp_wmi_perform_query(HPWMI_FEATURE2_QUERY, HPWMI_READ, &state, - 0, sizeof(state)); + zero_if_sup(state), sizeof(state)); if (!ret) return 1; @@ -600,7 +603,7 @@ static int hp_wmi_rfkill2_refresh(void) int err, i; err = hp_wmi_perform_query(HPWMI_WIRELESS2_QUERY, HPWMI_READ, &state, - 0, sizeof(state)); + zero_if_sup(state), sizeof(state)); if (err) return err; @@ -1009,7 +1012,7 @@ static int __init hp_wmi_rfkill2_setup(struct platform_device *device) int err, i; err = hp_wmi_perform_query(HPWMI_WIRELESS2_QUERY, HPWMI_READ, &state, - 0, sizeof(state)); + zero_if_sup(state), sizeof(state)); if (err) return err < 0 ? err : -EINVAL; @@ -1485,11 +1488,15 @@ static int __init hp_wmi_init(void) { int event_capable = wmi_has_guid(HPWMI_EVENT_GUID); int bios_capable = wmi_has_guid(HPWMI_BIOS_GUID); - int err; + int err, tmp = 0; if (!bios_capable && !event_capable) return -ENODEV; + if (hp_wmi_perform_query(HPWMI_HARDWARE_QUERY, HPWMI_READ, &tmp, + sizeof(tmp), sizeof(tmp)) == HPWMI_RET_INVALID_PARAMETERS) + zero_insize_support = true; + if (event_capable) { err = hp_wmi_input_setup(); if (err) -- cgit From d4fe9cc4ff8656704b58cfd9363d7c3c9d65e519 Mon Sep 17 00:00:00 2001 From: Duke Lee Date: Tue, 7 Jun 2022 14:36:54 -0700 Subject: platform/x86/intel: hid: Add Surface Go to VGBS allow list The Surface Go reports Chassis Type 9 (Laptop,) so the device needs to be added to dmi_vgbs_allow_list to enable tablet mode when an attached Type Cover is folded back. BugLink: https://github.com/linux-surface/linux-surface/issues/837 Signed-off-by: Duke Lee Reviewed-by: Andy Shevchenko Link: https://lore.kernel.org/r/20220607213654.5567-1-krnhotwings@gmail.com Reviewed-by: Hans de Goede Signed-off-by: Hans de Goede --- drivers/platform/x86/intel/hid.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/platform/x86/intel/hid.c b/drivers/platform/x86/intel/hid.c index 216d31e3403d..79cff1fc675c 100644 --- a/drivers/platform/x86/intel/hid.c +++ b/drivers/platform/x86/intel/hid.c @@ -122,6 +122,12 @@ static const struct dmi_system_id dmi_vgbs_allow_list[] = { DMI_MATCH(DMI_PRODUCT_NAME, "HP Spectre x360 Convertible 15-df0xxx"), }, }, + { + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "Microsoft Corporation"), + DMI_MATCH(DMI_PRODUCT_NAME, "Surface Go"), + }, + }, { } }; -- cgit From b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sun, 12 Jun 2022 16:11:37 -0700 Subject: Linux 5.19-rc2 --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Makefile b/Makefile index b2e93c1a8021..1a6678d817bd 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ VERSION = 5 PATCHLEVEL = 19 SUBLEVEL = 0 -EXTRAVERSION = -rc1 +EXTRAVERSION = -rc2 NAME = Superb Owl # *DOCUMENTATION* -- cgit From 9eda7d8bcbdb6909f202edeedff51948f1cad1e5 Mon Sep 17 00:00:00 2001 From: Guangbin Huang Date: Sat, 11 Jun 2022 20:25:24 +0800 Subject: net: hns3: set port base vlan tbl_sta to false before removing old vlan When modify port base vlan, the port base vlan tbl_sta needs to set to false before removing old vlan, to indicate this operation is not finish. Fixes: c0f46de30c96 ("net: hns3: fix port base vlan add fail when concurrent with reset") Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 1ebad0e50e6a..fc0265b63331 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -10117,6 +10117,7 @@ static int hclge_modify_port_base_vlan_tag(struct hclge_vport *vport, if (ret) return ret; + vport->port_base_vlan_cfg.tbl_sta = false; /* remove old VLAN tag */ if (old_info->vlan_tag == 0) ret = hclge_set_vf_vlan_common(hdev, vport->vport_id, -- cgit From 283847e3ef6dbf79bf67083b5ce7b8033e8b6f34 Mon Sep 17 00:00:00 2001 From: Jian Shen Date: Sat, 11 Jun 2022 20:25:25 +0800 Subject: net: hns3: don't push link state to VF if unalive It's unnecessary to push link state to unalive VF, and the VF will query link state from PF when it being start works. Fixes: 18b6e31f8bf4 ("net: hns3: PF add support for pushing link status to VFs") Signed-off-by: Jian Shen Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index fc0265b63331..2e891b837c51 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3376,6 +3376,12 @@ static int hclge_set_vf_link_state(struct hnae3_handle *handle, int vf, link_state_old = vport->vf_info.link_state; vport->vf_info.link_state = link_state; + /* return success directly if the VF is unalive, VF will + * query link state itself when it starts work. + */ + if (!test_bit(HCLGE_VPORT_STATE_ALIVE, &vport->state)) + return 0; + ret = hclge_push_vf_link_status(vport); if (ret) { vport->vf_info.link_state = link_state_old; -- cgit From cfd80687a5388e731b3db65ad6a557ede9b45905 Mon Sep 17 00:00:00 2001 From: Jie Wang Date: Sat, 11 Jun 2022 20:25:26 +0800 Subject: net: hns3: modify the ring param print info Currently tx push is also a ring param. So the original ring param print info in hns3_is_ringparam_changed should be adjusted. Fixes: 07fdc163ac88 ("net: hns3: refactor hns3_set_ringparam()") Signed-off-by: Jie Wang Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c index 6d20974519fe..4c7988e308a2 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c @@ -1129,7 +1129,7 @@ hns3_is_ringparam_changed(struct net_device *ndev, if (old_ringparam->tx_desc_num == new_ringparam->tx_desc_num && old_ringparam->rx_desc_num == new_ringparam->rx_desc_num && old_ringparam->rx_buf_len == new_ringparam->rx_buf_len) { - netdev_info(ndev, "ringparam not changed\n"); + netdev_info(ndev, "descriptor number and rx buffer length not changed\n"); return false; } -- cgit From e93530ae0e5d8fcf2d908933d206e0c93bc3c09b Mon Sep 17 00:00:00 2001 From: Guangbin Huang Date: Sat, 11 Jun 2022 20:25:27 +0800 Subject: net: hns3: restore tm priority/qset to default settings when tc disabled Currently, settings parameters of schedule mode, dwrr, shaper of tm priority or qset of one tc are only be set when tc is enabled, they are not restored to the default settings when tc is disabled. It confuses users when they cat tm_priority or tm_qset files of debugfs. So this patch fixes it. Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver") Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 95 +++++++++++++++------- 2 files changed, 65 insertions(+), 31 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 8a3a446219f7..94f80e1c4020 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -769,6 +769,7 @@ struct hnae3_tc_info { u8 prio_tc[HNAE3_MAX_USER_PRIO]; /* TC indexed by prio */ u16 tqp_count[HNAE3_MAX_TC]; u16 tqp_offset[HNAE3_MAX_TC]; + u8 max_tc; /* Total number of TCs */ u8 num_tc; /* Total number of enabled TCs */ bool mqprio_active; }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index 1f87a8a3fe32..ad53a3447322 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -282,8 +282,8 @@ static int hclge_tm_pg_to_pri_map_cfg(struct hclge_dev *hdev, return hclge_cmd_send(&hdev->hw, &desc, 1); } -static int hclge_tm_qs_to_pri_map_cfg(struct hclge_dev *hdev, - u16 qs_id, u8 pri) +static int hclge_tm_qs_to_pri_map_cfg(struct hclge_dev *hdev, u16 qs_id, u8 pri, + bool link_vld) { struct hclge_qs_to_pri_link_cmd *map; struct hclge_desc desc; @@ -294,7 +294,7 @@ static int hclge_tm_qs_to_pri_map_cfg(struct hclge_dev *hdev, map->qs_id = cpu_to_le16(qs_id); map->priority = pri; - map->link_vld = HCLGE_TM_QS_PRI_LINK_VLD_MSK; + map->link_vld = link_vld ? HCLGE_TM_QS_PRI_LINK_VLD_MSK : 0; return hclge_cmd_send(&hdev->hw, &desc, 1); } @@ -642,11 +642,13 @@ static void hclge_tm_update_kinfo_rss_size(struct hclge_vport *vport) * one tc for VF for simplicity. VF's vport_id is non zero. */ if (vport->vport_id) { + kinfo->tc_info.max_tc = 1; kinfo->tc_info.num_tc = 1; vport->qs_offset = HNAE3_MAX_TC + vport->vport_id - HCLGE_VF_VPORT_START_NUM; vport_max_rss_size = hdev->vf_rss_size_max; } else { + kinfo->tc_info.max_tc = hdev->tc_max; kinfo->tc_info.num_tc = min_t(u16, vport->alloc_tqps, hdev->tm_info.num_tc); vport->qs_offset = 0; @@ -714,14 +716,22 @@ static void hclge_tm_vport_info_update(struct hclge_dev *hdev) static void hclge_tm_tc_info_init(struct hclge_dev *hdev) { - u8 i; + u8 i, tc_sch_mode; + u32 bw_limit; + + for (i = 0; i < hdev->tc_max; i++) { + if (i < hdev->tm_info.num_tc) { + tc_sch_mode = HCLGE_SCH_MODE_DWRR; + bw_limit = hdev->tm_info.pg_info[0].bw_limit; + } else { + tc_sch_mode = HCLGE_SCH_MODE_SP; + bw_limit = 0; + } - for (i = 0; i < hdev->tm_info.num_tc; i++) { hdev->tm_info.tc_info[i].tc_id = i; - hdev->tm_info.tc_info[i].tc_sch_mode = HCLGE_SCH_MODE_DWRR; + hdev->tm_info.tc_info[i].tc_sch_mode = tc_sch_mode; hdev->tm_info.tc_info[i].pgid = 0; - hdev->tm_info.tc_info[i].bw_limit = - hdev->tm_info.pg_info[0].bw_limit; + hdev->tm_info.tc_info[i].bw_limit = bw_limit; } for (i = 0; i < HNAE3_MAX_USER_PRIO; i++) @@ -926,10 +936,13 @@ static int hclge_tm_pri_q_qs_cfg_tc_base(struct hclge_dev *hdev) for (k = 0; k < hdev->num_alloc_vport; k++) { struct hnae3_knic_private_info *kinfo = &vport[k].nic.kinfo; - for (i = 0; i < kinfo->tc_info.num_tc; i++) { + for (i = 0; i < kinfo->tc_info.max_tc; i++) { + u8 pri = i < kinfo->tc_info.num_tc ? i : 0; + bool link_vld = i < kinfo->tc_info.num_tc; + ret = hclge_tm_qs_to_pri_map_cfg(hdev, vport[k].qs_offset + i, - i); + pri, link_vld); if (ret) return ret; } @@ -949,7 +962,7 @@ static int hclge_tm_pri_q_qs_cfg_vnet_base(struct hclge_dev *hdev) for (i = 0; i < HNAE3_MAX_TC; i++) { ret = hclge_tm_qs_to_pri_map_cfg(hdev, vport[k].qs_offset + i, - k); + k, true); if (ret) return ret; } @@ -989,33 +1002,39 @@ static int hclge_tm_pri_tc_base_shaper_cfg(struct hclge_dev *hdev) { u32 max_tm_rate = hdev->ae_dev->dev_specs.max_tm_rate; struct hclge_shaper_ir_para ir_para; - u32 shaper_para; + u32 shaper_para_c, shaper_para_p; int ret; u32 i; - for (i = 0; i < hdev->tm_info.num_tc; i++) { + for (i = 0; i < hdev->tc_max; i++) { u32 rate = hdev->tm_info.tc_info[i].bw_limit; - ret = hclge_shaper_para_calc(rate, HCLGE_SHAPER_LVL_PRI, - &ir_para, max_tm_rate); - if (ret) - return ret; + if (rate) { + ret = hclge_shaper_para_calc(rate, HCLGE_SHAPER_LVL_PRI, + &ir_para, max_tm_rate); + if (ret) + return ret; + + shaper_para_c = hclge_tm_get_shapping_para(0, 0, 0, + HCLGE_SHAPER_BS_U_DEF, + HCLGE_SHAPER_BS_S_DEF); + shaper_para_p = hclge_tm_get_shapping_para(ir_para.ir_b, + ir_para.ir_u, + ir_para.ir_s, + HCLGE_SHAPER_BS_U_DEF, + HCLGE_SHAPER_BS_S_DEF); + } else { + shaper_para_c = 0; + shaper_para_p = 0; + } - shaper_para = hclge_tm_get_shapping_para(0, 0, 0, - HCLGE_SHAPER_BS_U_DEF, - HCLGE_SHAPER_BS_S_DEF); ret = hclge_tm_pri_shapping_cfg(hdev, HCLGE_TM_SHAP_C_BUCKET, i, - shaper_para, rate); + shaper_para_c, rate); if (ret) return ret; - shaper_para = hclge_tm_get_shapping_para(ir_para.ir_b, - ir_para.ir_u, - ir_para.ir_s, - HCLGE_SHAPER_BS_U_DEF, - HCLGE_SHAPER_BS_S_DEF); ret = hclge_tm_pri_shapping_cfg(hdev, HCLGE_TM_SHAP_P_BUCKET, i, - shaper_para, rate); + shaper_para_p, rate); if (ret) return ret; } @@ -1125,7 +1144,7 @@ static int hclge_tm_pri_tc_base_dwrr_cfg(struct hclge_dev *hdev) int ret; u32 i, k; - for (i = 0; i < hdev->tm_info.num_tc; i++) { + for (i = 0; i < hdev->tc_max; i++) { pg_info = &hdev->tm_info.pg_info[hdev->tm_info.tc_info[i].pgid]; dwrr = pg_info->tc_dwrr[i]; @@ -1135,9 +1154,15 @@ static int hclge_tm_pri_tc_base_dwrr_cfg(struct hclge_dev *hdev) return ret; for (k = 0; k < hdev->num_alloc_vport; k++) { + struct hnae3_knic_private_info *kinfo = &vport[k].nic.kinfo; + + if (i >= kinfo->tc_info.max_tc) + continue; + + dwrr = i < kinfo->tc_info.num_tc ? vport[k].dwrr : 0; ret = hclge_tm_qs_weight_cfg( hdev, vport[k].qs_offset + i, - vport[k].dwrr); + dwrr); if (ret) return ret; } @@ -1303,6 +1328,7 @@ static int hclge_tm_schd_mode_tc_base_cfg(struct hclge_dev *hdev, u8 pri_id) { struct hclge_vport *vport = hdev->vport; int ret; + u8 mode; u16 i; ret = hclge_tm_pri_schd_mode_cfg(hdev, pri_id); @@ -1310,9 +1336,16 @@ static int hclge_tm_schd_mode_tc_base_cfg(struct hclge_dev *hdev, u8 pri_id) return ret; for (i = 0; i < hdev->num_alloc_vport; i++) { + struct hnae3_knic_private_info *kinfo = &vport[i].nic.kinfo; + + if (pri_id >= kinfo->tc_info.max_tc) + continue; + + mode = pri_id < kinfo->tc_info.num_tc ? HCLGE_SCH_MODE_DWRR : + HCLGE_SCH_MODE_SP; ret = hclge_tm_qs_schd_mode_cfg(hdev, vport[i].qs_offset + pri_id, - HCLGE_SCH_MODE_DWRR); + mode); if (ret) return ret; } @@ -1353,7 +1386,7 @@ static int hclge_tm_lvl34_schd_mode_cfg(struct hclge_dev *hdev) u8 i; if (hdev->tx_sch_mode == HCLGE_FLAG_TC_BASE_SCH_MODE) { - for (i = 0; i < hdev->tm_info.num_tc; i++) { + for (i = 0; i < hdev->tc_max; i++) { ret = hclge_tm_schd_mode_tc_base_cfg(hdev, i); if (ret) return ret; -- cgit From 71b215f36dca1a3d5d1c576b2099e6d7ea03047e Mon Sep 17 00:00:00 2001 From: Jie Wang Date: Sat, 11 Jun 2022 20:25:28 +0800 Subject: net: hns3: fix PF rss size initialization bug Currently hns3 driver misuses the VF rss size to initialize the PF rss size in hclge_tm_vport_tc_info_update. So this patch fix it by checking the vport id before initialization. Fixes: 7347255ea389 ("net: hns3: refactor PF rss get APIs with new common rss get APIs") Signed-off-by: Jie Wang Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index ad53a3447322..f5296ff60694 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -681,7 +681,9 @@ static void hclge_tm_vport_tc_info_update(struct hclge_vport *vport) kinfo->num_tqps = hclge_vport_get_tqp_num(vport); vport->dwrr = 100; /* 100 percent as init */ vport->bw_limit = hdev->tm_info.pg_info[0].bw_limit; - hdev->rss_cfg.rss_size = kinfo->rss_size; + + if (vport->vport_id == PF_VPORT_ID) + hdev->rss_cfg.rss_size = kinfo->rss_size; /* when enable mqprio, the tc_info has been updated. */ if (kinfo->tc_info.mqprio_active) -- cgit From 12a3670887725df364cc3e030cf3bede6f13b364 Mon Sep 17 00:00:00 2001 From: Guangbin Huang Date: Sat, 11 Jun 2022 20:25:29 +0800 Subject: net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization Currently in driver initialization process, driver will set shapping parameters of tm port to default speed read from firmware. However, the speed of SFP module may not be default speed, so shapping parameters of tm port may be incorrect. To fix this problem, driver sets new shapping parameters for tm port after getting exact speed of SFP module in this case. Fixes: 88d10bd6f730 ("net: hns3: add support for multiple media type") Signed-off-by: Guangbin Huang Signed-off-by: David S. Miller --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 11 ++++++++--- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 2 +- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h | 1 + 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 2e891b837c51..fae79764dc44 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3268,7 +3268,7 @@ static int hclge_tp_port_init(struct hclge_dev *hdev) static int hclge_update_port_info(struct hclge_dev *hdev) { struct hclge_mac *mac = &hdev->hw.mac; - int speed = HCLGE_MAC_SPEED_UNKNOWN; + int speed; int ret; /* get the port info from SFP cmd if not copper port */ @@ -3279,10 +3279,13 @@ static int hclge_update_port_info(struct hclge_dev *hdev) if (!hdev->support_sfp_query) return 0; - if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) + if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) { + speed = mac->speed; ret = hclge_get_sfp_info(hdev, mac); - else + } else { + speed = HCLGE_MAC_SPEED_UNKNOWN; ret = hclge_get_sfp_speed(hdev, &speed); + } if (ret == -EOPNOTSUPP) { hdev->support_sfp_query = false; @@ -3294,6 +3297,8 @@ static int hclge_update_port_info(struct hclge_dev *hdev) if (hdev->ae_dev->dev_version >= HNAE3_DEVICE_VERSION_V2) { if (mac->speed_type == QUERY_ACTIVE_SPEED) { hclge_update_port_capability(hdev, mac); + if (mac->speed != speed) + (void)hclge_tm_port_shaper_cfg(hdev); return 0; } return hclge_cfg_mac_speed_dup(hdev, mac->speed, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index f5296ff60694..2f33b036a47a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -420,7 +420,7 @@ static int hclge_tm_pg_shapping_cfg(struct hclge_dev *hdev, return hclge_cmd_send(&hdev->hw, &desc, 1); } -static int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev) +int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev) { struct hclge_port_shapping_cmd *shap_cfg_cmd; struct hclge_shaper_ir_para ir_para; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h index 619cc30a2dfc..d943943912f7 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h @@ -237,6 +237,7 @@ int hclge_pause_addr_cfg(struct hclge_dev *hdev, const u8 *mac_addr); void hclge_pfc_rx_stats_get(struct hclge_dev *hdev, u64 *stats); void hclge_pfc_tx_stats_get(struct hclge_dev *hdev, u64 *stats); int hclge_tm_qs_shaper_cfg(struct hclge_vport *vport, int max_tx_rate); +int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev); int hclge_tm_get_qset_num(struct hclge_dev *hdev, u16 *qset_num); int hclge_tm_get_pri_num(struct hclge_dev *hdev, u8 *pri_num); int hclge_tm_get_qset_map_pri(struct hclge_dev *hdev, u16 qset_id, u8 *priority, -- cgit From 00be43a74ca262267ceb96c0c5e3f51d3a56342e Mon Sep 17 00:00:00 2001 From: Andy Chiu Date: Mon, 13 Jun 2022 11:42:01 +0800 Subject: net: axienet: make the 64b addresable DMA depends on 64b archectures Currently it is not safe to config the IP as 64-bit addressable on 32-bit archectures, which cannot perform a double-word store on its descriptor pointers. The pointer is 64-bit wide if the IP is configured as 64-bit, and the device would process the partially updated pointer on some states if the pointer was updated via two store-words. To prevent such condition, we force a probe fail if we discover that the IP has 64-bit capability but it is not running on a 64-Bit kernel. This is a series of patch (1/2). The next patch must be applied in order to make 64b DMA safe on 64b archectures. Signed-off-by: Andy Chiu Reported-by: Max Hsu Reviewed-by: Greentime Hu Signed-off-by: David S. Miller --- drivers/net/ethernet/xilinx/xilinx_axienet.h | 36 +++++++++++++++++++++++ drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 28 +++--------------- 2 files changed, 40 insertions(+), 24 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h index 4225efbeda3d..6c95676ba172 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h @@ -547,6 +547,42 @@ static inline void axienet_iow(struct axienet_local *lp, off_t offset, iowrite32(value, lp->regs + offset); } +/** + * axienet_dma_out32 - Memory mapped Axi DMA register write. + * @lp: Pointer to axienet local structure + * @reg: Address offset from the base address of the Axi DMA core + * @value: Value to be written into the Axi DMA register + * + * This function writes the desired value into the corresponding Axi DMA + * register. + */ + +static inline void axienet_dma_out32(struct axienet_local *lp, + off_t reg, u32 value) +{ + iowrite32(value, lp->dma_regs + reg); +} + +#ifdef CONFIG_64BIT +static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, + dma_addr_t addr) +{ + axienet_dma_out32(lp, reg, lower_32_bits(addr)); + + if (lp->features & XAE_FEATURE_DMA_64BIT) + axienet_dma_out32(lp, reg + 4, upper_32_bits(addr)); +} + +#else /* CONFIG_64BIT */ + +static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, + dma_addr_t addr) +{ + axienet_dma_out32(lp, reg, lower_32_bits(addr)); +} + +#endif /* CONFIG_64BIT */ + /* Function prototypes visible in xilinx_axienet_mdio.c for other files */ int axienet_mdio_enable(struct axienet_local *lp); void axienet_mdio_disable(struct axienet_local *lp); diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c index 93c9f305bba4..fa7bcd2c1892 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c @@ -133,30 +133,6 @@ static inline u32 axienet_dma_in32(struct axienet_local *lp, off_t reg) return ioread32(lp->dma_regs + reg); } -/** - * axienet_dma_out32 - Memory mapped Axi DMA register write. - * @lp: Pointer to axienet local structure - * @reg: Address offset from the base address of the Axi DMA core - * @value: Value to be written into the Axi DMA register - * - * This function writes the desired value into the corresponding Axi DMA - * register. - */ -static inline void axienet_dma_out32(struct axienet_local *lp, - off_t reg, u32 value) -{ - iowrite32(value, lp->dma_regs + reg); -} - -static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, - dma_addr_t addr) -{ - axienet_dma_out32(lp, reg, lower_32_bits(addr)); - - if (lp->features & XAE_FEATURE_DMA_64BIT) - axienet_dma_out32(lp, reg + 4, upper_32_bits(addr)); -} - static void desc_set_phys_addr(struct axienet_local *lp, dma_addr_t addr, struct axidma_bd *desc) { @@ -2061,6 +2037,10 @@ static int axienet_probe(struct platform_device *pdev) iowrite32(0x0, desc); } } + if (!IS_ENABLED(CONFIG_64BIT) && lp->features & XAE_FEATURE_DMA_64BIT) { + dev_err(&pdev->dev, "64-bit addressable DMA is not compatible with 32-bit archecture\n"); + goto cleanup_clk; + } ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(addr_width)); if (ret) { -- cgit From b690f8df6497b654c2c871871e0a598e9750c0eb Mon Sep 17 00:00:00 2001 From: Andy Chiu Date: Mon, 13 Jun 2022 11:42:02 +0800 Subject: net: axienet: Use iowrite64 to write all 64b descriptor pointers According to commit f735c40ed93c ("net: axienet: Autodetect 64-bit DMA capability") and AXI-DMA spec (pg021), on 64-bit capable dma, only writing MSB part of tail descriptor pointer causes DMA engine to start fetching descriptors. However, we found that it is true only if dma is in idle state. In other words, dma would use a tailp even if it only has LSB updated, when the dma is running. The non-atomicity of this behavior could be problematic if enough delay were introduced in between the 2 writes. For example, if an interrupt comes right after the LSB write and the cpu spends long enough time in the handler for the dma to get back into idle state by completing descriptors, then the seconcd write to MSB would treat dma to start fetching descriptors again. Since the descriptor next to the one pointed by current tail pointer is not filled by the kernel yet, fetching a null descriptor here causes a dma internal error and halt the dma engine down. We suggest that the dma engine should start process a 64-bit MMIO write to the descriptor pointer only if ONE 32-bit part of it is written on all states. Or we should restrict the use of 64-bit addressable dma on 32-bit platforms, since those devices have no instruction to guarantee the write to LSB and MSB part of tail pointer occurs atomically to the dma. initial condition: curp = x-3; tailp = x-2; LSB = x; MSB = 0; cpu: |dma: iowrite32(LSB, tailp) | completes #(x-3) desc, curp = x-3 ... | tailp updated => irq | completes #(x-2) desc, curp = x-2 ... | completes #(x-1) desc, curp = x-1 ... | ... ... | completes #x desc, curp = tailp = x <= irqreturn | reaches tailp == curp = x, idle iowrite32(MSB, tailp + 4) | ... | tailp updated, starts fetching... | fetches #(x + 1) desc, sees cntrl = 0 | post Tx error, halt Signed-off-by: Andy Chiu Reported-by: Max Hsu Reviewed-by: Greentime Hu Signed-off-by: David S. Miller --- drivers/net/ethernet/xilinx/xilinx_axienet.h | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h index 6c95676ba172..97ddc0273b8a 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h @@ -564,13 +564,28 @@ static inline void axienet_dma_out32(struct axienet_local *lp, } #ifdef CONFIG_64BIT +/** + * axienet_dma_out64 - Memory mapped Axi DMA register write. + * @lp: Pointer to axienet local structure + * @reg: Address offset from the base address of the Axi DMA core + * @value: Value to be written into the Axi DMA register + * + * This function writes the desired value into the corresponding Axi DMA + * register. + */ +static inline void axienet_dma_out64(struct axienet_local *lp, + off_t reg, u64 value) +{ + iowrite64(value, lp->dma_regs + reg); +} + static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, dma_addr_t addr) { - axienet_dma_out32(lp, reg, lower_32_bits(addr)); - if (lp->features & XAE_FEATURE_DMA_64BIT) - axienet_dma_out32(lp, reg + 4, upper_32_bits(addr)); + axienet_dma_out64(lp, reg, addr); + else + axienet_dma_out32(lp, reg, lower_32_bits(addr)); } #else /* CONFIG_64BIT */ -- cgit From 5f7b84151a89f6f3a8d1db4db2bc4f5b270d66ee Mon Sep 17 00:00:00 2001 From: "David S. Miller" Date: Mon, 13 Jun 2022 12:49:21 +0100 Subject: xilinx: Fix build on x86. CONFIG_64BIT is not sufficient for checking for availability of iowrite64() and friends. Also, the out_addr helpers need to be inline. Fixes: b690f8df6497 ("net: axienet: Use iowrite64 to write all 64b descriptor pointers") Signed-off-by: David S. Miller --- drivers/net/ethernet/xilinx/xilinx_axienet.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet.h b/drivers/net/ethernet/xilinx/xilinx_axienet.h index 97ddc0273b8a..f2e2261b4b7d 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet.h +++ b/drivers/net/ethernet/xilinx/xilinx_axienet.h @@ -563,7 +563,7 @@ static inline void axienet_dma_out32(struct axienet_local *lp, iowrite32(value, lp->dma_regs + reg); } -#ifdef CONFIG_64BIT +#if defined(CONFIG_64BIT) && defined(iowrite64) /** * axienet_dma_out64 - Memory mapped Axi DMA register write. * @lp: Pointer to axienet local structure @@ -579,8 +579,8 @@ static inline void axienet_dma_out64(struct axienet_local *lp, iowrite64(value, lp->dma_regs + reg); } -static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, - dma_addr_t addr) +static inline void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, + dma_addr_t addr) { if (lp->features & XAE_FEATURE_DMA_64BIT) axienet_dma_out64(lp, reg, addr); @@ -590,7 +590,7 @@ static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, #else /* CONFIG_64BIT */ -static void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, +static inline void axienet_dma_out_addr(struct axienet_local *lp, off_t reg, dma_addr_t addr) { axienet_dma_out32(lp, reg, lower_32_bits(addr)); -- cgit From 619c010a65391d06bc96e79fa0e7725790e5d1a9 Mon Sep 17 00:00:00 2001 From: Suman Ghosh Date: Sun, 12 Jun 2022 23:15:36 +0530 Subject: octeontx2-vf: Add support for adaptive interrupt coalescing Fixes: 6e144b47f560 (octeontx2-pf: Add support for adaptive interrupt coalescing) Added support for VF interfaces as well. Signed-off-by: Suman Ghosh Signed-off-by: David S. Miller --- drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c index bc614a4def9e..3f60a80e34c8 100644 --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_ethtool.c @@ -1390,7 +1390,8 @@ static int otx2vf_get_link_ksettings(struct net_device *netdev, static const struct ethtool_ops otx2vf_ethtool_ops = { .supported_coalesce_params = ETHTOOL_COALESCE_USECS | - ETHTOOL_COALESCE_MAX_FRAMES, + ETHTOOL_COALESCE_MAX_FRAMES | + ETHTOOL_COALESCE_USE_ADAPTIVE, .supported_ring_params = ETHTOOL_RING_USE_RX_BUF_LEN | ETHTOOL_RING_USE_CQE_SIZE, .get_link = otx2_get_link, -- cgit From 57cd6d157eb479f0a8e820fd36b7240845c8a937 Mon Sep 17 00:00:00 2001 From: Sami Tolvanen Date: Tue, 31 May 2022 10:59:10 -0700 Subject: cfi: Fix __cfi_slowpath_diag RCU usage with cpuidle RCU_NONIDLE usage during __cfi_slowpath_diag can result in an invalid RCU state in the cpuidle code path: WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:613 rcu_eqs_enter+0xe4/0x138 ... Call trace: rcu_eqs_enter+0xe4/0x138 rcu_idle_enter+0xa8/0x100 cpuidle_enter_state+0x154/0x3a8 cpuidle_enter+0x3c/0x58 do_idle.llvm.6590768638138871020+0x1f4/0x2ec cpu_startup_entry+0x28/0x2c secondary_start_kernel+0x1b8/0x220 __secondary_switched+0x94/0x98 Instead, call rcu_irq_enter/exit to wake up RCU only when needed and disable interrupts for the entire CFI shadow/module check when we do. Signed-off-by: Sami Tolvanen Link: https://lore.kernel.org/r/20220531175910.890307-1-samitolvanen@google.com Fixes: cf68fffb66d6 ("add support for Clang CFI") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook --- kernel/cfi.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/kernel/cfi.c b/kernel/cfi.c index 9594cfd1cf2c..08102d19ec15 100644 --- a/kernel/cfi.c +++ b/kernel/cfi.c @@ -281,6 +281,8 @@ static inline cfi_check_fn find_module_check_fn(unsigned long ptr) static inline cfi_check_fn find_check_fn(unsigned long ptr) { cfi_check_fn fn = NULL; + unsigned long flags; + bool rcu_idle; if (is_kernel_text(ptr)) return __cfi_check; @@ -290,13 +292,21 @@ static inline cfi_check_fn find_check_fn(unsigned long ptr) * the shadow and __module_address use RCU, so we need to wake it * up if necessary. */ - RCU_NONIDLE({ - if (IS_ENABLED(CONFIG_CFI_CLANG_SHADOW)) - fn = find_shadow_check_fn(ptr); + rcu_idle = !rcu_is_watching(); + if (rcu_idle) { + local_irq_save(flags); + rcu_irq_enter(); + } + + if (IS_ENABLED(CONFIG_CFI_CLANG_SHADOW)) + fn = find_shadow_check_fn(ptr); + if (!fn) + fn = find_module_check_fn(ptr); - if (!fn) - fn = find_module_check_fn(ptr); - }); + if (rcu_idle) { + rcu_irq_exit(); + local_irq_restore(flags); + } return fn; } -- cgit From 993d0b287e2ef7bee2e8b13b0ce4d2b5066f278e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sun, 12 Jun 2022 22:32:25 +0100 Subject: usercopy: Handle vm_map_ram() areas vmalloc does not allocate a vm_struct for vm_map_ram() areas. That causes us to deny usercopies from those areas. This affects XFS which uses vm_map_ram() for its directories. Fix this by calling find_vmap_area() instead of find_vm_area(). Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Zorro Lang Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220612213227.3881769-2-willy@infradead.org --- include/linux/vmalloc.h | 1 + mm/usercopy.c | 10 ++++------ mm/vmalloc.c | 2 +- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index b159c2789961..096d48aa3437 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -215,6 +215,7 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size, void free_vm_area(struct vm_struct *area); extern struct vm_struct *remove_vm_area(const void *addr); extern struct vm_struct *find_vm_area(const void *addr); +struct vmap_area *find_vmap_area(unsigned long addr); static inline bool is_vm_area_hugepages(const void *addr) { diff --git a/mm/usercopy.c b/mm/usercopy.c index baeacc735b83..cd4b41d9bf76 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -173,16 +173,14 @@ static inline void check_heap_object(const void *ptr, unsigned long n, } if (is_vmalloc_addr(ptr)) { - struct vm_struct *area = find_vm_area(ptr); + struct vmap_area *area = find_vmap_area((unsigned long)ptr); unsigned long offset; - if (!area) { + if (!area) usercopy_abort("vmalloc", "no area", to_user, 0, n); - return; - } - offset = ptr - area->addr; - if (offset + n > get_vm_area_size(area)) + offset = (unsigned long)ptr - area->va_start; + if ((unsigned long)ptr + n > area->va_end) usercopy_abort("vmalloc", NULL, to_user, offset, n); return; } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 07db42455dd4..effd1ff6a4b4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1798,7 +1798,7 @@ static void free_unmap_vmap_area(struct vmap_area *va) free_vmap_area_noflush(va); } -static struct vmap_area *find_vmap_area(unsigned long addr) +struct vmap_area *find_vmap_area(unsigned long addr) { struct vmap_area *va; -- cgit From 35fb9ae4aa2e838b234323e6f7cf6336ff019e5a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sun, 12 Jun 2022 22:32:26 +0100 Subject: usercopy: Cast pointer to an integer once Get rid of a lot of annoying casts by setting 'addr' once at the top of the function. Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Zorro Lang Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220612213227.3881769-3-willy@infradead.org --- mm/usercopy.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/mm/usercopy.c b/mm/usercopy.c index cd4b41d9bf76..30a4db3cb1df 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -161,26 +161,27 @@ static inline void check_bogus_address(const unsigned long ptr, unsigned long n, static inline void check_heap_object(const void *ptr, unsigned long n, bool to_user) { + uintptr_t addr = (uintptr_t)ptr; struct folio *folio; if (is_kmap_addr(ptr)) { - unsigned long page_end = (unsigned long)ptr | (PAGE_SIZE - 1); + unsigned long page_end = addr | (PAGE_SIZE - 1); - if ((unsigned long)ptr + n - 1 > page_end) + if (addr + n - 1 > page_end) usercopy_abort("kmap", NULL, to_user, offset_in_page(ptr), n); return; } if (is_vmalloc_addr(ptr)) { - struct vmap_area *area = find_vmap_area((unsigned long)ptr); + struct vmap_area *area = find_vmap_area(addr); unsigned long offset; if (!area) usercopy_abort("vmalloc", "no area", to_user, 0, n); - offset = (unsigned long)ptr - area->va_start; - if ((unsigned long)ptr + n > area->va_end) + offset = addr - area->va_start; + if (addr + n > area->va_end) usercopy_abort("vmalloc", NULL, to_user, offset, n); return; } -- cgit From 1dfbe9fcda4afc957f0e371e207ae3cb7e8f3b0e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sun, 12 Jun 2022 22:32:27 +0100 Subject: usercopy: Make usercopy resilient against ridiculously large copies If 'n' is so large that it's negative, we might wrap around and mistakenly think that the copy is OK when it's not. Such a copy would probably crash, but just doing the arithmetic in a more simple way lets us detect and refuse this case. Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Uladzislau Rezki (Sony) Tested-by: Zorro Lang Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20220612213227.3881769-4-willy@infradead.org --- mm/usercopy.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/mm/usercopy.c b/mm/usercopy.c index 30a4db3cb1df..4e1da708699b 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -162,27 +162,26 @@ static inline void check_heap_object(const void *ptr, unsigned long n, bool to_user) { uintptr_t addr = (uintptr_t)ptr; + unsigned long offset; struct folio *folio; if (is_kmap_addr(ptr)) { - unsigned long page_end = addr | (PAGE_SIZE - 1); - - if (addr + n - 1 > page_end) - usercopy_abort("kmap", NULL, to_user, - offset_in_page(ptr), n); + offset = offset_in_page(ptr); + if (n > PAGE_SIZE - offset) + usercopy_abort("kmap", NULL, to_user, offset, n); return; } if (is_vmalloc_addr(ptr)) { struct vmap_area *area = find_vmap_area(addr); - unsigned long offset; if (!area) usercopy_abort("vmalloc", "no area", to_user, 0, n); - offset = addr - area->va_start; - if (addr + n > area->va_end) + if (n > area->va_end - addr) { + offset = addr - area->va_start; usercopy_abort("vmalloc", NULL, to_user, offset, n); + } return; } @@ -195,8 +194,8 @@ static inline void check_heap_object(const void *ptr, unsigned long n, /* Check slab allocator for flags and size. */ __check_heap_object(ptr, n, folio_slab(folio), to_user); } else if (folio_test_large(folio)) { - unsigned long offset = ptr - folio_address(folio); - if (offset + n > folio_size(folio)) + offset = ptr - folio_address(folio); + if (n > folio_size(folio) - offset) usercopy_abort("page alloc", NULL, to_user, offset, n); } } -- cgit From 884c65e4daf3eab8730b2bbd5abc5a2c0403b3f3 Mon Sep 17 00:00:00 2001 From: Jean-Philippe Brucker Date: Thu, 9 Jun 2022 17:14:59 +0100 Subject: amd-xgbe: Use platform_irq_count() The AMD XGbE driver currently counts the number of interrupts assigned to the device by inspecting the pdev->resource array. Since commit a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT core") removed IRQs from this array, the driver now attempts to get all interrupts from 1 to -1U and gives up probing once it reaches an invalid interrupt index. Obtain the number of IRQs with platform_irq_count() instead. Fixes: a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT core") Signed-off-by: Jean-Philippe Brucker Acked-by: Rob Herring Acked-by: Tom Lendacky Link: https://lore.kernel.org/r/20220609161457.69614-1-jean-philippe@linaro.org Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/amd/xgbe/xgbe-platform.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-platform.c b/drivers/net/ethernet/amd/xgbe/xgbe-platform.c index 4ebd2410185a..4d790a89fe77 100644 --- a/drivers/net/ethernet/amd/xgbe/xgbe-platform.c +++ b/drivers/net/ethernet/amd/xgbe/xgbe-platform.c @@ -338,7 +338,7 @@ static int xgbe_platform_probe(struct platform_device *pdev) * the PHY resources listed last */ phy_memnum = xgbe_resource_count(pdev, IORESOURCE_MEM) - 3; - phy_irqnum = xgbe_resource_count(pdev, IORESOURCE_IRQ) - 1; + phy_irqnum = platform_irq_count(pdev) - 1; dma_irqnum = 1; dma_irqend = phy_irqnum; } else { @@ -348,7 +348,7 @@ static int xgbe_platform_probe(struct platform_device *pdev) phy_memnum = 0; phy_irqnum = 0; dma_irqnum = 1; - dma_irqend = xgbe_resource_count(pdev, IORESOURCE_IRQ); + dma_irqend = platform_irq_count(pdev); } /* Obtain the mmio areas for the device */ -- cgit From 9cc8ea99bf7ae6f5a5a305bb14a6f1e3f18f5f54 Mon Sep 17 00:00:00 2001 From: Jonathan Neuschäfer Date: Fri, 10 Jun 2022 09:28:08 +0200 Subject: docs: networking: phy: Fix a typo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Write "to be operated" instead of "to be operate". Signed-off-by: Jonathan Neuschäfer Reviewed-by: Andrew Lunn Link: https://lore.kernel.org/r/20220610072809.352962-1-j.neuschaefer@gmx.net Signed-off-by: Jakub Kicinski --- Documentation/networking/phy.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index d43da709bf40..704f31da5167 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -104,7 +104,7 @@ Whenever possible, use the PHY side RGMII delay for these reasons: * PHY device drivers in PHYLIB being reusable by nature, being able to configure correctly a specified delay enables more designs with similar delay - requirements to be operate correctly + requirements to be operated correctly For cases where the PHY is not capable of providing this delay, but the Ethernet MAC driver is capable of doing so, the correct phy_interface_t value -- cgit From 168f912893407a5acb798a4a58613b5f1f98c717 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Mon, 13 Jun 2022 13:15:17 +0200 Subject: fs: account for group membership When calling setattr_prepare() to determine the validity of the attributes the ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id. This is exactly the same for idmapped and non-idmapped mounts and allows callers to pass in the values they want to see written to inode->i_{g,u}id. When group ownership is changed a caller whose fsuid owns the inode can change the group of the inode to any group they are a member of. When searching through the caller's groups we need to use the gid mapped according to the idmapped mount otherwise we will fail to change ownership for unprivileged users. Consider a caller running with fsuid and fsgid 1000 using an idmapped mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the idmapped mount. The caller now requests the gid of the file to be changed to 1000 going through the idmapped mount. In the vfs we will immediately map the requested gid to the value that will need to be written to inode->i_gid and place it in attr->ia_gid. Since this idmapped mount maps 65534 to 1000 we place 65534 in attr->ia_gid. When we check whether the caller is allowed to change group ownership we first validate that their fsuid matches the inode's uid. The inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount. Since the caller's fsuid is 1000 we pass the check. We now check whether the caller is allowed to change inode->i_gid to the requested gid by calling in_group_p(). This will compare the passed in gid to the caller's fsgid and search the caller's additional groups. Since we're dealing with an idmapped mount we need to pass in the gid mapped according to the idmapped mount. This is akin to checking whether a caller is privileged over the future group the inode is owned by. And that needs to take the idmapped mount into account. Note, all helpers are nops without idmapped mounts. New regression test sent to xfstests. Link: https://github.com/lxc/lxd/issues/10537 Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org Fixes: 2f221d6f7b88 ("attr: handle idmapped mounts") Cc: Seth Forshee Cc: Christoph Hellwig Cc: Aleksa Sarai Cc: Al Viro Cc: stable@vger.kernel.org # 5.15+ CC: linux-fsdevel@vger.kernel.org Reviewed-by: Seth Forshee Signed-off-by: Christian Brauner (Microsoft) --- fs/attr.c | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/fs/attr.c b/fs/attr.c index 66899b6e9bd8..dbe996b0dedf 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -61,9 +61,15 @@ static bool chgrp_ok(struct user_namespace *mnt_userns, const struct inode *inode, kgid_t gid) { kgid_t kgid = i_gid_into_mnt(mnt_userns, inode); - if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)) && - (in_group_p(gid) || gid_eq(gid, inode->i_gid))) - return true; + if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) { + kgid_t mapped_gid; + + if (gid_eq(gid, inode->i_gid)) + return true; + mapped_gid = mapped_kgid_fs(mnt_userns, i_user_ns(inode), gid); + if (in_group_p(mapped_gid)) + return true; + } if (capable_wrt_inode_uidgid(mnt_userns, inode, CAP_CHOWN)) return true; if (gid_eq(kgid, INVALID_GID) && @@ -123,12 +129,20 @@ int setattr_prepare(struct user_namespace *mnt_userns, struct dentry *dentry, /* Make sure a caller can chmod. */ if (ia_valid & ATTR_MODE) { + kgid_t mapped_gid; + if (!inode_owner_or_capable(mnt_userns, inode)) return -EPERM; + + if (ia_valid & ATTR_GID) + mapped_gid = mapped_kgid_fs(mnt_userns, + i_user_ns(inode), attr->ia_gid); + else + mapped_gid = i_gid_into_mnt(mnt_userns, inode); + /* Also check the setgid bit! */ - if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid : - i_gid_into_mnt(mnt_userns, inode)) && - !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) + if (!in_group_p(mapped_gid) && + !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) attr->ia_mode &= ~S_ISGID; } -- cgit From 4b7a632ac4e7101ceefee8484d5c2ca505d347b3 Mon Sep 17 00:00:00 2001 From: Petr Machata Date: Mon, 13 Jun 2022 15:50:17 +0300 Subject: mlxsw: spectrum_cnt: Reorder counter pools Both RIF and ACL flow counters use a 24-bit SW-managed counter address to communicate which counter they want to bind. In a number of Spectrum FW releases, binding a RIF counter is broken and slices the counter index to 16 bits. As a result, on Spectrum-2 and above, no more than about 410 RIF counters can be effectively used. This translates to 205 netdevices for which L3 HW stats can be enabled. (This does not happen on Spectrum-1, because there are fewer counters available overall and the counter index never exceeds 16 bits.) Binding counters to ACLs does not have this issue. Therefore reorder the counter allocation scheme so that RIF counters come first and therefore get lower indices that are below the 16-bit barrier. Fixes: 98e60dce4da1 ("Merge branch 'mlxsw-Introduce-initial-Spectrum-2-support'") Reported-by: Maksym Yaremchuk Signed-off-by: Petr Machata Signed-off-by: Ido Schimmel Link: https://lore.kernel.org/r/20220613125017.2018162-1-idosch@nvidia.com Signed-off-by: Paolo Abeni --- drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h index a68d931090dd..15c8d4de8350 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_cnt.h @@ -8,8 +8,8 @@ #include "spectrum.h" enum mlxsw_sp_counter_sub_pool_id { - MLXSW_SP_COUNTER_SUB_POOL_FLOW, MLXSW_SP_COUNTER_SUB_POOL_RIF, + MLXSW_SP_COUNTER_SUB_POOL_FLOW, }; int mlxsw_sp_counter_alloc(struct mlxsw_sp *mlxsw_sp, -- cgit From 71a579f0d3777a704355e6f1572dfba92a9b58b2 Mon Sep 17 00:00:00 2001 From: Michal Michalik Date: Tue, 10 May 2022 13:03:43 +0200 Subject: ice: Fix PTP TX timestamp offset calculation The offset was being incorrectly calculated for E822 - that led to collisions in choosing TX timestamp register location when more than one port was trying to use timestamping mechanism. In E822 one quad is being logically split between ports, so quad 0 is having trackers for ports 0-3, quad 1 ports 4-7 etc. Each port should have separate memory location for tracking timestamps. Due to error for example ports 1 and 2 had been assigned to quad 0 with same offset (0), while port 1 should have offset 0 and 1 offset 16. Fix it by correctly calculating quad offset. Fixes: 3a7496234d17 ("ice: implement basic E822 PTP support") Signed-off-by: Michal Michalik Tested-by: Gurucharan (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_ptp.c | 2 +- drivers/net/ethernet/intel/ice/ice_ptp.h | 31 +++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c index 662947c882e8..ef9344ef0d8e 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c @@ -2271,7 +2271,7 @@ static int ice_ptp_init_tx_e822(struct ice_pf *pf, struct ice_ptp_tx *tx, u8 port) { tx->quad = port / ICE_PORTS_PER_QUAD; - tx->quad_offset = tx->quad * INDEX_PER_PORT; + tx->quad_offset = (port % ICE_PORTS_PER_QUAD) * INDEX_PER_PORT; tx->len = INDEX_PER_PORT; return ice_ptp_alloc_tx_tracker(tx); diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h index afd048d69959..10e396abf130 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp.h @@ -49,6 +49,37 @@ struct ice_perout_channel { * To allow multiple ports to access the shared register block independently, * the blocks are split up so that indexes are assigned to each port based on * hardware logical port number. + * + * The timestamp blocks are handled differently for E810- and E822-based + * devices. In E810 devices, each port has its own block of timestamps, while in + * E822 there is a need to logically break the block of registers into smaller + * chunks based on the port number to avoid collisions. + * + * Example for port 5 in E810: + * +--------+--------+--------+--------+--------+--------+--------+--------+ + * |register|register|register|register|register|register|register|register| + * | block | block | block | block | block | block | block | block | + * | for | for | for | for | for | for | for | for | + * | port 0 | port 1 | port 2 | port 3 | port 4 | port 5 | port 6 | port 7 | + * +--------+--------+--------+--------+--------+--------+--------+--------+ + * ^^ + * || + * |--- quad offset is always 0 + * ---- quad number + * + * Example for port 5 in E822: + * +-----------------------------+-----------------------------+ + * | register block for quad 0 | register block for quad 1 | + * |+------+------+------+------+|+------+------+------+------+| + * ||port 0|port 1|port 2|port 3|||port 0|port 1|port 2|port 3|| + * |+------+------+------+------+|+------+------+------+------+| + * +-----------------------------+-------^---------------------+ + * ^ | + * | --- quad offset* + * ---- quad number + * + * * PHY port 5 is port 1 in quad 1 + * */ /** -- cgit From 9542ef4fba8c73e176b8aa18a8adf04aecb889e5 Mon Sep 17 00:00:00 2001 From: Roman Storozhenko Date: Tue, 7 Jun 2022 08:54:57 +0200 Subject: ice: Sync VLAN filtering features for DVM VLAN filtering features, that is C-Tag and S-Tag, in DVM mode must be both enabled or disabled. In case of turning off/on only one of the features, another feature must be turned off/on automatically with issuing an appropriate message to the kernel log. Fixes: 1babaf77f49d ("ice: Advertise 802.1ad VLAN filtering and offloads for PF netdev") Signed-off-by: Roman Storozhenko Co-developed-by: Anatolii Gerasymenko Signed-off-by: Anatolii Gerasymenko Tested-by: Gurucharan (A Contingent worker at Intel) Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_main.c | 49 +++++++++++++++++++------------ 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index e1cae253412c..c1ac2f746714 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -5763,25 +5763,38 @@ static netdev_features_t ice_fix_features(struct net_device *netdev, netdev_features_t features) { struct ice_netdev_priv *np = netdev_priv(netdev); - netdev_features_t supported_vlan_filtering; - netdev_features_t requested_vlan_filtering; - struct ice_vsi *vsi = np->vsi; - - requested_vlan_filtering = features & NETIF_VLAN_FILTERING_FEATURES; - - /* make sure supported_vlan_filtering works for both SVM and DVM */ - supported_vlan_filtering = NETIF_F_HW_VLAN_CTAG_FILTER; - if (ice_is_dvm_ena(&vsi->back->hw)) - supported_vlan_filtering |= NETIF_F_HW_VLAN_STAG_FILTER; - - if (requested_vlan_filtering && - requested_vlan_filtering != supported_vlan_filtering) { - if (requested_vlan_filtering & NETIF_F_HW_VLAN_CTAG_FILTER) { - netdev_warn(netdev, "cannot support requested VLAN filtering settings, enabling all supported VLAN filtering settings\n"); - features |= supported_vlan_filtering; + netdev_features_t req_vlan_fltr, cur_vlan_fltr; + bool cur_ctag, cur_stag, req_ctag, req_stag; + + cur_vlan_fltr = netdev->features & NETIF_VLAN_FILTERING_FEATURES; + cur_ctag = cur_vlan_fltr & NETIF_F_HW_VLAN_CTAG_FILTER; + cur_stag = cur_vlan_fltr & NETIF_F_HW_VLAN_STAG_FILTER; + + req_vlan_fltr = features & NETIF_VLAN_FILTERING_FEATURES; + req_ctag = req_vlan_fltr & NETIF_F_HW_VLAN_CTAG_FILTER; + req_stag = req_vlan_fltr & NETIF_F_HW_VLAN_STAG_FILTER; + + if (req_vlan_fltr != cur_vlan_fltr) { + if (ice_is_dvm_ena(&np->vsi->back->hw)) { + if (req_ctag && req_stag) { + features |= NETIF_VLAN_FILTERING_FEATURES; + } else if (!req_ctag && !req_stag) { + features &= ~NETIF_VLAN_FILTERING_FEATURES; + } else if ((!cur_ctag && req_ctag && !cur_stag) || + (!cur_stag && req_stag && !cur_ctag)) { + features |= NETIF_VLAN_FILTERING_FEATURES; + netdev_warn(netdev, "802.1Q and 802.1ad VLAN filtering must be either both on or both off. VLAN filtering has been enabled for both types.\n"); + } else if ((cur_ctag && !req_ctag && cur_stag) || + (cur_stag && !req_stag && cur_ctag)) { + features &= ~NETIF_VLAN_FILTERING_FEATURES; + netdev_warn(netdev, "802.1Q and 802.1ad VLAN filtering must be either both on or both off. VLAN filtering has been disabled for both types.\n"); + } } else { - netdev_warn(netdev, "cannot support requested VLAN filtering settings, clearing all supported VLAN filtering settings\n"); - features &= ~supported_vlan_filtering; + if (req_vlan_fltr & NETIF_F_HW_VLAN_STAG_FILTER) + netdev_warn(netdev, "cannot support requested 802.1ad filtering setting in SVM mode\n"); + + if (req_vlan_fltr & NETIF_F_HW_VLAN_CTAG_FILTER) + features |= NETIF_F_HW_VLAN_CTAG_FILTER; } } -- cgit From be2af71496a54a7195ac62caba6fab49cfe5006c Mon Sep 17 00:00:00 2001 From: Przemyslaw Patynowski Date: Thu, 2 Jun 2022 12:09:04 +0200 Subject: ice: Fix queue config fail handling Disable VF's RX/TX queues, when VIRTCHNL_OP_CONFIG_VSI_QUEUES fail. Not disabling them might lead to scenario, where PF driver leaves VF queues enabled, when VF's VSI failed queue config. In this scenario VF should not have RX/TX queues enabled. If PF failed to set up VF's queues, VF will reset due to TX timeouts in VF driver. Initialize iterator 'i' to -1, so if error happens prior to configuring queues then error path code will not disable queue 0. Loop that configures queues will is using same iterator, so error path code will only disable queues that were configured. Fixes: 77ca27c41705 ("ice: add support for virtchnl_queue_select.[tx|rx]_queues bitmap") Suggested-by: Slawomir Laba Signed-off-by: Przemyslaw Patynowski Signed-off-by: Mateusz Palczewski Tested-by: Konrad Jankowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_virtchnl.c | 53 +++++++++++++-------------- 1 file changed, 26 insertions(+), 27 deletions(-) diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c index 1d9b84c3937a..4547bc1f7cee 100644 --- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c +++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c @@ -1569,35 +1569,27 @@ error_param: */ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) { - enum virtchnl_status_code v_ret = VIRTCHNL_STATUS_SUCCESS; struct virtchnl_vsi_queue_config_info *qci = (struct virtchnl_vsi_queue_config_info *)msg; struct virtchnl_queue_pair_info *qpi; struct ice_pf *pf = vf->pf; struct ice_vsi *vsi; - int i, q_idx; + int i = -1, q_idx; - if (!test_bit(ICE_VF_STATE_ACTIVE, vf->vf_states)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (!test_bit(ICE_VF_STATE_ACTIVE, vf->vf_states)) goto error_param; - } - if (!ice_vc_isvalid_vsi_id(vf, qci->vsi_id)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (!ice_vc_isvalid_vsi_id(vf, qci->vsi_id)) goto error_param; - } vsi = ice_get_vf_vsi(vf); - if (!vsi) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (!vsi) goto error_param; - } if (qci->num_queue_pairs > ICE_MAX_RSS_QS_PER_VF || qci->num_queue_pairs > min_t(u16, vsi->alloc_txq, vsi->alloc_rxq)) { dev_err(ice_pf_to_dev(pf), "VF-%d requesting more than supported number of queues: %d\n", vf->vf_id, min_t(u16, vsi->alloc_txq, vsi->alloc_rxq)); - v_ret = VIRTCHNL_STATUS_ERR_PARAM; goto error_param; } @@ -1610,7 +1602,6 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) !ice_vc_isvalid_ring_len(qpi->txq.ring_len) || !ice_vc_isvalid_ring_len(qpi->rxq.ring_len) || !ice_vc_isvalid_q_id(vf, qci->vsi_id, qpi->txq.queue_id)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; goto error_param; } @@ -1620,7 +1611,6 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) * for selected "vsi" */ if (q_idx >= vsi->alloc_txq || q_idx >= vsi->alloc_rxq) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; goto error_param; } @@ -1630,14 +1620,13 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) vsi->tx_rings[i]->count = qpi->txq.ring_len; /* Disable any existing queue first */ - if (ice_vf_vsi_dis_single_txq(vf, vsi, q_idx)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + if (ice_vf_vsi_dis_single_txq(vf, vsi, q_idx)) goto error_param; - } /* Configure a queue with the requested settings */ if (ice_vsi_cfg_single_txq(vsi, vsi->tx_rings, q_idx)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + dev_warn(ice_pf_to_dev(pf), "VF-%d failed to configure TX queue %d\n", + vf->vf_id, i); goto error_param; } } @@ -1651,17 +1640,13 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) if (qpi->rxq.databuffer_size != 0 && (qpi->rxq.databuffer_size > ((16 * 1024) - 128) || - qpi->rxq.databuffer_size < 1024)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + qpi->rxq.databuffer_size < 1024)) goto error_param; - } vsi->rx_buf_len = qpi->rxq.databuffer_size; vsi->rx_rings[i]->rx_buf_len = vsi->rx_buf_len; if (qpi->rxq.max_pkt_size > max_frame_size || - qpi->rxq.max_pkt_size < 64) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + qpi->rxq.max_pkt_size < 64) goto error_param; - } vsi->max_frame = qpi->rxq.max_pkt_size; /* add space for the port VLAN since the VF driver is @@ -1672,16 +1657,30 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg) vsi->max_frame += VLAN_HLEN; if (ice_vsi_cfg_single_rxq(vsi, q_idx)) { - v_ret = VIRTCHNL_STATUS_ERR_PARAM; + dev_warn(ice_pf_to_dev(pf), "VF-%d failed to configure RX queue %d\n", + vf->vf_id, i); goto error_param; } } } + /* send the response to the VF */ + return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES, + VIRTCHNL_STATUS_SUCCESS, NULL, 0); error_param: + /* disable whatever we can */ + for (; i >= 0; i--) { + if (ice_vsi_ctrl_one_rx_ring(vsi, false, i, true)) + dev_err(ice_pf_to_dev(pf), "VF-%d could not disable RX queue %d\n", + vf->vf_id, i); + if (ice_vf_vsi_dis_single_txq(vf, vsi, i)) + dev_err(ice_pf_to_dev(pf), "VF-%d could not disable TX queue %d\n", + vf->vf_id, i); + } + /* send the response to the VF */ - return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES, v_ret, - NULL, 0); + return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_CONFIG_VSI_QUEUES, + VIRTCHNL_STATUS_ERR_PARAM, NULL, 0); } /** -- cgit From efe41860008e57fb6b69855b4b93fdf34bc42798 Mon Sep 17 00:00:00 2001 From: Przemyslaw Patynowski Date: Thu, 2 Jun 2022 12:09:17 +0200 Subject: ice: Fix memory corruption in VF driver Disable VF's RX/TX queues, when it's disabled. VF can have queues enabled, when it requests a reset. If PF driver assumes that VF is disabled, while VF still has queues configured, VF may unmap DMA resources. In such scenario device still can map packets to memory, which ends up silently corrupting it. Previously, VF driver could experience memory corruption, which lead to crash: [ 5119.170157] BUG: unable to handle kernel paging request at 00001b9780003237 [ 5119.170166] PGD 0 P4D 0 [ 5119.170173] Oops: 0002 [#1] PREEMPT_RT SMP PTI [ 5119.170181] CPU: 30 PID: 427592 Comm: kworker/u96:2 Kdump: loaded Tainted: G W I --------- - - 4.18.0-372.9.1.rt7.166.el8.x86_64 #1 [ 5119.170189] Hardware name: Dell Inc. PowerEdge R740/014X06, BIOS 2.3.10 08/15/2019 [ 5119.170193] Workqueue: iavf iavf_adminq_task [iavf] [ 5119.170219] RIP: 0010:__page_frag_cache_drain+0x5/0x30 [ 5119.170238] Code: 0f 0f b6 77 51 85 f6 74 07 31 d2 e9 05 df ff ff e9 90 fe ff ff 48 8b 05 49 db 33 01 eb b4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 29 77 34 74 01 c3 48 8b 07 f6 c4 80 74 0f 0f b6 77 51 85 f6 74 [ 5119.170244] RSP: 0018:ffffa43b0bdcfd78 EFLAGS: 00010282 [ 5119.170250] RAX: ffffffff896b3e40 RBX: ffff8fb282524000 RCX: 0000000000000002 [ 5119.170254] RDX: 0000000049000000 RSI: 0000000000000000 RDI: 00001b9780003203 [ 5119.170259] RBP: ffff8fb248217b00 R08: 0000000000000022 R09: 0000000000000009 [ 5119.170262] R10: 2b849d6300000000 R11: 0000000000000020 R12: 0000000000000000 [ 5119.170265] R13: 0000000000001000 R14: 0000000000000009 R15: 0000000000000000 [ 5119.170269] FS: 0000000000000000(0000) GS:ffff8fb1201c0000(0000) knlGS:0000000000000000 [ 5119.170274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5119.170279] CR2: 00001b9780003237 CR3: 00000008f3e1a003 CR4: 00000000007726e0 [ 5119.170283] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5119.170286] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5119.170290] PKRU: 55555554 [ 5119.170292] Call Trace: [ 5119.170298] iavf_clean_rx_ring+0xad/0x110 [iavf] [ 5119.170324] iavf_free_rx_resources+0xe/0x50 [iavf] [ 5119.170342] iavf_free_all_rx_resources.part.51+0x30/0x40 [iavf] [ 5119.170358] iavf_virtchnl_completion+0xd8a/0x15b0 [iavf] [ 5119.170377] ? iavf_clean_arq_element+0x210/0x280 [iavf] [ 5119.170397] iavf_adminq_task+0x126/0x2e0 [iavf] [ 5119.170416] process_one_work+0x18f/0x420 [ 5119.170429] worker_thread+0x30/0x370 [ 5119.170437] ? process_one_work+0x420/0x420 [ 5119.170445] kthread+0x151/0x170 [ 5119.170452] ? set_kthread_struct+0x40/0x40 [ 5119.170460] ret_from_fork+0x35/0x40 [ 5119.170477] Modules linked in: iavf sctp ip6_udp_tunnel udp_tunnel mlx4_en mlx4_core nfp tls vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc intel_rapl_msr iTCO_wdt iTCO_vendor_support dell_smbios wmi_bmof dell_wmi_descriptor dcdbas kvm_intel kvm irqbypass intel_rapl_common isst_if_common skx_edac irdma nfit libnvdimm x86_pkg_temp_thermal i40e intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ib_uverbs rapl ipmi_ssif intel_cstate intel_uncore mei_me pcspkr acpi_ipmi ib_core mei lpc_ich i2c_i801 ipmi_si ipmi_devintf wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ice ahci drm libahci crc32c_intel libata tg3 megaraid_sas [ 5119.170613] i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: iavf] [ 5119.170627] CR2: 00001b9780003237 Fixes: ec4f5a436bdf ("ice: Check if VF is disabled for Opcode and other operations") Signed-off-by: Przemyslaw Patynowski Co-developed-by: Slawomir Laba Signed-off-by: Slawomir Laba Signed-off-by: Mateusz Palczewski Tested-by: Konrad Jankowski Signed-off-by: Tony Nguyen --- drivers/net/ethernet/intel/ice/ice_vf_lib.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/intel/ice/ice_vf_lib.c b/drivers/net/ethernet/intel/ice/ice_vf_lib.c index cd8e6b50968c..7adf9ddf129e 100644 --- a/drivers/net/ethernet/intel/ice/ice_vf_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_vf_lib.c @@ -504,6 +504,11 @@ int ice_reset_vf(struct ice_vf *vf, u32 flags) } if (ice_is_vf_disabled(vf)) { + vsi = ice_get_vf_vsi(vf); + if (WARN_ON(!vsi)) + return -EINVAL; + ice_vsi_stop_lan_tx_rings(vsi, ICE_NO_RESET, vf->vf_id); + ice_vsi_stop_all_rx_rings(vsi); dev_dbg(dev, "VF is already disabled, there is no need for resetting it, telling VM, all is fine %d\n", vf->vf_id); return 0; -- cgit From 018ab4fabddd94f1c96f3b59e180691b9e88d5d8 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Tue, 14 Jun 2022 10:36:11 -0700 Subject: netfs: fix up netfs_inode_init() docbook comment Commit e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") changed the argument types and names, and actually updated the comment too (although that was thanks to David Howells, not me: my original patch only changed the code). But the comment fixup didn't go quite far enough, and didn't change the argument name in the comment, resulting in include/linux/netfs.h:314: warning: Function parameter or member 'ctx' not described in 'netfs_inode_init' include/linux/netfs.h:314: warning: Excess function parameter 'inode' description in 'netfs_inode_init' during htmldoc generation. Fixes: e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") Reported-by: Stephen Rothwell Signed-off-by: Linus Torvalds --- include/linux/netfs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 097cdd644665..1773e5df8e65 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -304,7 +304,7 @@ static inline struct netfs_inode *netfs_inode(struct inode *inode) /** * netfs_inode_init - Initialise a netfslib inode context - * @inode: The netfs inode to initialise + * @ctx: The netfs inode to initialise * @ops: The netfs's operations list * * Initialise the netfs library context struct. This is expected to follow on -- cgit From d7dd6eccfbc95ac47a12396f84e7e1b361db654b Mon Sep 17 00:00:00 2001 From: Christophe JAILLET Date: Mon, 13 Jun 2022 22:53:50 +0200 Subject: net: bgmac: Fix an erroneous kfree() in bgmac_remove() 'bgmac' is part of a managed resource allocated with bgmac_alloc(). It should not be freed explicitly. Remove the erroneous kfree() from the .remove() function. Fixes: 34a5102c3235 ("net: bgmac: allocate struct bgmac just once & don't copy it") Signed-off-by: Christophe JAILLET Reviewed-by: Florian Fainelli Link: https://lore.kernel.org/r/a026153108dd21239036a032b95c25b5cece253b.1655153616.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/broadcom/bgmac-bcma.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bgmac-bcma.c b/drivers/net/ethernet/broadcom/bgmac-bcma.c index e6f48786949c..02bd3cf9a260 100644 --- a/drivers/net/ethernet/broadcom/bgmac-bcma.c +++ b/drivers/net/ethernet/broadcom/bgmac-bcma.c @@ -332,7 +332,6 @@ static void bgmac_remove(struct bcma_device *core) bcma_mdio_mii_unregister(bgmac->mii_bus); bgmac_enet_remove(bgmac); bcma_set_drvdata(core, NULL); - kfree(bgmac); } static struct bcma_driver bgmac_bcma_driver = { -- cgit From 56315b6bf7fc63d2b26c37869d2753f765849bd6 Mon Sep 17 00:00:00 2001 From: Oleksij Rempel Date: Fri, 10 Jun 2022 10:16:21 +0200 Subject: ARM: dts: at91: ksz9477_evb: fix port/phy validation Latest drivers version requires phy-mode to be set. Otherwise we will use "NA" mode and the switch driver will invalidate this port mode. Fixes: 65ac79e18120 ("net: dsa: microchip: add the phylink get_caps") Signed-off-by: Oleksij Rempel Link: https://lore.kernel.org/r/20220610081621.584393-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski --- arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts b/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts index 443e8b022897..14af1fd6d247 100644 --- a/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts +++ b/arch/arm/boot/dts/at91-sama5d3_ksz9477_evb.dts @@ -120,26 +120,31 @@ port@0 { reg = <0>; label = "lan1"; + phy-mode = "internal"; }; port@1 { reg = <1>; label = "lan2"; + phy-mode = "internal"; }; port@2 { reg = <2>; label = "lan3"; + phy-mode = "internal"; }; port@3 { reg = <3>; label = "lan4"; + phy-mode = "internal"; }; port@4 { reg = <4>; label = "lan5"; + phy-mode = "internal"; }; port@5 { -- cgit From b60377de779052bf00b34a62f0bae03c92b88776 Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Mon, 13 Jun 2022 14:18:26 +0200 Subject: MAINTAINERS: add include/dt-bindings/net to NETWORKING DRIVERS Maintainers of the directory Documentation/devicetree/bindings/net are also the maintainers of the corresponding directory include/dt-bindings/net. Add the file entry for include/dt-bindings/net to the appropriate section in MAINTAINERS. Signed-off-by: Lukas Bulwahn Link: https://lore.kernel.org/r/20220613121826.11484-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 96158b337b40..0c3847fb2bbc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13799,6 +13799,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git F: Documentation/devicetree/bindings/net/ F: drivers/connector/ F: drivers/net/ +F: include/dt-bindings/net/ F: include/linux/etherdevice.h F: include/linux/fcdevice.h F: include/linux/fddidevice.h -- cgit From 36a15e1cb134c0395261ba1940762703f778438c Mon Sep 17 00:00:00 2001 From: Jose Alonso Date: Mon, 13 Jun 2022 15:32:44 -0300 Subject: net: usb: ax88179_178a needs FLAG_SEND_ZLP The extra byte inserted by usbnet.c when (length % dev->maxpacket == 0) is causing problems to device. This patch sets FLAG_SEND_ZLP to avoid this. Tested with: 0b95:1790 ASIX Electronics Corp. AX88179 Gigabit Ethernet Problems observed: ====================================================================== 1) Using ssh/sshfs. The remote sshd daemon can abort with the message: "message authentication code incorrect" This happens because the tcp message sent is corrupted during the USB "Bulk out". The device calculate the tcp checksum and send a valid tcp message to the remote sshd. Then the encryption detects the error and aborts. 2) NETDEV WATCHDOG: ... (ax88179_178a): transmit queue 0 timed out 3) Stop normal work without any log message. The "Bulk in" continue receiving packets normally. The host sends "Bulk out" and the device responds with -ECONNRESET. (The netusb.c code tx_complete ignore -ECONNRESET) Under normal conditions these errors take days to happen and in intense usage take hours. A test with ping gives packet loss, showing that something is wrong: ping -4 -s 462 {destination} # 462 = 512 - 42 - 8 Not all packets fail. My guess is that the device tries to find another packet starting at the extra byte and will fail or not depending on the next bytes (old buffer content). ====================================================================== Signed-off-by: Jose Alonso Signed-off-by: David S. Miller --- drivers/net/usb/ax88179_178a.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c index 7a8c11a26eb5..4704ed6f00ef 100644 --- a/drivers/net/usb/ax88179_178a.c +++ b/drivers/net/usb/ax88179_178a.c @@ -1750,7 +1750,7 @@ static const struct driver_info ax88179_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1763,7 +1763,7 @@ static const struct driver_info ax88178a_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1776,7 +1776,7 @@ static const struct driver_info cypress_GX3_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1789,7 +1789,7 @@ static const struct driver_info dlink_dub1312_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1802,7 +1802,7 @@ static const struct driver_info sitecom_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1815,7 +1815,7 @@ static const struct driver_info samsung_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1828,7 +1828,7 @@ static const struct driver_info lenovo_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1841,7 +1841,7 @@ static const struct driver_info belkin_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1854,7 +1854,7 @@ static const struct driver_info toshiba_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1867,7 +1867,7 @@ static const struct driver_info mct_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1880,7 +1880,7 @@ static const struct driver_info at_umc2000_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1893,7 +1893,7 @@ static const struct driver_info at_umc200_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; @@ -1906,7 +1906,7 @@ static const struct driver_info at_umc2000sp_info = { .link_reset = ax88179_link_reset, .reset = ax88179_reset, .stop = ax88179_stop, - .flags = FLAG_ETHER | FLAG_FRAMING_AX, + .flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_SEND_ZLP, .rx_fixup = ax88179_rx_fixup, .tx_fixup = ax88179_tx_fixup, }; -- cgit From 219b51a6f040fa5367adadd7d58c4dda0896a01d Mon Sep 17 00:00:00 2001 From: Duoming Zhou Date: Tue, 14 Jun 2022 17:25:57 +0800 Subject: net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg The skb_recv_datagram() in ax25_recvmsg() will hold lock_sock and block until it receives a packet from the remote. If the client doesn`t connect to server and calls read() directly, it will not receive any packets forever. As a result, the deadlock will happen. The fail log caused by deadlock is shown below: [ 369.606973] INFO: task ax25_deadlock:157 blocked for more than 245 seconds. [ 369.608919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.613058] Call Trace: [ 369.613315] [ 369.614072] __schedule+0x2f9/0xb20 [ 369.615029] schedule+0x49/0xb0 [ 369.615734] __lock_sock+0x92/0x100 [ 369.616763] ? destroy_sched_domains_rcu+0x20/0x20 [ 369.617941] lock_sock_nested+0x6e/0x70 [ 369.618809] ax25_bind+0xaa/0x210 [ 369.619736] __sys_bind+0xca/0xf0 [ 369.620039] ? do_futex+0xae/0x1b0 [ 369.620387] ? __x64_sys_futex+0x7c/0x1c0 [ 369.620601] ? fpregs_assert_state_consistent+0x19/0x40 [ 369.620613] __x64_sys_bind+0x11/0x20 [ 369.621791] do_syscall_64+0x3b/0x90 [ 369.622423] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 369.623319] RIP: 0033:0x7f43c8aa8af7 [ 369.624301] RSP: 002b:00007f43c8197ef8 EFLAGS: 00000246 ORIG_RAX: 0000000000000031 [ 369.625756] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f43c8aa8af7 [ 369.626724] RDX: 0000000000000010 RSI: 000055768e2021d0 RDI: 0000000000000005 [ 369.628569] RBP: 00007f43c8197f00 R08: 0000000000000011 R09: 00007f43c8198700 [ 369.630208] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff845e6afe [ 369.632240] R13: 00007fff845e6aff R14: 00007f43c8197fc0 R15: 00007f43c8198700 This patch replaces skb_recv_datagram() with an open-coded variant of it releasing the socket lock before the __skb_wait_for_more_packets() call and re-acquiring it after such call in order that other functions that need socket lock could be executed. what's more, the socket lock will be released only when recvmsg() will block and that should produce nicer overall behavior. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Thomas Osterried Signed-off-by: Duoming Zhou Reported-by: Thomas Habets Acked-by: Paolo Abeni Reviewed-by: Eric Dumazet Signed-off-by: David S. Miller --- net/ax25/af_ax25.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c index 95393bb2760b..4c7030ed8d33 100644 --- a/net/ax25/af_ax25.c +++ b/net/ax25/af_ax25.c @@ -1661,9 +1661,12 @@ static int ax25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { struct sock *sk = sock->sk; - struct sk_buff *skb; + struct sk_buff *skb, *last; + struct sk_buff_head *sk_queue; int copied; int err = 0; + int off = 0; + long timeo; lock_sock(sk); /* @@ -1675,10 +1678,29 @@ static int ax25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, goto out; } - /* Now we can treat all alike */ - skb = skb_recv_datagram(sk, flags, &err); - if (skb == NULL) - goto out; + /* We need support for non-blocking reads. */ + sk_queue = &sk->sk_receive_queue; + skb = __skb_try_recv_datagram(sk, sk_queue, flags, &off, &err, &last); + /* If no packet is available, release_sock(sk) and try again. */ + if (!skb) { + if (err != -EAGAIN) + goto out; + release_sock(sk); + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (timeo && !__skb_wait_for_more_packets(sk, sk_queue, &err, + &timeo, last)) { + skb = __skb_try_recv_datagram(sk, sk_queue, flags, &off, + &err, &last); + if (skb) + break; + + if (err != -EAGAIN) + goto done; + } + if (!skb) + goto done; + lock_sock(sk); + } if (!sk_to_ax25(sk)->pidincl) skb_pull(skb, 1); /* Remove PID */ @@ -1725,6 +1747,7 @@ static int ax25_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, out: release_sock(sk); +done: return err; } -- cgit From 6a1c3767d82ed8233de1263aa7da81595e176087 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 12 Jun 2022 02:22:30 +0900 Subject: certs/blacklist_hashes.c: fix const confusion in certs blacklist MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This file fails to compile as follows: CC certs/blacklist_hashes.o certs/blacklist_hashes.c:4:1: error: ignoring attribute ‘section (".init.data")’ because it conflicts with previous ‘section (".init.rodata")’ [-Werror=attributes] 4 | const char __initdata *const blacklist_hashes[] = { | ^~~~~ In file included from certs/blacklist_hashes.c:2: certs/blacklist.h:5:38: note: previous declaration here 5 | extern const char __initconst *const blacklist_hashes[]; | ^~~~~~~~~~~~~~~~ Apply the same fix as commit 2be04df5668d ("certs/blacklist_nohashes.c: fix const confusion in certs blacklist"). Fixes: 734114f8782f ("KEYS: Add a system blacklist keyring") Signed-off-by: Masahiro Yamada Reviewed-by: Jarkko Sakkinen Reviewed-by: Mickaël Salaün Signed-off-by: Jarkko Sakkinen --- certs/blacklist_hashes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/certs/blacklist_hashes.c b/certs/blacklist_hashes.c index 344892337be0..d5961aa3d338 100644 --- a/certs/blacklist_hashes.c +++ b/certs/blacklist_hashes.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include "blacklist.h" -const char __initdata *const blacklist_hashes[] = { +const char __initconst *const blacklist_hashes[] = { #include CONFIG_SYSTEM_BLACKLIST_HASH_LIST , NULL }; -- cgit From 27b5b22d252c6d71a2a37a4bdf18d0be6d25ee5a Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Sun, 12 Jun 2022 02:22:31 +0900 Subject: certs: fix and refactor CONFIG_SYSTEM_BLACKLIST_HASH_LIST build MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit addf466389d9 ("certs: Check that builtin blacklist hashes are valid") was applied 8 months after the submission. In the meantime, the base code had been removed by commit b8c96a6b466c ("certs: simplify $(srctree)/ handling and remove config_filename macro"). Fix the Makefile. Create a local copy of $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST). It is included from certs/blacklist_hashes.c and also works as a timestamp. Send error messages from check-blacklist-hashes.awk to stderr instead of stdout. Fixes: addf466389d9 ("certs: Check that builtin blacklist hashes are valid") Signed-off-by: Masahiro Yamada Reviewed-by: Jarkko Sakkinen Reviewed-by: Mickaël Salaün Signed-off-by: Jarkko Sakkinen --- certs/.gitignore | 2 +- certs/Makefile | 20 ++++++++++---------- certs/blacklist_hashes.c | 2 +- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/certs/.gitignore b/certs/.gitignore index 56637aceaf81..cec5465f31c1 100644 --- a/certs/.gitignore +++ b/certs/.gitignore @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0-only -/blacklist_hashes_checked +/blacklist_hash_list /extract-cert /x509_certificate_list /x509_revocation_list diff --git a/certs/Makefile b/certs/Makefile index cb1a9da3fc58..a8d628fd5f7b 100644 --- a/certs/Makefile +++ b/certs/Makefile @@ -7,22 +7,22 @@ obj-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += system_keyring.o system_certificates.o c obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist.o common.o obj-$(CONFIG_SYSTEM_REVOCATION_LIST) += revocation_certificates.o ifneq ($(CONFIG_SYSTEM_BLACKLIST_HASH_LIST),) -quiet_cmd_check_blacklist_hashes = CHECK $(patsubst "%",%,$(2)) - cmd_check_blacklist_hashes = $(AWK) -f $(srctree)/scripts/check-blacklist-hashes.awk $(2); touch $@ -$(eval $(call config_filename,SYSTEM_BLACKLIST_HASH_LIST)) +$(obj)/blacklist_hashes.o: $(obj)/blacklist_hash_list +CFLAGS_blacklist_hashes.o := -I $(obj) -$(obj)/blacklist_hashes.o: $(obj)/blacklist_hashes_checked +quiet_cmd_check_and_copy_blacklist_hash_list = GEN $@ + cmd_check_and_copy_blacklist_hash_list = \ + $(AWK) -f $(srctree)/scripts/check-blacklist-hashes.awk $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST) >&2; \ + cat $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST) > $@ -CFLAGS_blacklist_hashes.o += -I$(srctree) - -targets += blacklist_hashes_checked -$(obj)/blacklist_hashes_checked: $(SYSTEM_BLACKLIST_HASH_LIST_SRCPREFIX)$(SYSTEM_BLACKLIST_HASH_LIST_FILENAME) scripts/check-blacklist-hashes.awk FORCE - $(call if_changed,check_blacklist_hashes,$(SYSTEM_BLACKLIST_HASH_LIST_SRCPREFIX)$(CONFIG_SYSTEM_BLACKLIST_HASH_LIST)) +$(obj)/blacklist_hash_list: $(CONFIG_SYSTEM_BLACKLIST_HASH_LIST) FORCE + $(call if_changed,check_and_copy_blacklist_hash_list) obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist_hashes.o else obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist_nohashes.o endif +targets += blacklist_hash_list quiet_cmd_extract_certs = CERT $@ cmd_extract_certs = $(obj)/extract-cert $(extract-cert-in) $@ @@ -33,7 +33,7 @@ $(obj)/system_certificates.o: $(obj)/x509_certificate_list $(obj)/x509_certificate_list: $(CONFIG_SYSTEM_TRUSTED_KEYS) $(obj)/extract-cert FORCE $(call if_changed,extract_certs) -targets += x509_certificate_list blacklist_hashes_checked +targets += x509_certificate_list # If module signing is requested, say by allyesconfig, but a key has not been # supplied, then one will need to be generated to make sure the build does not diff --git a/certs/blacklist_hashes.c b/certs/blacklist_hashes.c index d5961aa3d338..86d66fe11348 100644 --- a/certs/blacklist_hashes.c +++ b/certs/blacklist_hashes.c @@ -2,6 +2,6 @@ #include "blacklist.h" const char __initconst *const blacklist_hashes[] = { -#include CONFIG_SYSTEM_BLACKLIST_HASH_LIST +#include "blacklist_hash_list" , NULL }; -- cgit From 593d1ebe00a45af5cb7bda1235c0790987c2a2b2 Mon Sep 17 00:00:00 2001 From: Joanne Koong Date: Wed, 15 Jun 2022 12:32:13 -0700 Subject: Revert "net: Add a second bind table hashed by port and address" This reverts: commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/ There are a few things that need to be fixed here: * Updating bhash2 in cases where the socket's rcv saddr changes * Adding bhash2 hashbucket locks Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/ Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski --- include/net/inet_connection_sock.h | 3 - include/net/inet_hashtables.h | 68 +------ include/net/sock.h | 14 -- net/dccp/proto.c | 33 +--- net/ipv4/inet_connection_sock.c | 247 +++++++------------------- net/ipv4/inet_hashtables.c | 193 ++------------------ net/ipv4/tcp.c | 14 +- tools/testing/selftests/net/.gitignore | 1 - tools/testing/selftests/net/Makefile | 2 - tools/testing/selftests/net/bind_bhash_test.c | 119 ------------- 10 files changed, 83 insertions(+), 611 deletions(-) delete mode 100644 tools/testing/selftests/net/bind_bhash_test.c diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 077cd730ce2f..85cd695e7fd1 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -25,7 +25,6 @@ #undef INET_CSK_CLEAR_TIMERS struct inet_bind_bucket; -struct inet_bind2_bucket; struct tcp_congestion_ops; /* @@ -58,7 +57,6 @@ struct inet_connection_sock_af_ops { * * @icsk_accept_queue: FIFO of established children * @icsk_bind_hash: Bind node - * @icsk_bind2_hash: Bind node in the bhash2 table * @icsk_timeout: Timeout * @icsk_retransmit_timer: Resend (no ack) * @icsk_rto: Retransmit timeout @@ -85,7 +83,6 @@ struct inet_connection_sock { struct inet_sock icsk_inet; struct request_sock_queue icsk_accept_queue; struct inet_bind_bucket *icsk_bind_hash; - struct inet_bind2_bucket *icsk_bind2_hash; unsigned long icsk_timeout; struct timer_list icsk_retransmit_timer; struct timer_list icsk_delack_timer; diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index a0887b70967b..ebfa3df6f8dc 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -90,32 +90,11 @@ struct inet_bind_bucket { struct hlist_head owners; }; -struct inet_bind2_bucket { - possible_net_t ib_net; - int l3mdev; - unsigned short port; - union { -#if IS_ENABLED(CONFIG_IPV6) - struct in6_addr v6_rcv_saddr; -#endif - __be32 rcv_saddr; - }; - /* Node in the inet2_bind_hashbucket chain */ - struct hlist_node node; - /* List of sockets hashed to this bucket */ - struct hlist_head owners; -}; - static inline struct net *ib_net(struct inet_bind_bucket *ib) { return read_pnet(&ib->ib_net); } -static inline struct net *ib2_net(struct inet_bind2_bucket *ib) -{ - return read_pnet(&ib->ib_net); -} - #define inet_bind_bucket_for_each(tb, head) \ hlist_for_each_entry(tb, head, node) @@ -124,15 +103,6 @@ struct inet_bind_hashbucket { struct hlist_head chain; }; -/* This is synchronized using the inet_bind_hashbucket's spinlock. - * Instead of having separate spinlocks, the inet_bind2_hashbucket can share - * the inet_bind_hashbucket's given that in every case where the bhash2 table - * is useful, a lookup in the bhash table also occurs. - */ -struct inet_bind2_hashbucket { - struct hlist_head chain; -}; - /* Sockets can be hashed in established or listening table. * We must use different 'nulls' end-of-chain value for all hash buckets : * A socket might transition from ESTABLISH to LISTEN state without @@ -164,12 +134,6 @@ struct inet_hashinfo { */ struct kmem_cache *bind_bucket_cachep; struct inet_bind_hashbucket *bhash; - /* The 2nd binding table hashed by port and address. - * This is used primarily for expediting the resolution of bind - * conflicts. - */ - struct kmem_cache *bind2_bucket_cachep; - struct inet_bind2_hashbucket *bhash2; unsigned int bhash_size; /* The 2nd listener table hashed by local port and address */ @@ -229,36 +193,6 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net, void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket *tb); -static inline bool check_bind_bucket_match(struct inet_bind_bucket *tb, - struct net *net, - const unsigned short port, - int l3mdev) -{ - return net_eq(ib_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev; -} - -struct inet_bind2_bucket * -inet_bind2_bucket_create(struct kmem_cache *cachep, struct net *net, - struct inet_bind2_hashbucket *head, - const unsigned short port, int l3mdev, - const struct sock *sk); - -void inet_bind2_bucket_destroy(struct kmem_cache *cachep, - struct inet_bind2_bucket *tb); - -struct inet_bind2_bucket * -inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, - const unsigned short port, int l3mdev, - struct sock *sk, - struct inet_bind2_hashbucket **head); - -bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, - struct net *net, - const unsigned short port, - int l3mdev, - const struct sock *sk); - static inline u32 inet_bhashfn(const struct net *net, const __u16 lport, const u32 bhash_size) { @@ -266,7 +200,7 @@ static inline u32 inet_bhashfn(const struct net *net, const __u16 lport, } void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb, - struct inet_bind2_bucket *tb2, const unsigned short snum); + const unsigned short snum); /* Caller must disable local BH processing. */ int __inet_inherit_port(const struct sock *sk, struct sock *child); diff --git a/include/net/sock.h b/include/net/sock.h index c585ef6565d9..72ca97ccb460 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -348,7 +348,6 @@ struct sk_filter; * @sk_txtime_report_errors: set report errors mode for SO_TXTIME * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference - * @sk_bind2_node: bind node in the bhash2 table */ struct sock { /* @@ -538,7 +537,6 @@ struct sock { #endif struct rcu_head sk_rcu; netns_tracker ns_tracker; - struct hlist_node sk_bind2_node; }; enum sk_pacing { @@ -819,16 +817,6 @@ static inline void sk_add_bind_node(struct sock *sk, hlist_add_head(&sk->sk_bind_node, list); } -static inline void __sk_del_bind2_node(struct sock *sk) -{ - __hlist_del(&sk->sk_bind2_node); -} - -static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list) -{ - hlist_add_head(&sk->sk_bind2_node, list); -} - #define sk_for_each(__sk, list) \ hlist_for_each_entry(__sk, list, sk_node) #define sk_for_each_rcu(__sk, list) \ @@ -846,8 +834,6 @@ static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list) hlist_for_each_entry_safe(__sk, tmp, list, sk_node) #define sk_for_each_bound(__sk, list) \ hlist_for_each_entry(__sk, list, sk_bind_node) -#define sk_for_each_bound_bhash2(__sk, list) \ - hlist_for_each_entry(__sk, list, sk_bind2_node) /** * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 2e78458900f2..eb8e128e43e8 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1120,12 +1120,6 @@ static int __init dccp_init(void) SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); if (!dccp_hashinfo.bind_bucket_cachep) goto out_free_hashinfo2; - dccp_hashinfo.bind2_bucket_cachep = - kmem_cache_create("dccp_bind2_bucket", - sizeof(struct inet_bind2_bucket), 0, - SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); - if (!dccp_hashinfo.bind2_bucket_cachep) - goto out_free_bind_bucket_cachep; /* * Size and allocate the main established and bind bucket @@ -1156,7 +1150,7 @@ static int __init dccp_init(void) if (!dccp_hashinfo.ehash) { DCCP_CRIT("Failed to allocate DCCP established hash table"); - goto out_free_bind2_bucket_cachep; + goto out_free_bind_bucket_cachep; } for (i = 0; i <= dccp_hashinfo.ehash_mask; i++) @@ -1182,23 +1176,14 @@ static int __init dccp_init(void) goto out_free_dccp_locks; } - dccp_hashinfo.bhash2 = (struct inet_bind2_hashbucket *) - __get_free_pages(GFP_ATOMIC | __GFP_NOWARN, bhash_order); - - if (!dccp_hashinfo.bhash2) { - DCCP_CRIT("Failed to allocate DCCP bind2 hash table"); - goto out_free_dccp_bhash; - } - for (i = 0; i < dccp_hashinfo.bhash_size; i++) { spin_lock_init(&dccp_hashinfo.bhash[i].lock); INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain); - INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain); } rc = dccp_mib_init(); if (rc) - goto out_free_dccp_bhash2; + goto out_free_dccp_bhash; rc = dccp_ackvec_init(); if (rc) @@ -1222,38 +1207,30 @@ out_ackvec_exit: dccp_ackvec_exit(); out_free_dccp_mib: dccp_mib_exit(); -out_free_dccp_bhash2: - free_pages((unsigned long)dccp_hashinfo.bhash2, bhash_order); out_free_dccp_bhash: free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order); out_free_dccp_locks: inet_ehash_locks_free(&dccp_hashinfo); out_free_dccp_ehash: free_pages((unsigned long)dccp_hashinfo.ehash, ehash_order); -out_free_bind2_bucket_cachep: - kmem_cache_destroy(dccp_hashinfo.bind2_bucket_cachep); out_free_bind_bucket_cachep: kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep); out_free_hashinfo2: inet_hashinfo2_free_mod(&dccp_hashinfo); out_fail: dccp_hashinfo.bhash = NULL; - dccp_hashinfo.bhash2 = NULL; dccp_hashinfo.ehash = NULL; dccp_hashinfo.bind_bucket_cachep = NULL; - dccp_hashinfo.bind2_bucket_cachep = NULL; return rc; } static void __exit dccp_fini(void) { - int bhash_order = get_order(dccp_hashinfo.bhash_size * - sizeof(struct inet_bind_hashbucket)); - ccid_cleanup_builtins(); dccp_mib_exit(); - free_pages((unsigned long)dccp_hashinfo.bhash, bhash_order); - free_pages((unsigned long)dccp_hashinfo.bhash2, bhash_order); + free_pages((unsigned long)dccp_hashinfo.bhash, + get_order(dccp_hashinfo.bhash_size * + sizeof(struct inet_bind_hashbucket))); free_pages((unsigned long)dccp_hashinfo.ehash, get_order((dccp_hashinfo.ehash_mask + 1) * sizeof(struct inet_ehash_bucket))); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index c0b7e6c21360..53f5f956d948 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -117,32 +117,6 @@ bool inet_rcv_saddr_any(const struct sock *sk) return !sk->sk_rcv_saddr; } -static bool use_bhash2_on_bind(const struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - int addr_type; - - if (sk->sk_family == AF_INET6) { - addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr); - return addr_type != IPV6_ADDR_ANY && - addr_type != IPV6_ADDR_MAPPED; - } -#endif - return sk->sk_rcv_saddr != htonl(INADDR_ANY); -} - -static u32 get_bhash2_nulladdr_hash(const struct sock *sk, struct net *net, - int port) -{ -#if IS_ENABLED(CONFIG_IPV6) - struct in6_addr nulladdr = {}; - - if (sk->sk_family == AF_INET6) - return ipv6_portaddr_hash(net, &nulladdr, port); -#endif - return ipv4_portaddr_hash(net, 0, port); -} - void inet_get_local_port_range(struct net *net, int *low, int *high) { unsigned int seq; @@ -156,71 +130,16 @@ void inet_get_local_port_range(struct net *net, int *low, int *high) } EXPORT_SYMBOL(inet_get_local_port_range); -static bool bind_conflict_exist(const struct sock *sk, struct sock *sk2, - kuid_t sk_uid, bool relax, - bool reuseport_cb_ok, bool reuseport_ok) -{ - int bound_dev_if2; - - if (sk == sk2) - return false; - - bound_dev_if2 = READ_ONCE(sk2->sk_bound_dev_if); - - if (!sk->sk_bound_dev_if || !bound_dev_if2 || - sk->sk_bound_dev_if == bound_dev_if2) { - if (sk->sk_reuse && sk2->sk_reuse && - sk2->sk_state != TCP_LISTEN) { - if (!relax || (!reuseport_ok && sk->sk_reuseport && - sk2->sk_reuseport && reuseport_cb_ok && - (sk2->sk_state == TCP_TIME_WAIT || - uid_eq(sk_uid, sock_i_uid(sk2))))) - return true; - } else if (!reuseport_ok || !sk->sk_reuseport || - !sk2->sk_reuseport || !reuseport_cb_ok || - (sk2->sk_state != TCP_TIME_WAIT && - !uid_eq(sk_uid, sock_i_uid(sk2)))) { - return true; - } - } - return false; -} - -static bool check_bhash2_conflict(const struct sock *sk, - struct inet_bind2_bucket *tb2, kuid_t sk_uid, - bool relax, bool reuseport_cb_ok, - bool reuseport_ok) -{ - struct sock *sk2; - - sk_for_each_bound_bhash2(sk2, &tb2->owners) { - if (sk->sk_family == AF_INET && ipv6_only_sock(sk2)) - continue; - - if (bind_conflict_exist(sk, sk2, sk_uid, relax, - reuseport_cb_ok, reuseport_ok)) - return true; - } - return false; -} - -/* This should be called only when the corresponding inet_bind_bucket spinlock - * is held - */ -static int inet_csk_bind_conflict(const struct sock *sk, int port, - struct inet_bind_bucket *tb, - struct inet_bind2_bucket *tb2, /* may be null */ +static int inet_csk_bind_conflict(const struct sock *sk, + const struct inet_bind_bucket *tb, bool relax, bool reuseport_ok) { - struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo; - kuid_t uid = sock_i_uid((struct sock *)sk); - struct sock_reuseport *reuseport_cb; - struct inet_bind2_hashbucket *head2; - bool reuseport_cb_ok; struct sock *sk2; - struct net *net; - int l3mdev; - u32 hash; + bool reuseport_cb_ok; + bool reuse = sk->sk_reuse; + bool reuseport = !!sk->sk_reuseport; + struct sock_reuseport *reuseport_cb; + kuid_t uid = sock_i_uid((struct sock *)sk); rcu_read_lock(); reuseport_cb = rcu_dereference(sk->sk_reuseport_cb); @@ -231,42 +150,40 @@ static int inet_csk_bind_conflict(const struct sock *sk, int port, /* * Unlike other sk lookup places we do not check * for sk_net here, since _all_ the socks listed - * in tb->owners and tb2->owners list belong - * to the same net + * in tb->owners list belong to the same net - the + * one this bucket belongs to. */ - if (!use_bhash2_on_bind(sk)) { - sk_for_each_bound(sk2, &tb->owners) - if (bind_conflict_exist(sk, sk2, uid, relax, - reuseport_cb_ok, reuseport_ok) && - inet_rcv_saddr_equal(sk, sk2, true)) - return true; + sk_for_each_bound(sk2, &tb->owners) { + int bound_dev_if2; - return false; + if (sk == sk2) + continue; + bound_dev_if2 = READ_ONCE(sk2->sk_bound_dev_if); + if ((!sk->sk_bound_dev_if || + !bound_dev_if2 || + sk->sk_bound_dev_if == bound_dev_if2)) { + if (reuse && sk2->sk_reuse && + sk2->sk_state != TCP_LISTEN) { + if ((!relax || + (!reuseport_ok && + reuseport && sk2->sk_reuseport && + reuseport_cb_ok && + (sk2->sk_state == TCP_TIME_WAIT || + uid_eq(uid, sock_i_uid(sk2))))) && + inet_rcv_saddr_equal(sk, sk2, true)) + break; + } else if (!reuseport_ok || + !reuseport || !sk2->sk_reuseport || + !reuseport_cb_ok || + (sk2->sk_state != TCP_TIME_WAIT && + !uid_eq(uid, sock_i_uid(sk2)))) { + if (inet_rcv_saddr_equal(sk, sk2, true)) + break; + } + } } - - if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, - reuseport_ok)) - return true; - - net = sock_net(sk); - - /* check there's no conflict with an existing IPV6_ADDR_ANY (if ipv6) or - * INADDR_ANY (if ipv4) socket. - */ - hash = get_bhash2_nulladdr_hash(sk, net, port); - head2 = &hinfo->bhash2[hash & (hinfo->bhash_size - 1)]; - - l3mdev = inet_sk_bound_l3mdev(sk); - inet_bind_bucket_for_each(tb2, &head2->chain) - if (check_bind2_bucket_match_nulladdr(tb2, net, port, l3mdev, sk)) - break; - - if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, - reuseport_ok)) - return true; - - return false; + return sk2 != NULL; } /* @@ -274,20 +191,16 @@ static int inet_csk_bind_conflict(const struct sock *sk, int port, * inet_bind_hashbucket lock held. */ static struct inet_bind_hashbucket * -inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, - struct inet_bind2_bucket **tb2_ret, - struct inet_bind2_hashbucket **head2_ret, int *port_ret) +inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *port_ret) { struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo; - struct inet_bind2_hashbucket *head2; + int port = 0; struct inet_bind_hashbucket *head; struct net *net = sock_net(sk); + bool relax = false; int i, low, high, attempt_half; - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; u32 remaining, offset; - bool relax = false; - int port = 0; int l3mdev; l3mdev = inet_sk_bound_l3mdev(sk); @@ -326,12 +239,10 @@ other_parity_scan: head = &hinfo->bhash[inet_bhashfn(net, port, hinfo->bhash_size)]; spin_lock_bh(&head->lock); - tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, - &head2); inet_bind_bucket_for_each(tb, &head->chain) - if (check_bind_bucket_match(tb, net, port, l3mdev)) { - if (!inet_csk_bind_conflict(sk, port, tb, tb2, - relax, false)) + if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev && + tb->port == port) { + if (!inet_csk_bind_conflict(sk, tb, relax, false)) goto success; goto next_port; } @@ -361,8 +272,6 @@ next_port: success: *port_ret = port; *tb_ret = tb; - *tb2_ret = tb2; - *head2_ret = head2; return head; } @@ -458,81 +367,54 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum) { bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN; struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo; - bool bhash_created = false, bhash2_created = false; - struct inet_bind2_bucket *tb2 = NULL; - struct inet_bind2_hashbucket *head2; - struct inet_bind_bucket *tb = NULL; + int ret = 1, port = snum; struct inet_bind_hashbucket *head; struct net *net = sock_net(sk); - int ret = 1, port = snum; - bool found_port = false; + struct inet_bind_bucket *tb = NULL; int l3mdev; l3mdev = inet_sk_bound_l3mdev(sk); if (!port) { - head = inet_csk_find_open_port(sk, &tb, &tb2, &head2, &port); + head = inet_csk_find_open_port(sk, &tb, &port); if (!head) return ret; - if (tb && tb2) - goto success; - found_port = true; - } else { - head = &hinfo->bhash[inet_bhashfn(net, port, - hinfo->bhash_size)]; - spin_lock_bh(&head->lock); - inet_bind_bucket_for_each(tb, &head->chain) - if (check_bind_bucket_match(tb, net, port, l3mdev)) - break; - - tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, - &head2); - } - - if (!tb) { - tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, net, - head, port, l3mdev); if (!tb) - goto fail_unlock; - bhash_created = true; - } - - if (!tb2) { - tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, - net, head2, port, l3mdev, sk); - if (!tb2) - goto fail_unlock; - bhash2_created = true; + goto tb_not_found; + goto success; } - - /* If we had to find an open port, we already checked for conflicts */ - if (!found_port && !hlist_empty(&tb->owners)) { + head = &hinfo->bhash[inet_bhashfn(net, port, + hinfo->bhash_size)]; + spin_lock_bh(&head->lock); + inet_bind_bucket_for_each(tb, &head->chain) + if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev && + tb->port == port) + goto tb_found; +tb_not_found: + tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, + net, head, port, l3mdev); + if (!tb) + goto fail_unlock; +tb_found: + if (!hlist_empty(&tb->owners)) { if (sk->sk_reuse == SK_FORCE_REUSE) goto success; if ((tb->fastreuse > 0 && reuse) || sk_reuseport_match(tb, sk)) goto success; - if (inet_csk_bind_conflict(sk, port, tb, tb2, true, true)) + if (inet_csk_bind_conflict(sk, tb, true, true)) goto fail_unlock; } success: inet_csk_update_fastreuse(tb, sk); if (!inet_csk(sk)->icsk_bind_hash) - inet_bind_hash(sk, tb, tb2, port); + inet_bind_hash(sk, tb, port); WARN_ON(inet_csk(sk)->icsk_bind_hash != tb); - WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2); ret = 0; fail_unlock: - if (ret) { - if (bhash_created) - inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb); - if (bhash2_created) - inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, - tb2); - } spin_unlock_bh(&head->lock); return ret; } @@ -1079,7 +961,6 @@ struct sock *inet_csk_clone_lock(const struct sock *sk, inet_sk_set_state(newsk, TCP_SYN_RECV); newicsk->icsk_bind_hash = NULL; - newicsk->icsk_bind2_hash = NULL; inet_sk(newsk)->inet_dport = inet_rsk(req)->ir_rmt_port; inet_sk(newsk)->inet_num = inet_rsk(req)->ir_num; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 545f91b6cb5e..b9d995b5ce24 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -81,41 +81,6 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep, return tb; } -struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep, - struct net *net, - struct inet_bind2_hashbucket *head, - const unsigned short port, - int l3mdev, - const struct sock *sk) -{ - struct inet_bind2_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC); - - if (tb) { - write_pnet(&tb->ib_net, net); - tb->l3mdev = l3mdev; - tb->port = port; -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - tb->v6_rcv_saddr = sk->sk_v6_rcv_saddr; - else -#endif - tb->rcv_saddr = sk->sk_rcv_saddr; - INIT_HLIST_HEAD(&tb->owners); - hlist_add_head(&tb->node, &head->chain); - } - return tb; -} - -static bool bind2_bucket_addr_match(struct inet_bind2_bucket *tb2, struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - return ipv6_addr_equal(&tb2->v6_rcv_saddr, - &sk->sk_v6_rcv_saddr); -#endif - return tb2->rcv_saddr == sk->sk_rcv_saddr; -} - /* * Caller must hold hashbucket lock for this tb with local BH disabled */ @@ -127,25 +92,12 @@ void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket } } -/* Caller must hold the lock for the corresponding hashbucket in the bhash table - * with local BH disabled - */ -void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb) -{ - if (hlist_empty(&tb->owners)) { - __hlist_del(&tb->node); - kmem_cache_free(cachep, tb); - } -} - void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb, - struct inet_bind2_bucket *tb2, const unsigned short snum) + const unsigned short snum) { inet_sk(sk)->inet_num = snum; sk_add_bind_node(sk, &tb->owners); inet_csk(sk)->icsk_bind_hash = tb; - sk_add_bind2_node(sk, &tb2->owners); - inet_csk(sk)->icsk_bind2_hash = tb2; } /* @@ -157,7 +109,6 @@ static void __inet_put_port(struct sock *sk) const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num, hashinfo->bhash_size); struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash]; - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; spin_lock(&head->lock); @@ -166,13 +117,6 @@ static void __inet_put_port(struct sock *sk) inet_csk(sk)->icsk_bind_hash = NULL; inet_sk(sk)->inet_num = 0; inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb); - - if (inet_csk(sk)->icsk_bind2_hash) { - tb2 = inet_csk(sk)->icsk_bind2_hash; - __sk_del_bind2_node(sk); - inet_csk(sk)->icsk_bind2_hash = NULL; - inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2); - } spin_unlock(&head->lock); } @@ -189,19 +133,14 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child) struct inet_hashinfo *table = sk->sk_prot->h.hashinfo; unsigned short port = inet_sk(child)->inet_num; const int bhash = inet_bhashfn(sock_net(sk), port, - table->bhash_size); + table->bhash_size); struct inet_bind_hashbucket *head = &table->bhash[bhash]; - struct inet_bind2_hashbucket *head_bhash2; - bool created_inet_bind_bucket = false; - struct net *net = sock_net(sk); - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; int l3mdev; spin_lock(&head->lock); tb = inet_csk(sk)->icsk_bind_hash; - tb2 = inet_csk(sk)->icsk_bind2_hash; - if (unlikely(!tb || !tb2)) { + if (unlikely(!tb)) { spin_unlock(&head->lock); return -ENOENT; } @@ -214,45 +153,25 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child) * as that of the child socket. We have to look up or * create a new bind bucket for the child here. */ inet_bind_bucket_for_each(tb, &head->chain) { - if (check_bind_bucket_match(tb, net, port, l3mdev)) + if (net_eq(ib_net(tb), sock_net(sk)) && + tb->l3mdev == l3mdev && tb->port == port) break; } if (!tb) { tb = inet_bind_bucket_create(table->bind_bucket_cachep, - net, head, port, l3mdev); + sock_net(sk), head, port, + l3mdev); if (!tb) { spin_unlock(&head->lock); return -ENOMEM; } - created_inet_bind_bucket = true; } inet_csk_update_fastreuse(tb, child); - - goto bhash2_find; - } else if (!bind2_bucket_addr_match(tb2, child)) { - l3mdev = inet_sk_bound_l3mdev(sk); - -bhash2_find: - tb2 = inet_bind2_bucket_find(table, net, port, l3mdev, child, - &head_bhash2); - if (!tb2) { - tb2 = inet_bind2_bucket_create(table->bind2_bucket_cachep, - net, head_bhash2, port, - l3mdev, child); - if (!tb2) - goto error; - } } - inet_bind_hash(child, tb, tb2, port); + inet_bind_hash(child, tb, port); spin_unlock(&head->lock); return 0; - -error: - if (created_inet_bind_bucket) - inet_bind_bucket_destroy(table->bind_bucket_cachep, tb); - spin_unlock(&head->lock); - return -ENOMEM; } EXPORT_SYMBOL_GPL(__inet_inherit_port); @@ -756,76 +675,6 @@ void inet_unhash(struct sock *sk) } EXPORT_SYMBOL_GPL(inet_unhash); -static bool check_bind2_bucket_match(struct inet_bind2_bucket *tb, - struct net *net, unsigned short port, - int l3mdev, struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && - ipv6_addr_equal(&tb->v6_rcv_saddr, &sk->sk_v6_rcv_saddr); - else -#endif - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && tb->rcv_saddr == sk->sk_rcv_saddr; -} - -bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, - struct net *net, const unsigned short port, - int l3mdev, const struct sock *sk) -{ -#if IS_ENABLED(CONFIG_IPV6) - struct in6_addr nulladdr = {}; - - if (sk->sk_family == AF_INET6) - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && - ipv6_addr_equal(&tb->v6_rcv_saddr, &nulladdr); - else -#endif - return net_eq(ib2_net(tb), net) && tb->port == port && - tb->l3mdev == l3mdev && tb->rcv_saddr == 0; -} - -static struct inet_bind2_hashbucket * -inet_bhashfn_portaddr(struct inet_hashinfo *hinfo, const struct sock *sk, - const struct net *net, unsigned short port) -{ - u32 hash; - -#if IS_ENABLED(CONFIG_IPV6) - if (sk->sk_family == AF_INET6) - hash = ipv6_portaddr_hash(net, &sk->sk_v6_rcv_saddr, port); - else -#endif - hash = ipv4_portaddr_hash(net, sk->sk_rcv_saddr, port); - return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)]; -} - -/* This should only be called when the spinlock for the socket's corresponding - * bind_hashbucket is held - */ -struct inet_bind2_bucket * -inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, - const unsigned short port, int l3mdev, struct sock *sk, - struct inet_bind2_hashbucket **head) -{ - struct inet_bind2_bucket *bhash2 = NULL; - struct inet_bind2_hashbucket *h; - - h = inet_bhashfn_portaddr(hinfo, sk, net, port); - inet_bind_bucket_for_each(bhash2, &h->chain) { - if (check_bind2_bucket_match(bhash2, net, port, l3mdev, sk)) - break; - } - - if (head) - *head = h; - - return bhash2; -} - /* RFC 6056 3.3.4. Algorithm 4: Double-Hash Port Selection Algorithm * Note that we use 32bit integers (vs RFC 'short integers') * because 2^16 is not a multiple of num_ephemeral and this @@ -846,13 +695,10 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row, { struct inet_hashinfo *hinfo = death_row->hashinfo; struct inet_timewait_sock *tw = NULL; - struct inet_bind2_hashbucket *head2; struct inet_bind_hashbucket *head; int port = inet_sk(sk)->inet_num; struct net *net = sock_net(sk); - struct inet_bind2_bucket *tb2; struct inet_bind_bucket *tb; - bool tb_created = false; u32 remaining, offset; int ret, i, low, high; int l3mdev; @@ -909,7 +755,8 @@ other_parity_scan: * the established check is already unique enough. */ inet_bind_bucket_for_each(tb, &head->chain) { - if (check_bind_bucket_match(tb, net, port, l3mdev)) { + if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev && + tb->port == port) { if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) goto next_port; @@ -927,7 +774,6 @@ other_parity_scan: spin_unlock_bh(&head->lock); return -ENOMEM; } - tb_created = true; tb->fastreuse = -1; tb->fastreuseport = -1; goto ok; @@ -943,17 +789,6 @@ next_port: return -EADDRNOTAVAIL; ok: - /* Find the corresponding tb2 bucket since we need to - * add the socket to the bhash2 table as well - */ - tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2); - if (!tb2) { - tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net, - head2, port, l3mdev, sk); - if (!tb2) - goto error; - } - /* Here we want to add a little bit of randomness to the next source * port that will be chosen. We use a max() with a random here so that * on low contention the randomness is maximal and on high contention @@ -963,7 +798,7 @@ ok: WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2); /* Head lock still held and bh's disabled */ - inet_bind_hash(sk, tb, tb2, port); + inet_bind_hash(sk, tb, port); if (sk_unhashed(sk)) { inet_sk(sk)->inet_sport = htons(port); inet_ehash_nolisten(sk, (struct sock *)tw, NULL); @@ -975,12 +810,6 @@ ok: inet_twsk_deschedule_put(tw); local_bh_enable(); return 0; - -error: - if (tb_created) - inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb); - spin_unlock_bh(&head->lock); - return -ENOMEM; } /* diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 9984d23a7f3e..028513d3e2a2 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4604,12 +4604,6 @@ void __init tcp_init(void) SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL); - tcp_hashinfo.bind2_bucket_cachep = - kmem_cache_create("tcp_bind2_bucket", - sizeof(struct inet_bind2_bucket), 0, - SLAB_HWCACHE_ALIGN | SLAB_PANIC | - SLAB_ACCOUNT, - NULL); /* Size and allocate the main established and bind bucket * hash tables. @@ -4632,9 +4626,8 @@ void __init tcp_init(void) if (inet_ehash_locks_alloc(&tcp_hashinfo)) panic("TCP: failed to alloc ehash_locks"); tcp_hashinfo.bhash = - alloc_large_system_hash("TCP bind bhash tables", - sizeof(struct inet_bind_hashbucket) + - sizeof(struct inet_bind2_hashbucket), + alloc_large_system_hash("TCP bind", + sizeof(struct inet_bind_hashbucket), tcp_hashinfo.ehash_mask + 1, 17, /* one slot per 128 KB of memory */ 0, @@ -4643,12 +4636,9 @@ void __init tcp_init(void) 0, 64 * 1024); tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size; - tcp_hashinfo.bhash2 = - (struct inet_bind2_hashbucket *)(tcp_hashinfo.bhash + tcp_hashinfo.bhash_size); for (i = 0; i < tcp_hashinfo.bhash_size; i++) { spin_lock_init(&tcp_hashinfo.bhash[i].lock); INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain); - INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain); } diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index b984f8c8d523..a29f79618934 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -37,4 +37,3 @@ gro ioam6_parser toeplitz cmsg_sender -bind_bhash_test diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 464df13831f2..7ea54af55490 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -59,7 +59,6 @@ TEST_GEN_FILES += toeplitz TEST_GEN_FILES += cmsg_sender TEST_GEN_FILES += stress_reuseport_listen TEST_PROGS += test_vxlan_vnifiltering.sh -TEST_GEN_FILES += bind_bhash_test TEST_FILES := settings @@ -70,5 +69,4 @@ include bpf/Makefile $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -$(OUTPUT)/bind_bhash_test: LDLIBS += -lpthread $(OUTPUT)/tcp_inq: LDLIBS += -lpthread diff --git a/tools/testing/selftests/net/bind_bhash_test.c b/tools/testing/selftests/net/bind_bhash_test.c deleted file mode 100644 index 252e73754e76..000000000000 --- a/tools/testing/selftests/net/bind_bhash_test.c +++ /dev/null @@ -1,119 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * This times how long it takes to bind to a port when the port already - * has multiple sockets in its bhash table. - * - * In the setup(), we populate the port's bhash table with - * MAX_THREADS * MAX_CONNECTIONS number of entries. - */ - -#include -#include -#include -#include - -#define MAX_THREADS 600 -#define MAX_CONNECTIONS 40 - -static const char *bind_addr = "::1"; -static const char *port; - -static int fd_array[MAX_THREADS][MAX_CONNECTIONS]; - -static int bind_socket(int opt, const char *addr) -{ - struct addrinfo *res, hint = {}; - int sock_fd, reuse = 1, err; - - sock_fd = socket(AF_INET6, SOCK_STREAM, 0); - if (sock_fd < 0) { - perror("socket fd err"); - return -1; - } - - hint.ai_family = AF_INET6; - hint.ai_socktype = SOCK_STREAM; - - err = getaddrinfo(addr, port, &hint, &res); - if (err) { - perror("getaddrinfo failed"); - return -1; - } - - if (opt) { - err = setsockopt(sock_fd, SOL_SOCKET, opt, &reuse, sizeof(reuse)); - if (err) { - perror("setsockopt failed"); - return -1; - } - } - - err = bind(sock_fd, res->ai_addr, res->ai_addrlen); - if (err) { - perror("failed to bind to port"); - return -1; - } - - return sock_fd; -} - -static void *setup(void *arg) -{ - int sock_fd, i; - int *array = (int *)arg; - - for (i = 0; i < MAX_CONNECTIONS; i++) { - sock_fd = bind_socket(SO_REUSEADDR | SO_REUSEPORT, bind_addr); - if (sock_fd < 0) - return NULL; - array[i] = sock_fd; - } - - return NULL; -} - -int main(int argc, const char *argv[]) -{ - int listener_fd, sock_fd, i, j; - pthread_t tid[MAX_THREADS]; - clock_t begin, end; - - if (argc != 2) { - printf("Usage: listener \n"); - return -1; - } - - port = argv[1]; - - listener_fd = bind_socket(SO_REUSEADDR | SO_REUSEPORT, bind_addr); - if (listen(listener_fd, 100) < 0) { - perror("listen failed"); - return -1; - } - - /* Set up threads to populate the bhash table entry for the port */ - for (i = 0; i < MAX_THREADS; i++) - pthread_create(&tid[i], NULL, setup, fd_array[i]); - - for (i = 0; i < MAX_THREADS; i++) - pthread_join(tid[i], NULL); - - begin = clock(); - - /* Bind to the same port on a different address */ - sock_fd = bind_socket(0, "2001:0db8:0:f101::1"); - - end = clock(); - - printf("time spent = %f\n", (double)(end - begin) / CLOCKS_PER_SEC); - - /* clean up */ - close(sock_fd); - close(listener_fd); - for (i = 0; i < MAX_THREADS; i++) { - for (j = 0; i < MAX_THREADS; i++) - close(fd_array[i][j]); - } - - return 0; -} -- cgit From 2e7bf4a6af482f73f01245f08b4a953412c77070 Mon Sep 17 00:00:00 2001 From: Yang Yingliang Date: Thu, 16 Jun 2022 14:29:17 +0800 Subject: net: axienet: add missing error return code in axienet_probe() It should return error code in error path in axienet_probe(). Fixes: 00be43a74ca2 ("net: axienet: make the 64b addresable DMA depends on 64b archectures") Reported-by: Hulk Robot Signed-off-by: Yang Yingliang Link: https://lore.kernel.org/r/20220616062917.3601-1-yangyingliang@huawei.com Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c index fa7bcd2c1892..1760930ec0c4 100644 --- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c +++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c @@ -2039,6 +2039,7 @@ static int axienet_probe(struct platform_device *pdev) } if (!IS_ENABLED(CONFIG_64BIT) && lp->features & XAE_FEATURE_DMA_64BIT) { dev_err(&pdev->dev, "64-bit addressable DMA is not compatible with 32-bit archecture\n"); + ret = -EINVAL; goto cleanup_clk; } -- cgit